ringgaard / sling

SLING - A natural language frame semantics parser
Apache License 2.0

How to run the silver annotation pipeline #5

Open foolfun opened 3 years ago

foolfun commented 3 years ago

I ran the DrKIT code, which reads 'sling/local/data/distant/facts-0000%d-of-00010.json'. How do I get these files?

Deriq-Qian-Dong commented 3 years ago

+1

ringgaard commented 3 years ago

The "silver annotation pipeline" is not yet properly documented as it is still under development, but you should be able to run it. First you run the wiki pipeline as described here. Then you need to build an IDF table using this command:

sling build_idf

Then you can run silver annotation on all the Wikipedia articles:

sling silver_annotation

It takes quite a while to run the silver annotation pipeline (10 hours on my machine). Please let me know if this works for you.
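
The order of the steps above can be sketched as a small Python driver. This is only an illustration, assuming the `sling` command is on the PATH. The stage names are the ones used in this thread (`build_wiki`, `build_idf`, `silver_annotation`); the helpers `stage_commands` and `run_stages` are hypothetical and not part of SLING.

```python
import subprocess

# Stage names as used in this thread; each one is passed to the `sling` command.
STAGES = ["build_wiki", "build_idf", "silver_annotation"]

def stage_commands(stages=STAGES):
    """Return one `sling <stage>` command line per stage, in order."""
    return [["sling", stage] for stage in stages]

def run_stages(stages=STAGES, runner=subprocess.run):
    """Run the stages one at a time and stop at the first failure."""
    for cmd in stage_commands(stages):
        print("running:", " ".join(cmd))
        # With check=True, subprocess.run raises CalledProcessError
        # when the command exits with a non-zero status.
        runner(cmd, check=True)
```

Calling `run_stages()` would run all three stages in sequence; `run_stages(["build_idf", "silver_annotation"])` would skip the wiki pipeline if it has already completed.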

Deriq-Qian-Dong commented 3 years ago

Hi, thanks for your reply. When I run sling silver_annotation, I get this error message: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0. Do you have any idea what causes this error?

ringgaard commented 3 years ago

I remember having seen this error before. Let me check if there are some changes from the dev branch that I haven't submitted to the master branch.

foolfun commented 3 years ago

Thanks for your reply! When I ran 'sling fuse_items', I hit the problem described in https://github.com/ringgaard/sling/issues/4. I have no idea why it happened. Can you help me?

ringgaard commented 3 years ago

It seems like I will have to do a complete test run of the wiki and silver annotation pipelines. I run these in a slightly different mode using wiki snapshots to get a wikidata dump and the reconciler for fusing items. It seems like there is some bug in the old pipeline.

You should check if you have enough disk space. You will need something like 500 GB of free space on your hard drive, including your temp directory (usually /tmp). There have been reports that out-of-disk-space conditions are not always reported correctly. You should also check that you don't have a bunch of temp files left over from runs that crashed. You can remove old temp files using this command:

rm -r /tmp/local.*
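
The checks described above can be sketched in Python using only the standard library. This is a rough illustration, not part of SLING: the helper names are hypothetical, the 500 GB figure is the estimate given above, and the script only looks at the file system that holds the temp directory, so other drives used by the pipeline would need to be checked separately.

```python
import glob
import os
import shutil
import tempfile

def free_gb(path):
    """Free space in GB on the file system that holds `path`."""
    return shutil.disk_usage(path).free / 1e9

def stale_temp_entries(tmpdir=None):
    """List leftover local.* entries in the temp directory (TMPDIR or /tmp)."""
    tmpdir = tmpdir or tempfile.gettempdir()
    return sorted(glob.glob(os.path.join(tmpdir, "local.*")))

def report(required_gb=500, tmpdir=None):
    """Print free space and leftover temp entries; return True if space suffices."""
    tmpdir = tmpdir or tempfile.gettempdir()
    free = free_gb(tmpdir)
    print("%s: %.0f GB free (want about %d GB)" % (tmpdir, free, required_gb))
    for entry in stale_temp_entries(tmpdir):
        print("leftover temp entry:", entry)
    return free >= required_gb
```

`tempfile.gettempdir()` honors the `TMPDIR` environment variable, so `report()` inspects the same directory the pipeline would use if `TMPDIR` is exported first.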

It is going to take a while to rerun the pipelines, so please be patient. I will try to do this over the weekend. I have a server upgrade on Sunday which will also delay this.

foolfun commented 3 years ago

Thank you so much. I will take your advice to try it again.

foolfun commented 3 years ago

I have enough disk space, removed the old temp files, and ran the commands below. However, it seems that I hit the same problem again. Looking forward to your reply.

export TMPDIR=/mnt/hdd1/tmp

sling build_wiki --lbzip2 --languages en

image

Deriq-Qian-Dong commented 3 years ago

@foolfun try sling build_wiki without the --lbzip2 and --languages flags. It works for me.

ringgaard commented 3 years ago

I think I managed to fix the error that caused fuse_items to crash, so if you sync to HEAD you should be able to run the wiki pipeline. See this commit.

You can just resume from the fuse_items stage, so you don't need to re-run the whole wiki pipeline again:

sling fuse_items build_kb extract_names build_nametab build_phrasetab
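
As a small sketch of the same idea, the command for resuming at any later stage can be built from the stage list above. The helper `resume_command` is hypothetical, and it assumes the stages must run in the order shown, which the command above suggests but this thread does not state explicitly.

```python
# Wiki pipeline stages in the order given in the command above.
WIKI_TAIL = ["fuse_items", "build_kb", "extract_names",
             "build_nametab", "build_phrasetab"]

def resume_command(first_stage, stages=WIKI_TAIL):
    """Build a `sling` command that runs `first_stage` and everything after it."""
    if first_stage not in stages:
        raise ValueError("unknown stage: " + first_stage)
    return ["sling"] + stages[stages.index(first_stage):]

print(" ".join(resume_command("fuse_items")))
```

For example, `resume_command("build_nametab")` yields only the last two stages.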

Next, I will try to see if I can reproduce the CHECK fault in the silver annotation pipeline: [2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0

foolfun commented 3 years ago

It works! I have been struggling with this issue for nearly two weeks. Thank you very much!

foolfun commented 3 years ago

I have run the silver annotation pipeline, and the result is shown in the following picture. However, I still cannot find 'local/data/e/silver/en/silver-00000-of-00010.rec'. I don't know whether I missed some important steps. Can you help me? image

By the way, the files I can find are: image

ringgaard commented 3 years ago

The output looks correct. The silver-annotated Wikipedia documents are in train-*.rec and eval.rec. Together these contain all the Wikipedia articles. They are split into train and eval because I use this data as noisy training data for the semantic parser.

NB: I did a complete run of the silver annotation pipeline and did not get the Check failed: num >= 0 error. This error could be due to Wikidata errors in the date items. My version of Wikidata is from Nov 25.

foolfun commented 3 years ago

I cannot find 'e/silver/en/silver-0000%d-of-00010.rec'. How can I get this file?

ringgaard commented 3 years ago

From where did you get the impression that the silver annotations should be in e/silver/en/silver-0000%d-of-00010.rec?

The silver annotations are in local/data/e/silver/en/train-?????-of-00010.rec and local/data/e/silver/en/eval.rec. You can take a look at the data with the codex tool:

bin/codex local/data/e/silver/en/train-00000-of-00010.rec | less

Each record is a Wikipedia article and contains the title, the raw text, the tokens, and the mentions with evoked frames.

Deriq-Qian-Dong commented 3 years ago

Hi, Ringgaard, I'm using google/sling to get the silver annotations, and I get the error "Check failed: num >= 0". I can't build SLING from your repository because I don't have sudo rights. Do you have any idea how to deal with this?

foolfun commented 3 years ago

Sorry, I am trying to run distantly_supervise.py, which needs silver-0000%d-of-00010.rec in line 543. That is why I wanted to ask you how to get this file.

I tried replacing silver-0000%d-of-00010.rec with train-0000%d-of-00010.rec, but then kb_item in line 348 is None, so I guess this approach does not work. Do you have any idea how to deal with this?

Deriq-Qian-Dong commented 3 years ago

image

When using google/sling, I got the silver-* files, but they are not correct because the processing did not complete.

ringgaard commented 3 years ago

The problem seems to be that the distantly_supervise.py script expects the silver data to be indexed by QIDs but the silver pipeline assigns random keys in order to shuffle the data set for training. How many documents do you need to extract? Is it all of them or just a small subset?

Deriq-Qian-Dong commented 3 years ago

all of them, I think

ringgaard commented 3 years ago

There are basically two solutions: either take the train and eval files and reindex them, or make a new silver workflow that is compatible with the old mode.

Let me first check out how difficult it would be to make a custom silver workflow that produces the output that distantly_supervise.py expects.
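
The key mismatch behind the first option can be illustrated with a toy sketch. The records here are plain Python dicts, not SLING documents, and the `qid` field is a made-up stand-in; where the QID actually lives in a silver document is an assumption not confirmed in this thread.

```python
def reindex_by_qid(records, qid_of):
    """Re-key (key, document) pairs by the QID extracted from each document."""
    indexed = {}
    for _, doc in records:
        qid = qid_of(doc)
        if qid is not None:
            indexed[qid] = doc
    return indexed

# Toy stand-ins for silver records: random shuffle keys, QID stored in the document.
records = [("8f3a", {"qid": "Q42", "title": "Douglas Adams"}),
           ("02c1", {"qid": "Q64", "title": "Berlin"})]

shuffled = dict(records)
by_qid = reindex_by_qid(records, lambda doc: doc.get("qid"))

print(shuffled.get("Q42"))      # None: a lookup by QID misses the random key
print(by_qid["Q42"]["title"])   # Douglas Adams
```

A lookup by QID against the shuffled keys returns nothing, which would match the `None` result reported for `kb_item` earlier in this thread, although that connection is only a guess.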

Deriq-Qian-Dong commented 3 years ago

Thanks a lot!

ringgaard commented 3 years ago

With the Python script below you should be able to produce the silver-*.rec output that should be compatible with distantly_supervise.py:

import sling
import sling.flags as flags
import sling.log as log
import sling.task.workflow as workflow
import sling.task.wiki as wiki
import sling.task.corpora as corpora

# Parse command-line flags and start up the workflow system.
flags.parse()
workflow.startup()

language = flags.arg.language
workdir = flags.arg.workdir + "/silver/" + language

wf = workflow.Workflow("silver")
wikiwf = wiki.WikiWorkflow(wf=wf)

# Input is the parsed Wikipedia documents; output is silver-*.rec in 10 shards.
indocs = wikiwf.wikipedia_documents(language)
outdocs = wf.resource("silver@10.rec", dir=workdir, format="records/document")
idf = wf.resource("idf.repo", dir=workdir, format="repository")

# Configuration of stopwords and blacklists, plus the phrase list, from the repo.
config = corpora.repository("data/wiki/" + language + "/silver.sling")
phrases = corpora.repository("data/wiki/" + language) + "/phrases.txt"

# Document processor that runs the silver annotators on each document.
mapper = wf.task("document-processor", "labeler")
mapper.add_annotator("mentions")
mapper.add_annotator("anaphora")
mapper.add_annotator("phrase-structure")
mapper.add_annotator("relations")

mapper.add_param("resolve", True)
mapper.add_param("language", language)

# Knowledge base and silver configuration are both attached as commons inputs.
mapper.attach_input("commons", wikiwf.knowledge_base())
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

mapper.attach_input("aliases", wikiwf.phrase_table(language))
mapper.attach_input("dictionary", idf)
mapper.attach_input("phrases", wf.resource(phrases, format="lex"))

# Read the input documents, annotate them, and write the output records.
wf.connect(wf.read(indocs), mapper)
output = wf.channel(mapper, format="message/document")
wf.write(output, outdocs)

workflow.run(wf)
workflow.shutdown()

You can check the output with this command:

bin/codex --lex local/data/e/silver/en/silver*

ringgaard commented 3 years ago

Hmm... My test run seems to indicate that the script above does not read the stopwords and blacklists correctly, resulting in many spammy annotations. Let me try to fix this.

Deriq-Qian-Dong commented 3 years ago

image

Hmm... When I ran this script, I got the same error.

ringgaard commented 3 years ago

Is there a stack trace below the "Check failed:" line?

Deriq-Qian-Dong commented 3 years ago

(core dumped)

ringgaard commented 3 years ago

The CHECK fault indicates that some invalid date is being processed. You could just comment out the CHECK in line 41 of calendar.cc. It would cause some invalid dates in the output annotations, but without further information, I don't know how to fix this.

ringgaard commented 3 years ago

I have updated the Python script above to include the configuration of stopwords and blacklists. The following lines were missing:

config = corpora.repository("data/wiki/" + language + "/silver.sling")
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

This should remove a lot of spammy annotations for common words and phrases.

foolfun commented 3 years ago

Hi, ringgaard! When I ran the script, I met this problem: image

ringgaard commented 3 years ago

I assume that you run the script from the root directory of the git repo. The silver.sling file is checked into the master branch of the repo here. I don't understand why you don't have this file.

foolfun commented 3 years ago

Hi, ringgaard! I have run the script, but I got an error: image

ringgaard commented 3 years ago

I haven't been able to reproduce the Check failed: num >= 0 error yet, so I think the best option for now is to replace line 41 in sling/nlp/kb/calendar.cc with:

    DCHECK(num >= 0);

and rebuild the code (tools/buildall.sh). This could result in some bad date annotations, but it would allow you to get on with producing the silver annotations.

foolfun commented 3 years ago

Hello ringgaard. Sorry to bother you. I tried many times, but the same error still occurs.

ringgaard commented 3 years ago

@foolfun: Are you still getting the same CHECK fault in sling/nlp/kb/calendar.cc line 41 although you have changed it to a DCHECK, which is only checked in debug mode? Are you sure you recompiled the code using tools/buildall.sh?

foolfun commented 3 years ago

Yes, I did these steps and the same error happened. I found that the error may be related to my SLING Python API installation and Python environment. I reconfigured them, and the script has now been running for half an hour without any error. Thank you for your patience.

ringgaard commented 3 years ago

It can sometimes be confusing which version of the SLING Python API you are using if you switch between using pip and downloading and installing the code yourself. You can check where it is installed like this:

$ python3 -c "import sling; print(sling)"
<module 'sling' from '/usr/lib/python3/dist-packages/sling/__init__.py'>
$ ls -l /usr/lib/python3/dist-packages/sling
lrwxrwxrwx 1 root root 26 Oct  1 17:54 /usr/lib/python3/dist-packages/sling -> /home/michael/sling/python

In "developer mode" it is important that the python package directory (/usr/lib/python3/dist-packages/sling) is a symlink to the python directory in your repository directory (/home/michael/sling/python). Otherwise, recompilation has no effect.
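
The same check can be done from Python with the standard library alone. This is a minimal sketch; `package_location` is a hypothetical helper, and it assumes a regular package that has an `__init__.py`.

```python
import importlib.util
import os

def package_location(name):
    """Return (package directory, is_symlink, resolved path) for an importable package."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ImportError("cannot locate package: " + name)
    pkg_dir = os.path.dirname(spec.origin)
    return pkg_dir, os.path.islink(pkg_dir), os.path.realpath(pkg_dir)
```

Running `print(package_location("sling"))` in the environment used for the pipeline shows whether the package directory is a symlink and which directory it resolves to; in developer mode the resolved path should be the python directory of the repository.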

foolfun commented 3 years ago

Got it! Thank you!

ringgaard commented 3 years ago

Since the distantly_supervise.py script does random lookups in the silver data set, it might be useful to index this to make it faster:

bin/index local/data/e/silver/en/silver*

PS: My run of the silver pipeline completed successfully.

ringgaard commented 3 years ago

I have made prebuilt versions of the knowledge base and alias tables available on the ringgaard.com web site. You can use the sling command to download these, e.g.:

sling fetch --dataset kb,phrasetab