nyu-mll / jiant

jiant is an NLP toolkit
https://jiant.info
MIT License

ZeroDivisionError on OntoNotes dataset in edge-probing task #950

Closed minstar closed 4 years ago

minstar commented 4 years ago

Hi, I have a problem with preprocessing for the edge-probing tasks. I correctly downloaded "conll-formatted-ontonotes-5.0", but in the preprocessing step stats["count"] is zero, so I cannot proceed. Below is my error log:

File "/path-to-jiant/probing/data/utils.py", line 83, in to_series
    s["token.mean_count"] = stats["token.count"] / stats["count"] 
ZeroDivisionError: division by zero
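
If I read probing/data/utils.py correctly, the failing statistic is a per-sentence mean, so an empty dataset iterator makes the denominator zero. A minimal sketch of what I think happens (the names besides the two in the traceback are my guesses):

# Sketch: if the OntoNotes reader yields no sentences, stats["count"]
# never increments and the mean computation divides by zero.
sentences = iter([])  # what the dataset iterator appears to yield for me

stats = {"count": 0, "token.count": 0}
for sentence in sentences:
    stats["count"] += 1
    stats["token.count"] += len(sentence.words)

s = {}
s["token.mean_count"] = stats["token.count"] / stats["count"]  # ZeroDivisionError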

I followed all the steps in the README.md in the edge-probing data folder, so I think AllenNLP's OntoNotes dataset iterator isn't working; it doesn't generate any sentences.

Is there any way I can solve this?

liebscher commented 4 years ago

I am getting the same error despite following the README as closely as possible.

Just pulled the repo the other day.

pruksmhc commented 4 years ago

@iftenney

iftenney commented 4 years ago

I'm out of the office now, but will debug this later this week.

minstar commented 4 years ago

Is there any chance that I could try this now?

iftenney commented 4 years ago

Sorry for the delay. Working on this; if you need a copy in the meantime, please email me (iftenney -at- gmail).

iftenney commented 4 years ago

Can you give the exact command you used, and any output before the error message?

I'm not able to reproduce. Tested the following:

# set up a fresh environment
git clone --recursive git@github.com:nyu-mll/jiant.git jiant
cd jiant
conda env create -f environment.yml
conda activate jiant

# process each OntoNotes task
python probing/data/extract_ontonotes_all.py --ontonotes ~/data/conll-formatted-ontonotes-5.0 --tasks=coref -o /tmp/onto_coref
python probing/data/extract_ontonotes_all.py --ontonotes ~/data/conll-formatted-ontonotes-5.0 --tasks=ner -o /tmp/onto_ner
python probing/data/extract_ontonotes_all.py --ontonotes ~/data/conll-formatted-ontonotes-5.0 --tasks=const -o /tmp/onto_const
python probing/data/extract_ontonotes_all.py --ontonotes ~/data/conll-formatted-ontonotes-5.0 --tasks=srl -o /tmp/onto_srl

and didn't see a crash.
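
If those commands still crash for you, it would help to check whether the AllenNLP reader finds any sentences at all. A rough sketch against the allennlp 0.8.4 API (adjust the path to your copy):

# Count the sentences the OntoNotes corpus reader yields; zero sentences
# would explain the ZeroDivisionError in the stats aggregation.
from allennlp.data.dataset_readers.dataset_utils import Ontonotes

ontonotes_path = "/path/to/conll-formatted-ontonotes-5.0/data/train"  # adjust
reader = Ontonotes()
n_sentences = sum(1 for _ in reader.dataset_iterator(ontonotes_path))
print("sentences found:", n_sentences)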

minstar commented 4 years ago

My allennlp version is the same as yours, and I changed the path to OntoNotes in get_and_process_all_data.sh as you did. Also, my conll-formatted-ontonotes-5.0 contains data/conll-2012-test, data/development, data/train, and data/test, and each split has annotations under the bc, bn, mz, nw, pt, tc, and wb folders. However, I tried the first extract_ontonotes_all.py command,

python probing/data/extract_ontonotes_all.py --ontonotes ~/data/conll-formatted-ontonotes-5.0 --tasks=coref -o /tmp/onto_coref

but it shows the same error as above.

iftenney commented 4 years ago

Does your OntoNotes folder have the .conll files?

cd conll-formatted-ontonotes-5.0
ls data/train/data/english/annotations/bc/cctv/00

Should look something like:

cctv_0001.gold_conll  cctv_0002.gold_skel   cctv_0004.gold_conll
cctv_0001.gold_skel   cctv_0003.gold_conll  cctv_0004.gold_skel
cctv_0002.gold_conll  cctv_0003.gold_skel

The LDC corpus doesn't include the .conll files by default; they have to be generated by step 3 from http://cemantix.org/data/ontonotes.html.

(Also see the AllenNLP corpus reader documentation at https://github.com/allenai/allennlp/blob/v0.8.4/allennlp/data/dataset_readers/dataset_utils/ontonotes.py#L83)
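
A quick way to check from Python (a sketch; substitute your own path):

# Count *gold_conll files under the corpus root; zero means step 3
# (skeleton2conll) from the cemantix.org instructions hasn't been run yet.
import pathlib

root = pathlib.Path("~/data/conll-formatted-ontonotes-5.0").expanduser()
print("gold_conll files:", sum(1 for _ in root.glob("**/*gold_conll")))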

minstar commented 4 years ago

Oh, now I've solved my problem. Thanks a lot!!

iftenney commented 4 years ago

Great, glad that helped! Please let me know if you have any other questions.

sugeeth14 commented 4 years ago

Hi, I too ran into the same issue. I am trying to do step 3 from http://cemantix.org/data/ontonotes.html, but where can I download the skeleton2conll.sh script? The download link on the page is either missing or broken. Is there any solution?

sugeeth14 commented 4 years ago

Okay, so after some searching I found the scripts here; it seems the download link on the website is broken.

sugeeth14 commented 4 years ago

Hi, I found the scripts, but I am running into an issue saying:

could not find the gold parse [conll-formatted-ontonotes-5.0/data/train/data/english/annotations/mz/sinorama/10/ectb_1031.parse] in the ontonotes distribution ... exiting ...

These are the steps I did:

  1. Downloaded v12 from here.

  2. Uncompressed the file to get conll-formatted-ontonotes-5.0-12, which contained conll-formatted-ontonotes-5.0.

  3. Ran ls conll-formatted-ontonotes-5.0/data/train/data/english/annotations/bc/cctv/00 and got cctv_0001.gold_skel cctv_0002.gold_skel cctv_0003.gold_skel cctv_0004.gold_skel, which didn't include the .conll files, so I tried following step 3 from http://cemantix.org/data/ontonotes.html by downloading the scripts from https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO/tree/master/conll-formatted-ontonotes-5.0/scripts and placing them in the conll-formatted-ontonotes-5.0-12/scripts folder.

  4. Tried running bash conll-formatted-ontonotes-5.0/scripts/skeleton2conll.sh -D conll-formatted-ontonotes-5.0/data/train/data/ conll-formatted-ontonotes-5.0/ but am getting the above error.

Am I doing anything wrong here? Please help @minstar
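
For what it's worth, this is how I checked where the missing .parse file lives (a sketch using my paths; judging from the paths skeleton2conll.sh passes around, the -D flag may need to point at the original LDC ontonotes-release-5.0 data rather than the conll-formatted tree):

# Check where the gold parse the error mentions actually exists; the
# conll-formatted skeletons don't ship the .parse files themselves.
import pathlib

rel = "english/annotations/mz/sinorama/10/ectb_1031.parse"
for root in ["conll-formatted-ontonotes-5.0/data/train/data",
             "ontonotes-release-5.0/data/files/data"]:
    p = pathlib.Path(root) / rel
    print(p, "exists:", p.exists())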

lovodkin93 commented 4 years ago

So I used the scripts Raghava14 provided, which are described at http://cemantix.org/data/ontonotes.html, and got the following error:

  File "../../conll-formatted-ontonotes-5.0/scripts/skeleton2conll.py", line 392
    except InvalidSexprException, e:
                                ^
SyntaxError: invalid syntax
Exit code: 1
./skeleton2conll.sh: line 93: break: only meaningful in a `for', `while', or `until' loop
-> python ../../conll-formatted-ontonotes-5.0/scripts/skeleton2conll.py ../../../ontonotes-release-5.0/data/files/data/english/annotations/mz/sinorama/10/ectb_1029.parse ../../conll-formatted-ontonotes-5.0/data/conll-2012-test/data/english/annotations/mz/sinorama/10/ectb_1029.gold_skel ../../conll-formatted-ontonotes-5.0/data/conll-2012-test/data/english/annotations/mz/sinorama/10/ectb_1029.gold_conll -edited --text
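
I suspect the script itself is written for Python 2: except InvalidSexprException, e: is Python 2 syntax, which a Python 3 interpreter rejects at parse time. A minimal illustration of the two spellings (with a stand-in exception, since InvalidSexprException is defined inside the script):

# Python 2 spelling, as in skeleton2conll.py line 392 (Python 3 rejects
# it with exactly this SyntaxError):
#     except InvalidSexprException, e:
# Python 3 spells the same handler with "as":
try:
    raise ValueError("stand-in for InvalidSexprException")
except ValueError as e:
    print("caught:", e)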

Has anyone encountered this error and can help me solve it? Thanks! @iftenney