Issue in preprocessing ontonotes

ghaddarAbs commented 6 years ago

Hi, I am having trouble preprocessing OntoNotes.

First, I used the CoNLL-2012 shared task script to generate the *_gold_conll files, which contain the annotations. For example: conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll. The conll-2012/v4/data directory has 3 subdirectories: train/dev/test.

My question is: how should I group the *_gold_conll files to fit the format that preprocess.py requires? Should I concatenate all the *_gold_conll files into one file each for train/dev/test?

Regards

strubell commented 6 years ago

Thanks for pointing this out! It looks like not all of the OntoNotes preprocessing got properly ported to this repo. I'll try to push a fix later today.


strubell commented 6 years ago

Okay, I've pushed a fix. You should be able to simply run e.g. ./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf as in the readme. Let me know if it works for you.

ghaddarAbs commented 6 years ago

Hi,

Thank you for taking the time to consider my request. The new code has solved part of the problem, but it still needs some modifications.

Here are some suggestions to make the code compatible with the directory structure produced by the skeleton2conll.sh script:

1. conf/ontonotes/ontonotes.conf, line 5: change "dev" to "development" (minor).

2. Change https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/bin/preprocess.sh#L45 to cat `find $raw_data_dir/${data_files[0]} -type f -name \*_gold_conll | grep -v "/pt/nt" | grep "english"` \

   The old command failed to gather all *_gold_conll files.
   - grep -v "/pt/nt" -> skips the New Testament portion (optional).
   - grep "english" -> avoids the Chinese and Arabic files in the data_file directory (e.g. data/train/data/chinese/annotations/wb/e2c/00).

3. Change https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/bin/preprocess.sh#L72 to for this_data_file in `find $raw_data_dir/$filename -type d -links 2 | tail -n +2 | grep -v "/pt/nt" | grep "english"`; do

   - -links 2 -> returns leaf directories only (otherwise src/tsv_to_tfrecords.py raises an error).

4. Change https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/src/tsv_to_tfrecords.py#L421 to if fname.endswith("_gold_conll"): otherwise it will read the _auto_conll files.

I was wondering: why tail -n +2?
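The file selection performed by the two find/grep pipelines above can be approximated in Python, which may help when checking what the shell commands pick up. This is a minimal sketch and not code from the repository: the function names find_gold_conll_files and leaf_directories are hypothetical, and the leaf-directory test assumes that -links 2 is meant to select directories without subdirectories (true on filesystems where a directory's link count is 2 plus the number of its subdirectories).

```python
import os


def find_gold_conll_files(root, language="english", skip_substring="/pt/nt"):
    """Approximate: find root -type f -name '*_gold_conll' | grep -v /pt/nt | grep english."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith("_gold_conll"):
                continue
            # Normalise separators so the substring tests behave like grep on a POSIX path.
            path = os.path.join(dirpath, name).replace(os.sep, "/")
            if language not in path:  # grep "english"
                continue
            if skip_substring and skip_substring in path:  # grep -v "/pt/nt"
                continue
            matches.append(path)
    return sorted(matches)


def leaf_directories(root):
    """Return directories under root that contain no subdirectories."""
    return sorted(
        dirpath.replace(os.sep, "/")
        for dirpath, dirnames, _filenames in os.walk(root)
        if not dirnames
    )
```

Like grep, the substring tests apply to the whole path, so a root directory whose own name contains "english" would match every file beneath it.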

Even with these modifications there are still some issues (mainly in tsv_to_tfrecords.py):

With the current directory structure, and because of the line at https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/src/tsv_to_tfrecords.py#L418, each call to tsv_to_tfrecords.py writes its output under the name of the last directory in the path. Here are some sample directory names:
```
conll-2012/data/train/data/english/annotations/nw/wsj/19
conll-2012/data/train/data/english/annotations/nw/wsj/05
conll-2012/data/train/data/english/annotations/nw/wsj/13 
```
Consequently, tsv_to_tfrecords.py will write 19_sizes.txt, 05_sizes.txt, ... rather than nw_sizes.txt.
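The naming problem can be illustrated with a small sketch. It assumes, as described above, that the output prefix is taken from the last path component; the actual line in tsv_to_tfrecords.py is not reproduced here. Both helper names are hypothetical, and genre_of assumes the genre is the component directly after "annotations", as in the sample paths.

```python
def last_component(path):
    """Last path component, e.g. '19' for .../annotations/nw/wsj/19."""
    return path.rstrip("/").split("/")[-1]


def genre_of(path, marker="annotations"):
    """Component that follows `marker`, e.g. 'nw' for .../annotations/nw/wsj/19."""
    parts = path.rstrip("/").split("/")
    if marker not in parts:
        raise ValueError("no %r component in %r" % (marker, path))
    idx = parts.index(marker)
    if idx + 1 >= len(parts):
        raise ValueError("nothing follows %r in %r" % (marker, path))
    return parts[idx + 1]


sample = "conll-2012/data/train/data/english/annotations/nw/wsj/19"
print(last_component(sample) + "_sizes.txt")  # name derived from the leaf directory
print(genre_of(sample) + "_sizes.txt")        # name derived from the genre
```

Deriving the prefix from the genre component gives one output name per genre no matter how deep the leaf directory is.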

I suggest that the in_file argument of src/tsv_to_tfrecords.py be the document genre path (e.g. conll-2012/data/train/data/english/annotations/nw), and that the line at
https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/src/tsv_to_tfrecords.py#L420

be replaced by

file_list = [y for x in os.walk(in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))]
for fname in file_list:

(this assumes import os and from glob import glob at the top of the file; file_list already holds full paths, so the loop iterates over it directly).

kamalkraj commented 6 years ago

What is the folder structure of the conll-2012 dataset?

strubell commented 6 years ago

@ghaddarAbs thanks for sharing these changes! I forgot that the folder I'm using removed some of the directory structure. Could you submit these as a PR?

ghaddarAbs commented 6 years ago

@strubell done. The code is now compatible with the directory structure produced by skeleton2conll.sh.

Each of the train|dev|test /protos directories now contains 6 files: bc|bn|mz|nw|tc|wb_examples.proto
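A quick way to confirm that a protos directory holds the six files listed above is sketched below. The helper name missing_proto_files is hypothetical, and where the protos directories live depends on the local configuration, so the path passed in is an assumption.

```python
import os

# Genre prefixes taken from the file names listed above.
GENRES = ("bc", "bn", "mz", "nw", "tc", "wb")


def missing_proto_files(protos_dir, genres=GENRES):
    """Return the expected <genre>_examples.proto names that are absent from protos_dir."""
    expected = ["%s_examples.proto" % genre for genre in genres]
    present = set(os.listdir(protos_dir))
    return [name for name in expected if name not in present]
```

An empty list means all six files are present.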

marc88 commented 5 years ago

Hello, can anyone suggest what data processing needs to be done on conll2012 before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf

Currently, simply calling the preprocess.sh script as above does not write anything to the file mentioned below, and I suppose it goes into an infinite loop:

data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from http://conll.cemantix.org/2012/data.html

Edit: I was able to convert the OntoNotes files to CoNLL format, but I am not sure what directory structure the preprocessing script expects. Can you help? The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0

Structure of conll-formatted-ontonotes-5.0:

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts

Regards
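Before running the preprocessing script on a layout like the one above, it can help to confirm that the conversion actually produced *_gold_conll files in each split. The following is a minimal sanity-check sketch, not part of the repository: the function name count_gold_conll is hypothetical, and it assumes the <root>/data/<split>/data/<language>/annotations/... layout shown in the tree. It does not establish which layout preprocess.sh expects.

```python
import os
from collections import Counter


def count_gold_conll(root):
    """Count *_gold_conll files per (split, language) under root/data."""
    counts = Counter()
    data_dir = os.path.join(root, "data")
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        rel = os.path.relpath(dirpath, data_dir).split(os.sep)
        # Only consider <split>/data/<language>/annotations and anything below it.
        if len(rel) < 4 or rel[1] != "data" or rel[3] != "annotations":
            continue
        n = sum(1 for name in filenames if name.endswith("_gold_conll"))
        if n:
            counts[(rel[0], rel[2])] += n
    return dict(counts)
```

A split or language that is missing from the returned dictionary had no *_gold_conll files under its annotations directory.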
