Closed ghaddarAbs closed 6 years ago
Thanks for pointing this out! It looks like not all of the OntoNotes pre-processing got properly ported to this repo. I'll try to push a fix later today.
On Thu, Mar 8, 2018 at 4:56 PM ghaddarAbs notifications@github.com wrote:
Hi, I am facing some issues with preprocessing OntoNotes.
First I used the conll-2012 shared task scripts to generate the *_gold_conll files, which contain the annotations. For example:
conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll. In the conll-2012/v4/data directory I have 3 subdirectories: train/dev/test.
My question is: how should I group the *_gold_conll files in order to fit the required format of preprocess.py?
Regards
— Reply to this email directly or view it on GitHub: https://github.com/iesl/dilated-cnn-ner/issues/8
Okay, I've pushed a fix. You should be able to simply run e.g. ./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
as in the readme. Let me know if it works for you.
Hi,
Thank you for taking the time to consider my request. The new code has solved part of the problem, but it still needs a few modifications.

Here are some suggestions to make the code compatible with the directory structure produced by the skeleton2conll.sh script:

1. In conf/ontonotes/ontonotes.conf, line 5: change "dev" to "development" (minor).

2. Change
https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/bin/preprocess.sh#L45
to
cat `find $raw_data_dir/${data_files[0]} -type f -name \*_gold_conll | grep -v "/pt/nt" | grep "english"` \
where:
- -name \*_gold_conll -> read only the *_gold_conll files.
- grep -v "/pt/nt" -> skip the New Testament portion (optional).
- grep "english" -> avoid the Chinese and Arabic files in the data directory (e.g. data/train/data/chinese/annotations/wb/e2c/00).

3. Change the corresponding directory loop in bin/preprocess.sh to
for this_data_file in `find $raw_data_dir/$filename -type d -links 2 | tail -n +2 | grep -v "/pt/nt" | grep "english"`; do
where:
- -links 2 -> get leaf directories only (otherwise it raises an error in src/tsv_to_tfrecords.py).

4. Change
https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/src/tsv_to_tfrecords.py#L421
to
if fname.endswith("_gold_conll"):
otherwise it will read the _auto_conll files.

By the way, I was wondering: why tail -n +2?
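For reference, the find/grep filtering above can be mirrored in Python. This is a minimal sketch, not code from the repo; select_gold_conll and the tiny on-disk fixture are hypothetical, used only to illustrate the three filters (gold-only, no /pt/nt, English-only):

```python
import os
import tempfile

def select_gold_conll(root):
    """Collect English *_gold_conll paths, skipping the /pt/nt portion."""
    keep = []
    for dirpath, _dirnames, filenames in os.walk(root):
        path = dirpath.replace(os.sep, "/")
        # Mirror: grep -v "/pt/nt" | grep "english"
        if "/pt/nt" in path or "/english/" not in path + "/":
            continue
        for fname in filenames:
            # Mirror: -name \*_gold_conll
            if fname.endswith("_gold_conll"):
                keep.append(os.path.join(dirpath, fname))
    return sorted(keep)

# Tiny fixture to illustrate which files survive the filtering.
root = tempfile.mkdtemp()
for rel in [
    "english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll",
    "english/annotations/pt/nt/40/nt_4010.v4_gold_conll",    # skipped: New Testament
    "chinese/annotations/wb/e2c/00/e2c_0000.v4_gold_conll",  # skipped: not English
    "english/annotations/bn/cnn/03/cnn_0301.v4_auto_conll",  # skipped: not gold
]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

print([os.path.basename(p) for p in select_gold_conll(root)])
```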
Even with these modifs there are still some issues (mainly in tsv_to_tfrecords.py):

When tsv_to_tfrecords.py is called, it will write an output with the name of the last directory. Here are samples of the directory names:
conll-2012/data/train/data/english/annotations/nw/wsj/19
conll-2012/data/train/data/english/annotations/nw/wsj/05
conll-2012/data/train/data/english/annotations/nw/wsj/13
Consequently tsv_to_tfrecords.py will write 19_sizes.txt, 05_sizes.txt, ... rather than nw_sizes.txt.

I suggest that the in_file of src/tsv_to_tfrecords.py be the document genre path (e.g. conll-2012/data/train/data/english/annotations/nw) and replacing
https://github.com/iesl/dilated-cnn-ner/blob/0b4955a7338b0a6f2bf40c9e560de885060dcd06/src/tsv_to_tfrecords.py#L420
with
file_list = [y for x in os.walk(in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))]
for fname in file_list:
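The os.walk + glob suggestion above can be checked end to end. A self-contained sketch (the on-disk mini-fixture and the genre variable are hypothetical, not repo code), showing that naming the output after the genre directory yields nw_sizes.txt rather than a leaf name like 19_sizes.txt:

```python
import os
import tempfile
from glob import glob

# Hypothetical fixture mimicking annotations/nw/wsj/{19,05,13}.
root = tempfile.mkdtemp()
for rel in ["nw/wsj/19/wsj_1900.v4_gold_conll",
            "nw/wsj/05/wsj_0500.v4_gold_conll",
            "nw/wsj/13/wsj_1300.v4_gold_conll"]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

in_file = os.path.join(root, "nw")  # genre directory, e.g. .../annotations/nw

# The proposed replacement: gather every *_gold_conll file recursively.
file_list = [y for x in os.walk(in_file)
             for y in glob(os.path.join(x[0], "*_gold_conll"))]

# Output name derived from the genre, not from a leaf directory.
genre = os.path.basename(in_file.rstrip(os.sep))
print(genre + "_sizes.txt", len(file_list))  # -> nw_sizes.txt 3
```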
What is the folder structure of the conll-2012 dataset?
@ghaddarAbs thanks for sharing these changes! I forgot that the folder I'm using removed some of the directory structure. Could you submit these as a PR?
@strubell done. Now the code is compatible with the directory structure produced by skeleton2conll.sh.
Each of the train/dev/test protos directories now contains 6 files: bc|bn|mz|nw|tc|wb_examples.proto
Hello, can anyone suggest the data processing to be done on conll-2012 before calling the following?
./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above does not write anything to the file mentioned below, and I suspect it goes into an infinite loop:
data/vocabs/ontonotes_cutoff_4.txt
I've downloaded the train v4, dev v4 and test v9 tarballs from http://conll.cemantix.org/2012/data.html
Edit: I could convert the OntoNotes files successfully to conll format, but I am not sure of the directory structure needed to trigger the preprocessing script. Can you help? The following is my directory structure under $DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0:
conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts
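Assuming the tree above, a small hypothetical sanity check (check_ontonotes_layout is not part of the repo) that each split contains an english/annotations directory before running preprocess.sh:

```python
import os
import tempfile

def check_ontonotes_layout(root):
    """Return the splits that contain an english/annotations directory."""
    ok = []
    for split in ("train", "development", "test"):
        if os.path.isdir(os.path.join(root, "data", split, "data",
                                      "english", "annotations")):
            ok.append(split)
    return ok

# Hypothetical fixture reproducing the layout shown above.
root = tempfile.mkdtemp()
for split in ("train", "development", "test"):
    os.makedirs(os.path.join(root, "data", split, "data",
                             "english", "annotations"))

print(check_ontonotes_layout(root))
```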
Regards
Hi, I am facing some issues with preprocessing OntoNotes.
First I used the conll-2012 shared task scripts to generate the *_gold_conll files, which contain the annotations. For example:
conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll
In the conll-2012/v4/data directory I have 3 subdirectories: train/dev/test.
My question is: how should I group the *_gold_conll files in order to fit the required format of preprocess.py? Should I concatenate all *_gold_conll files into one file for each of train/dev/test?
Regards