Closed by SeekPoint 5 years ago
Any hope of releasing the ACE → CoNLL preprocessor?
Hey, I am attaching the very basic script I used for pre-processing: https://gist.github.com/VictorSanh/6cfce8bad8a80d3ba1cd1c95aba2216d
It is a simple adaptation of this data processor from Miwa and Bansal: https://github.com/tticoin/LSTM-ER/tree/master/data/ace2005
Thanks !
Hello, thanks for raising this question.
We used pre-trained word embeddings (GloVe and ELMo). You can use the script `scripts/data_setup.sh` to download them and place them in a `data` folder.
Other datasets are also expected to be in the `data` folder (see the paths in the configuration files `configs/*.json`). For instance, we compile the CoNLL2012 coreference data using this script from AllenNLP: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh
It compiles the CoNLL2012 data and dumps the coreference annotations into a single file. For NER CoNLL, it is basically the same data as for coreference, except that it has not been dumped into a single file (we can probably do something quick to avoid this data duplication).
Concerning the ACE data, we pre-process them so that the Mention Detection data match a CoNLL-NER format and the Relation Extraction data match a CoNLL-SRL format. Both are saved in a `data/ace2005` folder.
If you want to use other datasets, it makes sense to place them in the `data` folder and use (or modify) the `dataset_readers` classes.
Victor
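To give a rough idea of what reading a CoNLL-NER style file involves, here is a minimal sketch. It assumes one token per line with whitespace-separated columns, the token in the first column and the tag in the last, and a blank line between sentences. The exact column layout produced by the pre-processing may differ, and `read_conll_columns` is a hypothetical helper for illustration, not one of the repository's `dataset_readers` classes.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conll_columns(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumption (not confirmed by the repository): one token per line,
    whitespace-separated columns, token first, tag last, and blank
    lines separating sentences.
    """
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    # Emit the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


# Made-up sample data, only to show the expected shape of the input.
sample = """\
John B-PER
lives O
in O
Paris B-GPE

He O
left O
"""
sentences = list(read_conll_columns(sample.splitlines()))
```

With a real file, the same function could be fed an open file handle instead of `sample.splitlines()`.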
Hi, I want to reproduce your NER result, but I ran into a problem when setting up the CoNLL2012 data.
I used this script: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh
However, it complained that there is no .parse file in the folder:
could not find the gold parse [.//data/files/data/english/annotations/bc/cctv/00/cctv_0001.parse] in the ontonotes distribution ... exiting ...
cat: 'conll-2012/v4/data/development/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/train/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/test/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
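One way to narrow this down is to check whether the files named in the messages above exist at all before re-running the script. The sketch below only counts files by suffix under a directory. The two root paths are guesses taken from the log output, and `count_by_suffix` is a hypothetical helper, so adjust both to your local setup.

```python
from pathlib import Path
from typing import Dict, Iterable


def count_by_suffix(root: Path, suffixes: Iterable[str]) -> Dict[str, int]:
    """Recursively count files under `root` whose name ends with each suffix."""
    counts = {suffix: 0 for suffix in suffixes}
    if not root.is_dir():
        # A missing directory is reported as zero matches rather than an error.
        return counts
    for path in root.rglob("*"):
        if path.is_file():
            for suffix in counts:
                if path.name.endswith(suffix):
                    counts[suffix] += 1
    return counts


# Assumed locations, copied from the paths in the messages above.
ontonotes_annotations = Path("data/files/data/english/annotations")
conll_root = Path("conll-2012/v4/data")

print(count_by_suffix(ontonotes_annotations, [".parse"]))
print(count_by_suffix(conll_root, [".v4_gold_conll"]))
```

If the `.parse` count is zero, the script is probably not being pointed at the directory that holds the OntoNotes annotations, or that copy of the distribution does not contain them. The missing `.v4_gold_conll` files would then just be a consequence of the earlier step exiting.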