huggingface / hmtl

🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP

conll2012 setup issue #13

Closed: djshowtime closed this issue 5 years ago

djshowtime commented 5 years ago

Hello, thanks for raising this question.

We used pre-trained word embeddings (GloVe and ELMo). You can use the script scripts/data_setup.sh to download them and place them in a data folder.

Other datasets are also expected to be in the data folder (see the paths in the configuration files configs/*.json). For instance, we compile the CoNLL2012 coreference data using this script from AllenNLP: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh It compiles the CoNLL2012 data and dumps the coreference annotations into a single file. The CoNLL NER data is basically the same data as the coreference data, except that it has not been dumped into a single file (we can probably do something quick to avoid this data duplication). For the ACE data, we pre-process it so that the Mention Detection data matches a CoNLL-NER format and the Relation Extraction data matches a CoNLL-SRL format. Both are saved in a data/ace2005 folder.

If you want to use other datasets, it makes sense to place them in the data folder as well and to use (or, if needed, modify) the dataset_readers classes.

Victor

Hi, I want to reproduce your NER result. However, I ran into a problem when setting up the CoNLL-2012 data.

I used this script: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh. But it complained that a .parse file could not be found in the folder:

could not find the gold parse [.//data/files/data/english/annotations/bc/cctv/00/cctv_0001.parse] in the ontonotes distribution ... exiting ...

cat: 'conll-2012/v4/data/development/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/train/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/test/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory

Originally posted by @djshowtime in https://github.com/huggingface/hmtl/issues/2#issuecomment-481519182

Evpok commented 5 years ago

You will have to get the CoNLL-2012 data from http://conll.cemantix.org/2012/data.html

VictorSanh commented 5 years ago

Hello @djshowtime, as @Evpok said, you need to get the data first. I cannot distribute the data myself. But the script you are using is also the one I used! Victor
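
For anyone hitting the same `cat: ... No such file or directory` errors, here is a minimal sketch of a check for whether the gold files exist before the final concatenation step. The glob pattern and split names are copied from the error messages quoted in this thread. The helper names `count_gold_conll` and `missing_splits` are hypothetical and are not part of HMTL or AllenNLP, and the sketch assumes it is run from the same directory as compile_coref_data.sh.

```python
from glob import glob
from pathlib import Path

# Split names taken from the `cat` errors quoted in this thread.
SPLITS = ("development", "train", "test")

# Glob pattern copied from those error messages; it is relative to the
# directory in which compile_coref_data.sh was run (an assumption).
PATTERN = "conll-2012/v4/data/{split}/data/english/annotations/*/*/*/*.v4_gold_conll"


def count_gold_conll(root="."):
    """Return {split: number of *.v4_gold_conll files found under root}."""
    counts = {}
    for split in SPLITS:
        pattern = str(Path(root) / PATTERN.format(split=split))
        counts[split] = len(glob(pattern))
    return counts


def missing_splits(root="."):
    """Return the splits for which no gold file was found."""
    return [split for split, n in count_gold_conll(root).items() if n == 0]


if __name__ == "__main__":
    for split, n in count_gold_conll().items():
        status = "ok" if n else "MISSING"
        print(f"{split}: {n} gold files ({status})")
```

If every split is reported as MISSING, the conversion step most likely never produced any output, which is consistent with the "could not find the gold parse ... in the ontonotes distribution" message above: the source data has to be obtained and put in place first.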