I'm also not sure where the file devel_docids.txt comes from for the i2b2 and CLEF dataset creation scripts. Is it created elsewhere, or should I be downloading it from the public databases?
I'm also not sure whether I'm using the correct data files from the various data sources, and would greatly appreciate a README listing which files should be used from each download site.
> I also am not sure where the file devel_docids.txt comes from for the i2b2 and clef dataset creation scripts. Are they created elsewhere, or should I be downloading them from the public databases?
The devel_docids.txt of ShAReCLEFE can be found in the data.zip file at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/data.zip
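In case it helps, here is a minimal sketch of how you might fetch that archive and locate the file. The internal layout of data.zip is not documented here, so the snippet simply searches for any member named devel_docids.txt:

```python
# Sketch only: download data.zip and find devel_docids.txt inside it.
import io
import urllib.request
import zipfile

DATA_URL = "https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/data.zip"

with urllib.request.urlopen(DATA_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# List every member whose filename is devel_docids.txt, wherever it lives.
matches = [name for name in archive.namelist()
           if name.endswith("devel_docids.txt")]
print("Found:", matches)

# Extract everything under ./data/ so the dataset creation scripts can use it.
archive.extractall("data")
```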
> what scripts should be run to create all the different train/dev/test datasets?
We didn't provide scripts to create the train/dev datasets because there are different ways to preprocess the data, such as different tokenizers, sentence splitters, and parsers, and we don't want to limit users to our preprocessing methods. However, if you want to test the BERT model, you can download the preprocessed data at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/bert_data.zip and the code is at https://github.com/ncbi-nlp/BLUE_Benchmark/tree/master/blue/bert
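For illustration only, here is a minimal sketch of the kind of preprocessing choice that is left to the user. The regex sentence splitter and whitespace tokenizer below are placeholders, not the preprocessing used to build bert_data.zip; swap in whichever tools you prefer:

```python
import re

def split_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace.
    # A real pipeline might use NLTK, spaCy, or a clinical-text-aware splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Naive whitespace tokenizer; replace with any tokenizer you like.
    return sentence.split()

text = "The patient denies chest pain. No acute distress was noted."
for sent in split_sentences(text):
    print(tokenize(sent))
```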
Got it. Thank you.
Hi, I'm not sure if this is so much an issue as a workflow question, so apologies in advance if it doesn't fit here.
What scripts should be run to create all the different train/dev/test datasets? I see a bash script for creating some of the test sets, but not for creating the training sets, although those do seem to have Python scripts. Is there code that unifies this workflow?
Best, Oliver