Workflow for creating train/dev/test datasets

ncbi-nlp / BLUE_Benchmark

BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.

https://arxiv.org/abs/1906.05474

Workflow for creating train/dev/test datasets #5

Closed crowegian closed 5 years ago

crowegian commented 5 years ago

Hi, I'm not sure whether this is an issue so much as a workflow question, so apologies in advance if it doesn't fit here.

But what scripts should be run to create all the different train/dev/test datasets? I see a bash script for creating some test sets, but none for creating training sets, although those do seem to have Python scripts. Is there code for unifying this workflow?

Best, Oliver

crowegian commented 5 years ago

I also am not sure where the file devel_docids.txt comes from for the i2b2 and clef dataset creation scripts. Are they created elsewhere, or should I be downloading them from the public databases?

I also am not sure if I am using the correct data files from the various data sources and would greatly appreciate a readme with information on which files should be used from each download website.

yfpeng commented 5 years ago

I also am not sure where the file devel_docids.txt comes from for the i2b2 and clef dataset creation scripts. Are they created elsewhere, or should I be downloading them from the public databases?

The devel_docids.txt of ShAReCLEFE can be found in the data.zip file at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/data.zip
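As a rough sketch of how such a file might be used: assuming devel_docids.txt holds one document ID per line (an assumption, so check the actual file in data.zip), a corpus could be partitioned into train and dev sets along these lines. The function names and the example IDs are hypothetical and not part of the BLUE code.

```python
from pathlib import Path


def read_doc_ids(path):
    """Read document IDs, one per line, skipping blank lines (assumed format)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def split_by_doc_ids(all_doc_ids, devel_ids):
    """Partition document IDs into (train, dev), preserving input order."""
    train = [doc_id for doc_id in all_doc_ids if doc_id not in devel_ids]
    dev = [doc_id for doc_id in all_doc_ids if doc_id in devel_ids]
    return train, dev
```

For example, split_by_doc_ids(["doc1", "doc2", "doc3"], {"doc2"}) puts doc2 in the dev list and the other two in the train list.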

yfpeng commented 5 years ago

what scripts should be run to create all the different train/dev/test datasets?

We didn't provide a script to create the train/dev datasets because there are different ways to preprocess the data, such as different tokenizers, sentence splitters, and parsers. We don't want users to limit themselves to our preprocessing methods. However, if you want to test the BERT model, you can download the preprocessed data at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/bert_data.zip and the code is at https://github.com/ncbi-nlp/BLUE_Benchmark/tree/master/blue/bert
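A minimal sketch for fetching the preprocessed archive and inspecting what it contains, using only the Python standard library. The layout inside bert_data.zip is not described in this thread, so the code only lists member names; the helper names and the local file name are hypothetical.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the comment above.
BERT_DATA_URL = (
    "https://github.com/ncbi-nlp/BLUE_Benchmark/releases/download/0.1/bert_data.zip"
)


def download(url, dest):
    """Download url to dest unless a file already exists at dest."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def list_members(zip_path, suffix=""):
    """Return the sorted archive member names that end with suffix."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(suffix))


def main():
    """Fetch the archive (network access required) and print its contents."""
    archive = download(BERT_DATA_URL, "bert_data.zip")
    for name in list_members(archive):
        print(name)
```

Calling main() downloads the archive into the current directory and prints every member name; list_members can be pointed at data.zip in the same way to locate devel_docids.txt.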

crowegian commented 5 years ago

Got it. Thank you.