Is the i2b2-2010 dataset used?

ncbi-nlp / BLUE_Benchmark

BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.

https://arxiv.org/abs/1906.05474

Is the i2b2-2010 dataset used? #6

Closed by crowegian 5 years ago

crowegian commented 5 years ago

Hi,

I'm working on recreating the datasets and I think there's a discrepancy between the NCBI BERT GitHub code and the benchmark code. The i2b2 processing code uses the 2010 dataset, while the README in the NCBI BERT dataset seems to use the 2012 i2b2 data. In the code that runs on the i2b2 data, there is no mention of the labels used in the processing code, and the task seems to have changed from relation extraction to named entity recognition. The paper also discusses i2b2 as a relation extraction dataset. Is there code available for modeling this task?

I'm also a bit confused why the processing code replaces tokens in the input text with special tokens like @problem$. This could be part of the task, but it would seem to me that keeping those tokens would provide important information.

Thank you for your help.

Best, Oliver

yfpeng commented 5 years ago

> The i2b2 processing code uses the 2010 dataset, while the README in the NCBI BERT dataset seems to use the 2012 i2b2 data. In the code that runs on the i2b2 data, there is no mention of the labels used in the processing code, and the task seems to have changed from relation extraction to named entity recognition. The paper also discusses i2b2 as a relation extraction dataset. Is there code available for modeling this task?

Sorry for the confusion. We used the i2b2-2010 data for the relation extraction task. We didn't use the i2b2-2012 data for the NER tasks. I have therefore removed the code from run_ncbi_ner.py and added code to run_ncbi.py.

yfpeng commented 5 years ago

> I'm also a bit confused why the processing code replaces tokens in the input text with special tokens like @problem$. This could be part of the task, but it would seem to me that keeping those tokens would provide important information.

This is one way to use BERT models for the relation extraction tasks. There are of course other ways, such as keeping those tokens. Feel free to explore more.
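For readers following along, here is a minimal sketch of the replacement being discussed. It is not the repository's processing code. The function name `mask_entities`, the `(start, end, type)` span format, and the example sentence are all hypothetical, and the sketch assumes token-level, non-overlapping spans. The entity types `problem` and `treatment` follow the i2b2-2010 concept categories.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def mask_entities(tokens: List[str], spans: List[Span]) -> List[str]:
    """Replace each entity span with one placeholder token such as @problem$.

    Assumes the spans do not overlap.
    """
    out = list(tokens)
    # Work from right to left so that earlier token indices stay valid
    # after a multi-token span collapses into a single placeholder.
    for start, end, entity_type in sorted(spans, key=lambda s: s[0], reverse=True):
        out[start:end] = [f"@{entity_type}$"]
    return out


# Made-up example sentence, not taken from the i2b2 data.
tokens = "The patient was given aspirin for chest pain .".split()
spans = [(4, 5, "treatment"), (6, 8, "problem")]
masked = " ".join(mask_entities(tokens, spans))
print(masked)
```

With this kind of masking the classifier sees only the entity types and the surrounding context, not the entity text itself. Keeping the original tokens, as suggested in the question, would be a different input representation worth comparing.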

crowegian commented 5 years ago

That clears up a lot of confusion. Thank you!

Ritika2001 commented 4 years ago

The README mentions that the 170 training documents and 256 test documents in the i2b2 dataset are only a subset. Where can I obtain the entire dataset? After applying for the dataset, only the zip files with the 170 and 256 documents are present.