preprocessing code for i2b2 dataset

uf-hobi-informatics-lab / ClinicalTransformerNER

a library for named entity recognition developed by UF HOBI NLP lab featuring SOTA algorithms

MIT License


preprocessing code for i2b2 dataset #25

Closed: Lavenderjiang closed this issue 2 years ago

Lavenderjiang commented 2 years ago

Hello, thanks for creating this library. I am trying to reproduce the BERT results on i2b2 2010, i2b2 2012, and n2c2 2018. However, I am having trouble converting these datasets into the CoNLL-2003-style txt files shown in test_data. I assume the preprocessing scripts differ for each dataset, because i2b2 2010 (txt, con) and i2b2 2012 (txt, extent, tlink) use different file extensions.

Is it possible to release the preprocessing scripts for easier reproducibility?

bugface commented 2 years ago

i2b2 2010 dataset - The data currently released is not the version originally released during the challenge. We used a version preprocessed by our collaborator, and we do not have access to their source code.
i2b2 2012 dataset - You can convert the released data to brat standoff format (https://brat.nlplab.org/standoff.html) and then follow our tutorial to convert the brat format to BIO format.
n2c2 2018 dataset - Please see our paper (https://academic.oup.com/jamia/article/27/1/65/5555856?login=true) for how we preprocessed the data.
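
For the brat-to-BIO step, a rough sketch of the idea may help. This is not the code from the tutorial; it is a minimal illustration that assumes whitespace tokenization, one sentence per line of the note, and only continuous text-bound annotations (the `T` lines of the standoff format). The function names `parse_ann`, `to_bio`, and `format_bio` are hypothetical, and the two-column token/tag output is an assumption, so check the files in test_data for the exact columns expected.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def parse_ann(ann_text: str) -> List[Span]:
    """Collect continuous text-bound annotations ('T' lines) from brat standoff text."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, and notes
        info = line.split("\t")[1]  # e.g. "problem 12 22"
        if ";" in info:
            continue  # discontinuous spans are out of scope for this sketch
        label, start, end = info.split(" ")
        spans.append((int(start), int(end), label))
    return sorted(spans)


def to_bio(text: str, spans: List[Span]) -> List[List[Tuple[str, str]]]:
    """Tag whitespace tokens with BIO labels, treating each line as one sentence."""
    sentences = []
    offset = 0  # character offset of the current line within the whole note
    for line in text.splitlines(keepends=True):
        rows = []
        for match in re.finditer(r"\S+", line):
            tok_start = offset + match.start()
            tok_end = offset + match.end()
            tag = "O"
            for start, end, label in spans:
                if tok_start < end and tok_end > start:  # token overlaps the span
                    prefix = "B-" if tok_start <= start else "I-"
                    tag = prefix + label
                    break
            rows.append((match.group(), tag))
        if rows:
            sentences.append(rows)
        offset += len(line)
    return sentences


def format_bio(sentences: List[List[Tuple[str, str]]]) -> str:
    """Write one 'token tag' pair per line, with a blank line between sentences."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


note = "Patient has chest pain .\nGiven aspirin daily .\n"
ann = "T1\tproblem 12 22\tchest pain\nT2\ttreatment 31 38\taspirin\n"
bio = to_bio(note, parse_ann(ann))
print(format_bio(bio))
```

Real clinical notes usually need a proper sentence splitter and tokenizer, and the tokenization used here has to match whatever was used when the character offsets were produced, otherwise entity boundaries can shift.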

Lavenderjiang commented 2 years ago

Thanks for your prompt response! Happy holidays.