uf-hobi-informatics-lab / ClinicalTransformerNER

a library for named entity recognition developed by UF HOBI NLP lab featuring SOTA algorithms
MIT License

N2C2 data preprocessing #38

Open mnishant2 opened 8 months ago

mnishant2 commented 8 months ago

Hello, the brat2bio.ipynb notebook does not work for the n2c2 2018 dataset. Do you know if any changes are needed for it to work with n2c2?

bugface commented 8 months ago

Can you post the actual errors?

mnishant2 commented 8 months ago

> Can you post the actual errors?

There is no error; it just doesn't seem to work: a lot of drug/reason entities go undetected after a certain point. Also, please confirm that `sent_offset += (len(line.strip()) + 1)` (with +1, not +2) is correct for n2c2.
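For reference, here is my understanding of the offset bookkeeping (a minimal sketch, not the notebook's actual code): the `+1` accounts for the single `\n` separator between lines, so `+2` would only be right for files with Windows-style `\r\n` endings, where the stripped length misses two characters.

```python
def line_offsets(text):
    """Yield (line, start_offset) pairs for newline-separated text,
    keeping offsets aligned with brat's character-level stand-off spans."""
    sent_offset = 0
    for line in text.split("\n"):
        yield line, sent_offset
        # +1 for the "\n" consumed by split(); it would be +2 only
        # if the text used "\r\n" line endings and "\r" were stripped too
        sent_offset += len(line) + 1
```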

bugface commented 8 months ago

Can you try https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER/blob/master/tutorial/pipeline_preprocessing_model_training_prediction.ipynb? For preprocessing we use https://github.com/uf-hobi-informatics-lab/NLPreprocessing, which we used in all of our previous work.

Also, in the n2c2 2018 dataset, some ADE and Reason annotations overlap. What we did before was to keep three separate copies of the annotations, one each for Drug, ADE, and Reason, so no single copy contains overlapping spans; a sketch of that splitting is below.
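A minimal illustrative sketch of that splitting (not our actual preprocessing code; the paths and parsing are assumptions):

```python
# Split a brat .ann file into per-type copies so that overlapping
# ADE/Reason spans end up in separate, non-overlapping files.
# Non-entity lines (relations, attributes) are dropped, which is fine
# for NER where only T (text-bound) annotations are converted to BIO.
from collections import defaultdict
from pathlib import Path

def split_ann_by_type(ann_path, out_dir, types=("Drug", "ADE", "Reason")):
    by_type = defaultdict(list)
    for line in Path(ann_path).read_text().splitlines():
        # a brat text-bound line looks like: "T1\tADE 10 15\tnausea"
        if line.startswith("T"):
            etype = line.split("\t")[1].split(" ")[0]
            if etype in types:
                by_type[etype].append(line)
    for etype, lines in by_type.items():
        out = Path(out_dir) / etype / Path(ann_path).name
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(lines) + "\n")
```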

Lastly, I recommend checking out our ClinicalTransformerMRC project, which can handle overlapping entities: https://github.com/uf-hobi-informatics-lab/ClinicalTransformerMRC

mnishant2 commented 7 months ago

Thanks, that worked. I now have questions about the hyperparameter values/tuning needed to reproduce the general BERT and RoBERTa results on all the datasets from Table 2 in the paper; I am unable to reproduce the exact numbers. It would be really helpful if you could point me to those. I have also contacted the corresponding author by email.

bugface commented 6 months ago

> Thanks, that worked. I now have questions about the hyperparameter values/tuning needed to reproduce the general BERT and RoBERTa results on all the datasets from Table 2 in the paper; I am unable to reproduce the exact numbers. It would be really helpful if you could point me to those. I have also contacted the corresponding author by email.

How far off are your results? If they are within 0.002, they should be OK. We use random seed = 42, batch size = 4, and learning rate = 1e-5.
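For what it's worth, a minimal sketch of pinning those values (the seeding below is standard PyTorch practice, not necessarily identical to what the repo's training script does internally):

```python
import random
import numpy as np
import torch

SEED = 42           # seed reported above
BATCH_SIZE = 4      # training batch size reported above
LEARNING_RATE = 1e-5

def set_seed(seed: int = SEED) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and all GPUs) so that
    fine-tuning runs are repeatable across restarts."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```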