obi-ml-public / ehr_deidentification

Robust de-identification of medical notes using transformer architectures
MIT License

Detecting New Entities In Model #13

Open Evangel-coder opened 10 months ago

Evangel-coder commented 10 months ago

Hi,

I would like to ask how I can train or fine-tune the model to detect new entities with the "ID" tag, e.g. Medicare numbers or phone numbers with the +65 prefix. I'd really appreciate any insights on that!

prajwal967 commented 9 months ago

Hi,

Sorry for the delayed response. Do you have a dataset with these new entities that you want to train the model on?

If you do, then you would need to get the data in this form: notes.jsonl

Then you can follow the steps given in this notebook: Train.ipynb and replace the files accordingly.
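For readers preparing such a file, here is a minimal sketch of writing one JSON-lines record with character-offset spans. It assumes a Prodigy-style layout (`text`, `meta`, `spans` with `start`/`end`/`label`); the exact field names should be checked against the repo's own notes.jsonl example. The regex for "+65" phone numbers, the `make_record` and `write_jsonl` helpers, and the sample sentence are all hypothetical and only illustrate how pre-annotations could be bootstrapped before manual review.

```python
import json
import re

# Hypothetical pattern: "+65" followed by 8 digits, optionally split 4-4.
PHONE_SG = re.compile(r"\+65[ -]?\d{4}[ -]?\d{4}")


def make_record(text, note_id, label="ID"):
    """Build one record with character-offset spans (assumed Prodigy-style layout)."""
    spans = [
        {"id": str(i), "start": m.start(), "end": m.end(), "label": label}
        for i, m in enumerate(PHONE_SG.finditer(text))
    ]
    return {"text": text, "meta": {"note_id": note_id}, "spans": spans}


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


# Made-up sentence, not real patient data.
example = make_record("Patient can be reached at +65 9123 4567 after 5pm.", "1")
```

Regex matches like these are only a starting point; each span would still need to be reviewed by an annotator before training.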

Let us know if that doesn't work!

Evangel-coder commented 9 months ago

Hi, that's alright! I really appreciate the quick comment. I'm still annotating the dataset manually and have some questions to clarify:

  1. Do we have to use Prodigy, or can we just annotate the dataset manually?
  2. To train it to detect Medicare numbers, can I use a dataset that contains only Medicare numbers? If so, roughly how many data points would be needed? Also, I saw that there was this model available on Hugging Face:
  3. Would it be possible to just use the AutoTokenizer and AutoModel classes to load and train the model? That would skip the tedious process of setting up the libraries the model needs to work.

Really appreciate the insights given!

prajwal967 commented 8 months ago

Hi, sorry for the delay.

  1. We used Prodigy for annotation - while you can do it manually, it is more efficient to do it using Prodigy.
  2. Yes, if you want to train a model only to detect medicare number, you can train it against a dataset with only medicare number. However, this model won't be able to predict other attributes (e.g. name, date etc).
  3. Yes, if you are training a new model from scratch you can use the AutoClasses. If you have the dataset you can follow the steps given here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py

    • Our code mostly follows their approach, but their code is more up to date and might be a better starting point if you're training something new.

    Let us know if you have any other questions, thanks!
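As a rough illustration of the AutoClasses route from point 3, the sketch below defines a label set that includes a new "ID" entity and loads a tokenizer and token-classification model sized to it. The label list and the `load_for_token_classification` helper are hypothetical; the checkpoint name is left to the caller, and the actual training loop would still follow run_ner.py.

```python
# Hypothetical label set: a few BIO tags plus the new "ID" entity.
LABELS = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-ID", "I-ID"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def load_for_token_classification(checkpoint):
    """Load a tokenizer and a model whose classification head matches LABELS."""
    # Imported lazily so the label maps can be inspected without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
        # Re-initialises the head if the checkpoint was trained with a different label count.
        ignore_mismatched_sizes=True,
    )
    return tokenizer, model
```

Note that when the head is re-initialised because the label counts differ, the model has to relearn every label, not only the new one, so the training data should cover all entity types that matter.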

Evangel-coder commented 8 months ago

Hi,

I appreciate the informative response and wanted to clarify: I can fine-tune the model so that it also detects Medicare numbers well, as long as I have the I2B2 data with a variety of Medicare numbers included in it, and that data is in the format described in the repo. Is that right? Hope to hear from you soon!

prajwal967 commented 8 months ago

Yes, that sounds about right! Let us know if there are any issues, thanks!

Evangel-coder commented 8 months ago

Alright, thanks for that clarification. I was under the impression that if I fine-tuned the model on US Medicare numbers alone, it would add 'Medicare number' to what it can detect while continuing to detect other attributes like 'name' and 'date' at the same accuracy. Is that the case?

prajwal967 commented 8 months ago

Yes, that could work, but there is a possibility that the model might forget what it has learnt previously (i.e. the accuracy of detecting other types of PHI might decrease).

Also, do you have a dataset with just Medicare numbers? If so, what does that dataset look like?
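One common way to reduce the forgetting risk mentioned in this reply is to mix some of the original training notes back in with the new ones, in line with the earlier suggestion of combining the I2B2 data with Medicare-number examples. The sketch below is a generic illustration rather than the maintainers' method; `mix_with_replay` and its `replay_ratio` parameter are hypothetical names.

```python
import random


def mix_with_replay(new_records, original_records, replay_ratio=1.0, seed=0):
    """Combine new records with a random sample of the original training records.

    replay_ratio is the number of original records to include per new record,
    capped at the number of original records available.
    """
    rng = random.Random(seed)
    k = min(len(original_records), int(len(new_records) * replay_ratio))
    replay = rng.sample(list(original_records), k)
    mixed = list(new_records) + replay
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for annotated notes.
new = [{"note_id": f"new-{i}"} for i in range(3)]
original = [{"note_id": f"orig-{i}"} for i in range(10)]
mixed = mix_with_replay(new, original, replay_ratio=1.0)
```

It is still worth evaluating on a held-out set that contains the older PHI types, so any drop in their accuracy shows up.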

Evangel-coder commented 8 months ago

Hi, I've got a dataset; it's just in JSON format, as described in the instructions provided. I'll try it out first and will let you know the results. Is there an email address I can use to contact you, if possible?