uf-hobi-informatics-lab / ClinicalTransformerNER

A library for named entity recognition developed by the UF HOBI NLP lab, featuring SOTA algorithms
MIT License

No such file or directory: label2idx.json #13

Closed mnhcorp closed 3 years ago

mnhcorp commented 3 years ago

Hi,

I'm trying to run a batch prediction as follows:

python ./src/run_transformer_batch_prediction.py \
      --model_type bert \
      --pretrained_model models/mimiciii_bert_10e_128b/ \
      --raw_text_dir ./raw-mimic/ \
      --preprocessed_text_dir ./iob-mimic/ \
      --output_dir ./prediction-results \
      --max_seq_length 512 \
      --do_lower_case \
      --eval_batch_size 8 \
      --log_file ./log.txt \
      --do_format 0 \
      --do_copy

Running into this error:

Traceback (most recent call last):
  File "./src/run_transformer_batch_prediction.py", line 123, in <module>
    main(global_args)
  File "./src/run_transformer_batch_prediction.py", line 31, in main
    label2idx = json_load(os.path.join(args.pretrained_model, "label2idx.json"))
  File "/home/ubuntu/mimic2iob/ClinicalTransformerNER/src/common_utils/common_io.py", line 32, in json_load
    with open(ifn, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/mimiciii_bert_10e_128b/label2idx.json'

I've downloaded the pre-trained BERT base + MIMIC model from here:
https://transformer-models.s3.amazonaws.com/mimiciii_bert_10e_128b.zip

I don't see label2idx.json present after extracting the archive:

$ ls -ltr models/mimiciii_bert_10e_128b/
total 430396
-rw-r--r-- 1 ubuntu ubuntu    231508 Dec 11  2019 vocab.txt
-rw-r--r-- 1 ubuntu ubuntu       170 Dec 11  2019 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu       112 Dec 11  2019 special_tokens_map.json
-rw-r--r-- 1 ubuntu ubuntu         2 Dec 11  2019 added_tokens.json
-rw-r--r-- 1 ubuntu ubuntu 440470760 Dec 11  2019 pytorch_model.bin
-rw-r--r-- 1 ubuntu ubuntu       566 Dec 11  2019 config.json

Any help would be much appreciated. Thanks for your project!

bugface commented 3 years ago

The model you downloaded is a pre-trained language model, not a NER model. If you want to reproduce our results, tell us which dataset you are using so we can provide the corresponding model. If you want to develop a model on your own dataset, you need to train a NER model on your own training data first, then run prediction/evaluation. During training, we create label2idx.json based on your data.

We trained NER models on the 2010 i2b2, 2012 i2b2, and 2018 n2c2 datasets. If your dataset is not one of them, you have to train your own NER model using the training module.
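For context, label2idx.json is just a JSON object mapping each IOB label seen in the training data to an integer index. A minimal sketch of how such a mapping could be built from IOB-formatted training data (the file layout and last-column label position are assumptions for illustration, not the package's actual code):

```python
import json

def build_label2idx(train_file):
    """Collect IOB labels from a CoNLL/IOB-style file where the label is
    assumed to be the last whitespace-separated column of each line."""
    labels = set()
    with open(train_file) as f:
        for line in f:
            line = line.strip()
            if line:  # blank lines separate sentences
                labels.add(line.split()[-1])
    # Deterministic ordering: "O" first, then the remaining labels sorted
    ordered = ["O"] + sorted(labels - {"O"})
    return {label: idx for idx, label in enumerate(ordered)}

# Example: write the mapping next to a trained model (paths are hypothetical)
# mapping = build_label2idx("train.txt")
# with open("models/my_ner_model/label2idx.json", "w") as f:
#     json.dump(mapping, f, indent=2)
```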

mnhcorp commented 3 years ago

Understood, thank you.

Can you provide a link to the NER model trained with the 2018 n2c2 dataset?

bugface commented 3 years ago
  1. For the 2018 n2c2 dataset, we trained three NER models, for drug + drug attributes, reason, and ADE respectively, then combined the results into BRAT format for evaluation, because reason and ADE have overlapping annotated entities. If you want to use our models, you have to preprocess your data accordingly.

  2. It will take some time to upload the models to Amazon S3 and create download links. I will upload all the models in the next few days when I have time.

mnhcorp commented 3 years ago

Hi,

Thanks - don't want to take up a lot of your time. Perhaps I can try to generate my own training data and run the scripts.

Quick question: Is there an easy way to generate training data (from free-form text) in the IOB format, as specified in the test_data folder?

Thanks Again.

bugface commented 3 years ago
  1. To generate IOB from BRAT data, you can use https://github.com/nlplab/brat/blob/master/tools/anntoconll.py. Note that this script may not generate position information the way we do, but the IOB it produces should still be trainable and testable with our package; you may just not be able to convert the model output back to BRAT. I have not investigated anntoconll.py carefully, but the point is that open-source solutions exist online.

  2. We are currently working on a tutorial demonstrating data generation and training. It should come out in the next few weeks.
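To give a sense of what an anntoconll.py-style conversion does, here is a simplified, hypothetical sketch: it aligns BRAT entity spans (T-lines with character offsets) against whitespace tokens and emits IOB labels. It is not the package's own preprocessing, handles only single-fragment spans, and whitespace tokenization is a stand-in for a real clinical tokenizer:

```python
import re

def brat_to_iob(text, ann_lines):
    """Convert raw text plus BRAT annotation lines of the form
    'T<id>\\tTYPE START END\\tmention' into (token, IOB-label) pairs."""
    # Parse entity spans as (start, end, type); skip non-entity lines
    spans = []
    for line in ann_lines:
        if line.startswith("T"):
            _, info, _ = line.split("\t")
            etype, start, end = info.split()[:3]
            spans.append((int(start), int(end), etype))
    # Whitespace tokens with their character offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    iob = []
    for tok, ts, te in tokens:
        label = "O"
        for ss, se, etype in spans:
            if ts >= ss and te <= se:
                # First token of the span gets B-, later tokens get I-
                label = ("B-" if ts == ss else "I-") + etype
                break
        iob.append((tok, label))
    return iob
```

A quick example with a made-up annotation: `brat_to_iob("Patient takes aspirin daily", ["T1\tDrug 14 21\taspirin"])` tags "aspirin" as B-Drug and everything else as O.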

mnhcorp commented 3 years ago

Thank you.

Will look forward to the tutorials.

jiwonjoung commented 2 years ago
>   1. For the 2018 n2c2, we trained 3 NER models for drug + drug attributes, reason, and ADE, respectively then combine the results together into brat format for evaluation due to the fact that reason and ADE have overlapped annotated entities. If you want to use our models, you have to preprocess your data accordingly.
>   2. It will take some time for me to upload the models to amazon s3 then create the download links. I will upload all the models in the next few days when I have time.

Hello! Have you uploaded the NER model for the n2c2 dataset? Where can I find it? Thanks.