ncbi-nlp / bluebert

BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
https://arxiv.org/abs/1906.05474

KeyError when running run_bluebert_multi_labels.py #20

Closed snjie209 closed 4 years ago

snjie209 commented 4 years ago

Hi,

I am trying to fine-tune BlueBERT to classify a set of clinical notes in a binary classification task. I have set up my train.tsv and dev.tsv files as follows:

1   1   a   Assessment and Plan... <more notes here>

I was not sure whether this is the right format for BlueBERT, but for BERT, the following article (https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7) describes this format for the tsv input data:

Column 1: An ID for the row (can be just a count, or even just the same number or letter for every row, if you don’t care to keep track of each individual example).
Column 2: A label for the row as an int. These are the classification labels that your classifier aims to predict.
Column 3: A column of all the same letter — this is a throw-away column that you need to include because the BERT model expects it.
Column 4: The text examples you want to classify.

However, when I run the following code:

python ../bluebert/bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=2 \
  --num_aspects=2 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --aspect_value_list="0,1"

I get the following error:

/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x124253730>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/Users/sambamamba/Documents/SCPD/CS_230/Project/sywang/lowva_bluebert', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1245052e8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 4957
Traceback (most recent call last):
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 920, in <module>
    tf.app.run()
  File "/Users/sambamamba/anaconda/envs/blue_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 811, in main
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 400, in file_based_convert_examples_to_features
    max_seq_length, tokenizer)
  File "../bluebert/bluebert/run_bluebert_multi_labels.py", line 366, in convert_single_example
    label_id = label_map[example.label]
KeyError: '2'

Looking into run_bluebert_multi_labels.py, it seems that the label_map variable is populated from the num_aspects and aspect_value_list flags. On line 233 of this file, the get_labels method creates the label_list, which is then fed into label_map:

def get_labels(self):
        """See base class."""
        label_list = []
        # num_aspect=FLAGS.num_aspects
        aspect_value_list = FLAGS.aspect_value_list  # [-2,-1,0,1]
        for i in range(FLAGS.num_aspects):
            for value in aspect_value_list:
                label_list.append(str(i) + "_" + str(value))
        return label_list  # [ {'0_-2': 0, '0_-1': 1, '0_0': 2, '0_1': 3,....'19_-2': 76, '19_-1': 77, '19_0': 78, '19_1': 79}]

which is fed into line 277:

label_map = {}
for (i, label) in enumerate(label_list):
    label_map[label] = i

The label_map keys would then look like 0_-2, 0_-1, etc. I printed both values right before the line that raises the error (line 365) and saw that

label_map is {'0_0': 0, '0_1': 1, '1_0': 2, '1_1': 3}
example.label is 2

So when we run label_id = label_map[example.label], we get a KeyError. So why is example.label being looked up against these underscored keys? Am I missing something here?
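The mismatch can be reproduced outside the script. This is a minimal sketch, not the repository's code: `build_label_map` is a hypothetical helper that mirrors the two snippets quoted above, and it assumes the aspect_value_list flag arrives as a list of strings.

```python
def build_label_map(num_aspects, aspect_value_list):
    """Hypothetical helper mirroring get_labels plus the label_map loop."""
    label_list = []
    for i in range(num_aspects):
        for value in aspect_value_list:
            label_list.append(str(i) + "_" + str(value))
    return {label: index for index, label in enumerate(label_list)}


label_map = build_label_map(num_aspects=2, aspect_value_list=["0", "1"])
print(label_map)  # {'0_0': 0, '0_1': 1, '1_0': 2, '1_1': 3}

# A bare class id such as "2" is not one of the generated keys.
try:
    label_map["2"]
except KeyError as err:
    print("KeyError:", err)  # KeyError: '2'
```

Every key has the form aspect-index, underscore, value, so any label in the data that is not written that way will fail the lookup.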

yfpeng commented 4 years ago

Please use "," to separate the labels. For example,

labels \t sentence
0_0,1_0,2_0,3_0,4_0,5_0,6_0,7_0,8_0,9_0 \t Assessment and Plan... <more notes here>

snjie209 commented 4 years ago

Thanks for the quick response. Are you saying also that we should have four columns in train.tsv?

Also, does each label have to be in “0_1” underscore format? What is this meant to illustrate?

And in your code snippet, are you illustrating one row of data?

Thanks for reading

yfpeng commented 4 years ago

  1. Two columns: one for the labels and the other for the text.
  2. No. You need to decide how to represent multi-labels yourself.
  3. The header and one row of data.
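A minimal sketch of writing a file in that two-column layout. It assumes get_labels is left unchanged, so each label must be written as an aspect index, an underscore, and a value, and it assumes every aspect takes the value 0 or 1. `encode_labels` and the sample rows are hypothetical and not part of the repository.

```python
import csv
import io


def encode_labels(active_aspects, num_aspects):
    """Return e.g. '0_0,1_1,2_0' with value 1 for each active aspect index."""
    return ",".join(
        f"{i}_{1 if i in active_aspects else 0}" for i in range(num_aspects)
    )


# Hypothetical examples: (set of active aspect indices, text).
rows = [({1}, "Assessment and Plan ..."), (set(), "Prognosis ...")]

buffer = io.StringIO()  # stand-in for open("train.tsv", "w", newline="")
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["labels", "sentence"])  # header row, then one row per example
for active, text in rows:
    writer.writerow([encode_labels(active, num_aspects=3), text])

print(buffer.getvalue())
```

With this encoding, --num_aspects would be 3 and --aspect_value_list would be "0,1", so that every string written to the labels column also appears in label_map.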

snjie209 commented 4 years ago

Okay, thanks again. Just to clarify: if I only have a binary classification task with labels 0 and 1, then I assume the format can be

0 \t Assessment and Plan ...
1 \t Prognosis... 

Above I am illustrating two rows of data: the first row with a label of 0, the second row with a label of 1. There are no headers in the above.

yfpeng commented 4 years ago

For binary classification, please use run_bluebert.py

snjie209 commented 4 years ago

Thanks Yifan. It seems to be running for me now with run_bluebert.py.

As a note to other readers, it seems that the KeyError is mainly an issue in the original Google Research BERT GitHub repository. A lot of folks (e.g. https://github.com/google-research/bert/issues/559) filed issues with a similar error, and they had to edit the get_labels method. In my case, I changed get_labels to return ['0', '1'] to fit the labels of my binary classification task in run_bluebert.py.
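A minimal sketch of that change, assuming a processor class shaped like those in the original BERT run_classifier.py. `BinaryProcessor` is a hypothetical name, not a class from the repository.

```python
class BinaryProcessor:
    """Hypothetical processor for a two-class task."""

    def get_labels(self):
        # The strings must match the label column of the .tsv files exactly.
        return ["0", "1"]


label_list = BinaryProcessor().get_labels()
label_map = {label: i for i, label in enumerate(label_list)}

# Every label read from the data is now a valid key.
for raw_label in ["0", "1", "1"]:
    print(raw_label, "->", label_map[raw_label])
```

The labels are kept as strings because the value read from the file is a string, as the KeyError: '2' in the traceback shows.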

AliNazeri commented 1 year ago

I want to use run_bluebert_multi_labels.py for MIMIC-IV. I have split the data into train.tsv and test.tsv. When I run the script, I receive an error. How should I format my labels? Right now they look like 1sda2,1s6w6,5fef,... Should they be in the 1_0,2_0,... format instead?