google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0

Error in PubMed evaluation using run_summarization.py #15

Open Amit-GH opened 3 years ago

Amit-GH commented 3 years ago

I am using the script roberta_base.sh to train and test the model on the PubMed summarization task. Training runs successfully for multiple steps (5000), but the run fails at evaluation time. Part of the error output is below.

I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "bigbird/summarization/run_summarization.py", line 534, in <module>
    app.run(main)
...
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)

I am not very familiar with the code or with this error. I searched online but did not find much help, so I hope you can help. Below is the command I ran to reproduce the error:

python3 bigbird/summarization/run_summarization.py \
  --data_dir="tfds://scientific_papers/pubmed" \
  --output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
  --attention_type=block_sparse \
  --couple_encoder_decoder=True \
  --max_encoder_length=3072 \
  --max_decoder_length=256 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=2 \
  --eval_batch_size=4 \
  --num_train_steps=1000 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --save_checkpoints_steps=1000 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0

prathameshk commented 3 years ago

I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code is run on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?
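
The RuntimeError above reports that a scalar tensor (shape=()) reached the TPU outfeed, while TPUEstimator expects every outfed tensor to keep a leading batch dimension. One workaround often used in TPU code is to reshape a scalar to shape [1] before it is handed to eval_metrics or host_call. The sketch below shows only that reshape. It assumes the offending tensor is a scalar loss, which nothing in this thread confirms, and the names add_batch_dim and total_loss are illustrative rather than taken from run_summarization.py.

```python
import tensorflow as tf


def add_batch_dim(scalar):
  """Reshapes a rank-0 tensor to shape [1] so it has a leading dimension."""
  return tf.reshape(scalar, [1])


# Stand-in for a scalar loss computed inside a model_fn (hypothetical value).
total_loss = tf.constant(2.5, dtype=tf.float32)

# The reshaped tensor is rank 1 with a single element.
batched_loss = add_batch_dim(total_loss)
print(batched_loss.shape)
```

If this assumption holds, the reshaped tensor would replace the scalar in the list of tensors passed to eval_metrics, and the metric function could then average it, for example with tf.compat.v1.metrics.mean.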

gymbeijing commented 3 years ago

Hi @prathameshk, may I ask how you fine-tuned the model on your custom dataset? I tried replacing data_dir with a path that contains TFRecord files, but I got this error:

(0) Invalid argument: Feature: document (data type: string) is required but could not be found.
          [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
          [[MultiDeviceIteratorGetNextFromShard]]
          [[RemoteCall]]
          [[IteratorGetNext]]
          [[Mean/_19475]]

Update: I solved this problem by replacing the fields in name_to_features with the actual field names in the TFRecord file.
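
A minimal sketch of that fix, under stated assumptions: the field names "article" and "abstract" are hypothetical stand-ins for whatever a custom dataset actually stores, and the record is built in memory so the example is self-contained. The idea is to list the field names a record really contains, then declare name_to_features with exactly those names so that parsing no longer looks for a missing "document" field.

```python
import tensorflow as tf


def _bytes_feature(text):
  """Wraps a Python string as a bytes feature."""
  return tf.train.Feature(
      bytes_list=tf.train.BytesList(value=[text.encode("utf-8")]))


# Build one serialized record in memory. "article" and "abstract" are
# hypothetical field names, not the ones used by any particular dataset.
example = tf.train.Example(features=tf.train.Features(feature={
    "article": _bytes_feature("a long input document"),
    "abstract": _bytes_feature("a short summary"),
}))
serialized = example.SerializeToString()

# Step 1: list the field names the record actually contains.
decoded = tf.train.Example.FromString(serialized)
record_fields = sorted(decoded.features.feature.keys())
print(record_fields)

# Step 2: declare name_to_features with those same names and parse.
name_to_features = {
    "article": tf.io.FixedLenFeature([], tf.string),
    "abstract": tf.io.FixedLenFeature([], tf.string),
}
parsed = tf.io.parse_single_example(serialized, name_to_features)
```

For records stored on disk, the serialized strings could be read with tf.data.TFRecordDataset instead of being built by hand, and the same inspection step applied to the first record.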

Amit-GH commented 3 years ago

If you haven't already, check out the HuggingFace implementation of BigBird. It may be easier to use and to integrate with your project.