Error while training VisionEncoderDecoderModel ValueError: one or more references are empty strings

VallabhMahajan1 commented 1 year ago

I was trying to train a VisionEncoderDecoderModel and got the error below. The decoder is bert-base-multilingual-cased and the encoder is google/vit-base-patch16-224. How can I solve this error? Thanks in advance!

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 13>:13                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1662 in train                    │
│                                                                                                  │
│   1659 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1660 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1661 │   │   )                                                                                 │
│ ❱ 1662 │   │   return inner_training_loop(                                                       │
│   1663 │   │   │   args=args,                                                                    │
│   1664 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1665 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2006 in _inner_training_loop     │
│                                                                                                  │
│   2003 │   │   │   │   │   self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epo  │
│   2004 │   │   │   │   │   self.control = self.callback_handler.on_step_end(args, self.state, s  │
│   2005 │   │   │   │   │                                                                         │
│ ❱ 2006 │   │   │   │   │   self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_k  │
│   2007 │   │   │   │   else:                                                                     │
│   2008 │   │   │   │   │   self.control = self.callback_handler.on_substep_end(args, self.state  │
│   2009                                                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2287 in _maybe_log_save_evaluate │
│                                                                                                  │
│   2284 │   │   │   │   │   )                                                                     │
│   2285 │   │   │   │   │   metrics.update(dataset_metrics)                                       │
│   2286 │   │   │   else:                                                                         │
│ ❱ 2287 │   │   │   │   metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)                 │
│   2288 │   │   │   self._report_to_hp_search(trial, self.state.global_step, metrics)             │
│   2289 │   │                                                                                     │
│   2290 │   │   if self.control.should_save:                                                      │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py:159 in evaluate          │
│                                                                                                  │
│   156 │   │   )                                                                                  │
│   157 │   │   self._gen_kwargs = gen_kwargs                                                      │
│   158 │   │                                                                                      │
│ ❱ 159 │   │   return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix   │
│   160 │                                                                                          │
│   161 │   def predict(                                                                           │
│   162 │   │   self,                                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2993 in evaluate                 │
│                                                                                                  │
│   2990 │   │   start_time = time.time()                                                          │
│   2991 │   │                                                                                     │
│   2992 │   │   eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else se  │
│ ❱ 2993 │   │   output = eval_loop(                                                               │
│   2994 │   │   │   eval_dataloader,                                                              │
│   2995 │   │   │   description="Evaluation",                                                     │
│   2996 │   │   │   # No point gathering the predictions if there are no metrics, otherwise we d  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3281 in evaluation_loop          │
│                                                                                                  │
│   3278 │   │   │   │   │   EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=a  │
│   3279 │   │   │   │   )                                                                         │
│   3280 │   │   │   else:                                                                         │
│ ❱ 3281 │   │   │   │   metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, lab  │
│   3282 │   │   else:                                                                             │
│   3283 │   │   │   metrics = {}                                                                  │
│   3284                                                                                           │
│ in compute_metrics:29                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/metric.py:453 in compute                        │
│                                                                                                  │
│   450 │   │   │                                                                                  │
│   451 │   │   │   inputs = {input_name: self.data[input_name] for input_name in self.features}   │
│   452 │   │   │   with temp_seed(self.seed):                                                     │
│ ❱ 453 │   │   │   │   output = self._compute(**inputs, **compute_kwargs)                         │
│   454 │   │   │                                                                                  │
│   455 │   │   │   if self.buf_writer is not None:                                                │
│   456 │   │   │   │   self.buf_writer = None                                                     │
│                                                                                                  │
│ /root/.cache/huggingface/modules/datasets_modules/metrics/cer/46482e3826224451c26c9b51d8d193d38a │
│ 4226daab693df497d2e397b623274e/cer.py:149 in _compute                                            │
│                                                                                                  │
│   146 │   │   incorrect = 0                                                                      │
│   147 │   │   total = 0                                                                          │
│   148 │   │   for prediction, reference in zip(predictions, references):                         │
│ ❱ 149 │   │   │   measures = jiwer.compute_measures(                                             │
│   150 │   │   │   │   reference,                                                                 │
│   151 │   │   │   │   prediction,                                                                │
│   152 │   │   │   │   truth_transform=cer_transform,                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/jiwer/measures.py:306 in compute_measures                │
│                                                                                                  │
│   303 │   │   )                                                                                  │
│   304 │   )                                                                                      │
│   305 │                                                                                          │
│ ❱ 306 │   output = process_words(                                                                │
│   307 │   │   reference=truth,                                                                   │
│   308 │   │   hypothesis=hypothesis,                                                             │
│   309 │   │   reference_transform=truth_transform,                                               │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/jiwer/process.py:159 in process_words                    │
│                                                                                                  │
│   156 │   if isinstance(hypothesis, str):                                                        │
│   157 │   │   hypothesis = [hypothesis]                                                          │
│   158 │   if any(len(t) == 0 for t in reference):                                                │
│ ❱ 159 │   │   raise ValueError("one or more references are empty strings")                       │
│   160 │                                                                                          │
│   161 │   # pre-process reference and hypothesis by applying transforms                          │
│   162 │   ref_transformed = _apply_transform(                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: one or more references are empty strings

sgugger commented 1 year ago

cc @younesbelkada and @amyeroberts

amyeroberts commented 1 year ago

@VallabhMahajan1 Could you share a reproducible code snippet and information about the running environment (run transformers-cli env in the terminal and copy-paste the output)?

From the traceback, the issue seems to arise in the metric calculation when using Trainer.

I'm able to build and run a small example with the checkpoints you shared on the main branch:

from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
import requests
from PIL import Image
import torch

encoder_checkpoint = "google/vit-base-patch16-224"
decoder_checkpoint = "bert-base-multilingual-cased"

image_processor = AutoImageProcessor.from_pretrained(encoder_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_pretrained_model_name_or_path=encoder_checkpoint,
    decoder_pretrained_model_name_or_path=decoder_checkpoint,
)

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# training
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

pixel_values = image_processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)
loss = outputs.loss

VallabhMahajan1 commented 1 year ago

Thanks for the reply. I was trying to train a TrOCR model. Below are the environment information and the code snippet. I'm not sure, but I guess the error is raised in the compute_metrics function.

- `transformers` version: 4.28.0
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no

from datasets import load_metric
from transformers import (
    BertTokenizer,
    TrOCRProcessor,
    ViTFeatureExtractor,
    VisionEncoderDecoderModel,
)

# Tokenizer for the decoder and feature extractor for the ViT encoder
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224", "bert-base-multilingual-cased")

# Character error rate metric (backed by jiwer)
cer_metric = load_metric("cer")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(labels_ids, skip_special_tokens=True)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"cer": cer}

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    num_train_epochs=1, 
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True, 
    output_dir="./",
    logging_steps=2,
    save_strategy="no",
    eval_steps=100,
)

from transformers import default_data_collator

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
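
The traceback shows jiwer rejecting the whole batch as soon as one reference is an empty string. One possible workaround, sketched here under assumptions, is to drop such pairs before calling cer_metric.compute. The helper name drop_empty_references is hypothetical and not part of transformers, datasets, or jiwer. It also treats whitespace-only references as empty, which is stricter than the check visible in the traceback.

```python
from typing import List, Tuple


def drop_empty_references(
    predictions: List[str], references: List[str]
) -> Tuple[List[str], List[str], int]:
    """Remove (prediction, reference) pairs whose reference is empty or whitespace-only.

    Hypothetical helper for illustration; returns the kept predictions,
    the kept references, and the number of dropped pairs.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # Keep only pairs whose reference still has content after stripping whitespace.
    kept = [(p, r) for p, r in zip(predictions, references) if r.strip()]
    kept_predictions = [p for p, _ in kept]
    kept_references = [r for _, r in kept]
    dropped = len(references) - len(kept)
    return kept_predictions, kept_references, dropped


# Toy strings standing in for pred_str and label_str from compute_metrics.
preds, refs, dropped = drop_empty_references(["helo", "x", "abc"], ["hello", "", "  "])
print(preds, refs, dropped)
```

Inside compute_metrics, this would be called on pred_str and label_str just before cer_metric.compute. Filtering changes which samples the metric is computed on, so a non-zero dropped count is probably better treated as a sign that the labels or the tokenizer need checking than as a permanent fix.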

VallabhMahajan1 commented 1 year ago

cc @younesbelkada and @amyeroberts

ydshieh commented 1 year ago

@VallabhMahajan1 Thank you for providing a code snippet.

However, the code snippet is incomplete: train_dataset and eval_dataset are not defined. If you can't share these datasets, you can use a public dataset (for example, from the Hugging Face Hub) that is similar to your own. In any case, please use a small dataset (or a small slice of a large one).

Without a self-contained code snippet that reproduces the error, we are not able to help. Thank you.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

takipipo commented 4 months ago

@VallabhMahajan1 I got the same error with a different language model. I used kkatiz/thai-trocr-thaigov-v2 and found that the model does not support English uppercase characters, which is why some of the ground-truth strings become "" (i.e., empty strings).
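
A rough way to check for this before training is to round-trip every label through the tokenizer and list those that come back empty. The sketch below uses a toy character-level tokenizer (toy_encode, toy_decode, and find_empty_after_roundtrip are all hypothetical names) so that it runs without downloading a model. It assumes that unsupported characters map to an unknown token which is then stripped during decoding. Whether a given real tokenizer behaves this way needs to be verified.

```python
from typing import Callable, List

# Toy vocabulary: lowercase letters and space only (an assumption for illustration).
UNK_ID = 0  # hypothetical id of the unknown token
TOY_VOCAB = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
TOY_INVERSE = {i: ch for ch, i in TOY_VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Map each character to its id, or to UNK_ID if it is not in the vocabulary."""
    return [TOY_VOCAB.get(ch, UNK_ID) for ch in text]


def toy_decode(ids: List[int]) -> str:
    """Rebuild the text, skipping unknown tokens the way special tokens are skipped."""
    return "".join(TOY_INVERSE[i] for i in ids if i != UNK_ID)


def find_empty_after_roundtrip(
    texts: List[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[int]:
    """Return indices of non-empty texts that decode to an empty string."""
    return [i for i, t in enumerate(texts) if t and not decode(encode(t)).strip()]


labels = ["hello", "ABC", "Ab"]
bad = find_empty_after_roundtrip(labels, toy_encode, toy_decode)
print(bad)  # indices of labels that would become empty references
```

With a real tokenizer, encode and decode could be replaced by something like `lambda t: tokenizer(t).input_ids` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. Labels flagged this way would need a tokenizer that covers their characters, or normalization (for example lowercasing) if that is acceptable for the task.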

linglongxian commented 1 month ago

@takipipo Hi, I got the same error with a TrOCR model. Did you solve this problem?

huggingface/transformers, issue #23892