huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: Empty or `None` reference sentence found. #16471

Closed Gare-Ng closed 2 years ago

Gare-Ng commented 2 years ago

Environment info

Who can help

@patrickvonplaten , @patil-suraj

Information

Model I am using (Bert, XLNet ...): Marian

The problem arises when using:

The tasks I am working on is:

And here is an example of my dataset:

{"translation": {"Cls": "谁司票拟?", "Mdn": "谁起草的这个命令?"}}
{"translation": {"Cls": "百司章奏,置急足驰白乃下。", "Mdn": "百官的奏章,要用快马才能赶上。"}}

I organized the data as described in the README. Although the README says it should be a jsonl file, I found that a jsonl file won't work, while simply renaming it to .json does. It worked well for several thousand steps.
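For what it's worth, a minimal sketch of how I understand the file is parsed: despite the .json extension, each line is an independent JSON object (the JSON Lines format), so the whole file is never loaded as a single JSON document. The `Cls`/`Mdn` field names are the ones from my dataset:

```python
import json

# Two records in JSON Lines form: one JSON object per line.
sample = (
    '{"translation": {"Cls": "谁司票拟?", "Mdn": "谁起草的这个命令?"}}\n'
    '{"translation": {"Cls": "百司章奏,置急足驰白乃下。", "Mdn": "百官的奏章,要用快马才能赶上。"}}\n'
)

# Parse line by line, skipping blank lines.
examples = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(examples))  # → 2
print(sorted(examples[0]["translation"].keys()))  # → ['Cls', 'Mdn']
```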

To reproduce

Steps to reproduce the behavior:

  1. run this command python examples/pytorch/translation/run_translation.py --model_name_or_path "Helsinki-NLP/opus-mt-zh-en" --do_train --do_eval --source_lang Cls --target_lang Mdn --source_prefix "translate Classical to Modern: " --train_file examples\pytorch\translation\train.json --validation_file examples\pytorch\translation\dev.json --test_file examples\pytorch\translation\test.json --output_dir D:/Gare-translation/Helsinki-NLP --per_device_train_batch_size=32 --per_device_eval_batch_size=16 --overwrite_output_dir --predict_with_generate --num_train_epochs=200 --save_total_limit=200 --save_steps=10000 --load_best_model_at_end True --evaluation_strategy "steps"
  2. finish first training epoch and run first evaluation
  3. pops up error

[INFO|trainer.py:2412] 2022-03-29 15:36:18,174 >> ***** Running Evaluation *****
[INFO|trainer.py:2414] 2022-03-29 15:36:18,174 >>   Num examples = 8438
[INFO|trainer.py:2417] 2022-03-29 15:36:18,174 >>   Batch size = 4
Traceback (most recent call last):
  File "examples/pytorch/translation/run_translation.py", line 624, in <module>
    main()
  File "examples/pytorch/translation/run_translation.py", line 541, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\transformers\trainer.py", line 1493, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\transformers\trainer.py", line 1620, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\transformers\trainer_seq2seq.py", line 70, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\transformers\trainer.py", line 2287, in evaluate
    metric_key_prefix=metric_key_prefix,
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\transformers\trainer.py", line 2528, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "examples/pytorch/translation/run_translation.py", line 515, in compute_metrics
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\datasets\metric.py", line 430, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "C:\Users\Gare\.cache\huggingface\modules\datasets_modules\metrics\sacrebleu\daba8f731596c6a1a68d61f20220697f68c420a55e2096b4eea8e3ffdc406d96\sacrebleu.py", line 130, in _compute
    **(dict(tokenize=tokenize) if tokenize else {}),
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\sacrebleu\compat.py", line 35, in corpus_bleu
    return metric.corpus_score(hypotheses, references)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\sacrebleu\metrics\base.py", line 421, in corpus_score
    stats = self._extract_corpus_statistics(hypotheses, references)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\sacrebleu\metrics\base.py", line 366, in _extract_corpus_statistics
    ref_cache = self._cache_references(references)
  File "C:\ProgramData\Anaconda3\envs\gare\lib\site-packages\sacrebleu\metrics\base.py", line 333, in _cache_references
    raise RuntimeError("Empty or None reference sentence found.")
RuntimeError: Empty or None reference sentence found.

Expected behavior

Any advice that helps with it would be appreciated.

patil-suraj commented 2 years ago

Hi @Gare-Ng ,

I organized the data as described in the README. Although the README says it should be a jsonl file, I found that a jsonl file won't work, while simply renaming it to .json does.

The file format is expected to be JSON Lines (jsonl), but the scripts expect the file extension to be .json.

I think the reason for this error is that there seems to be an example which has an empty string as the translation/target. The example scripts are kept simple and easy to adapt, so they don't do any extra pre-processing to detect and remove empty examples. You should inspect the dataset, remove such problematic examples, and re-run the script.
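For example, here is a minimal sketch of that filtering step. The `Cls`/`Mdn` field names are taken from your dataset, and this is just one way to do it; note it also drops whitespace-only strings, which a plain length check would miss:

```python
import json

def clean_jsonl(lines):
    """Keep only examples whose source and target are both non-empty."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the file
        example = json.loads(line)
        pair = example["translation"]
        # Drop the example if either side is missing, empty, or whitespace-only.
        if all(pair.get(key, "").strip() for key in ("Cls", "Mdn")):
            kept.append(example)
    return kept

raw = [
    '{"translation": {"Cls": "谁司票拟?", "Mdn": "谁起草的这个命令?"}}',
    '{"translation": {"Cls": "", "Mdn": "empty source"}}',
    '{"translation": {"Cls": "whitespace-only target", "Mdn": "   "}}',
]
print(len(clean_jsonl(raw)))  # → 1
```

You could then write the kept examples back out with one `json.dumps` per line and point `--train_file` / `--validation_file` at the cleaned files.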

Gare-Ng commented 2 years ago

Thank you @patil-suraj. I did write a small script to check whether there are any such empty strings.

which has an empty string as the translation/target.

But it turns out there are no such problematic examples in my train, dev, and test files. I trained this model this morning without modifying the json files and it worked very well. I vaguely remember this error first popping up when I decided to switch to a different model and retrain. At first I thought it was related to the new model, but the error remained when I switched back. I still get this error now, and I'm certain I didn't change the json files. My little script is here just in case:

import jsonlines

# Report the line numbers of any example with an empty source or target.
jsonl_name = r'C:\Users\Gare\PycharmProjects\Gare\transformers\examples\pytorch\translation\test.json'
errors = []
with jsonlines.open(jsonl_name, "r") as f:
    for line_no, example in enumerate(f, start=1):
        translation = example['translation']
        if len(translation['Mdn']) == 0 or len(translation['Cls']) == 0:
            errors.append(line_no)
print('No Error' if not errors else errors)
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.