huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Run_summarization not working for mbart50 #11516

Closed Aniruddha-JU closed 3 years ago

Aniruddha-JU commented 3 years ago

Who can help

@patil-suraj @LysandreJik Models: mbart

I am running the run_summarization.py script with the command below:

```
python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train --do_eval --do_predict \
    --test_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --train_file /home/aniruddha/mbart/mbart_json/bentrain_mbart.json \
    --validation_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --text_column text \
    --summary_column summary \
    --output_dir mbart50_bengali-summarization \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir true \
    --source_prefix "summarize: " \
    --predict_with_generate yes
```

My dataset is in the JSON format below (I am doing it for the Bengali language):

```json
{"text": "I'm sitting here in a boring room. It's just another rainy Sunday afternoon. I'm wasting my time I got nothing to do. I'm hanging around I'm waiting for you. But nothing ever happens. And I wonder", "summary": "I'm sitting in a room where I'm waiting for something to happen"}
```

Error:

```
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
```

patil-suraj commented 3 years ago

Hi @Aniruddha-JU

Right now, run_summarization.py does not support fine-tuning mBART for summarization, because the proper language tokens need to be set for mBART-50. For now, you can easily adapt the script for mBART-50 by setting the correct language tokens, as is done in the translation example.

https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py#L340-L380

The difference here is that the source and target languages will be the same.
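To make this concrete, here is a minimal sketch of the adaptation described above (not the actual script code): resolve the mBART-50 language code once, then assign it to both the source and target side. The helper name and the subset of language codes are illustrative; `bn_IN` is the mBART-50 code for Bengali.

```python
# Illustrative subset of the mBART-50 language codes; the real
# tokenizer carries the full 50-language list.
MBART50_LANG_CODES = {"ar_AR", "bn_IN", "en_XX", "hi_IN"}

def summarization_lang_pair(lang_code):
    """For summarization, the source and target language are the same,
    so return the same code for both sides."""
    if lang_code not in MBART50_LANG_CODES:
        raise ValueError(f"not an mBART-50 language code: {lang_code}")
    return lang_code, lang_code

src_lang, tgt_lang = summarization_lang_pair("bn_IN")  # Bengali

# In the modified script this would then become something like:
# tokenizer.src_lang = src_lang
# tokenizer.tgt_lang = tgt_lang
```

With both attributes set, the tokenizer can build its language-specific prefix tokens instead of looking up `None`.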

Also, could you please post the full stack trace? The error seems unrelated to mBART.

Aniruddha-JU commented 3 years ago

All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50. If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.

```
  0%|          | 0/3 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "run_summarization.py", line 596, in <module>
    main()
  File "run_summarization.py", line 428, in main
    train_dataset = train_dataset.map(
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1474, in map
    return self._map_single(
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 174, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/datasets/fingerprint.py", line 340, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1798, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1706, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "run_summarization.py", line 409, in preprocess_function
    with tokenizer.as_target_tokenizer():
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 210, in as_target_tokenizer
    self.set_tgt_lang_special_tokens(self.tgt_lang)
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 235, in set_tgt_lang_special_tokens
    prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
  File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
```
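The trace shows `set_tgt_lang_special_tokens(self.tgt_lang)` running with `tgt_lang` still unset, so the prefix-token lookup yields `None` and `int(None)` fails. The failure mode can be reproduced without transformers with a small stand-in; the dictionary and variable names below are illustrative, not the tokenizer's actual internals.

```python
# Stand-in for the tokenizer's language-code mapping (illustrative
# subset; the real mapping covers all 50 mBART-50 languages).
lang_code_to_id = {"bn_IN": 250003}

tgt_lang = None  # run_summarization.py never sets tokenizer.tgt_lang
prefix_token_id = lang_code_to_id.get(tgt_lang)  # looks up None -> None

try:
    int(prefix_token_id)  # what convert_ids_to_tokens does with each id
    error = None
except TypeError as exc:
    error = str(exc)  # matches the TypeError in the traceback above
```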

Aniruddha-JU commented 3 years ago

@patil-suraj

Aniruddha-JU commented 3 years ago

The translation script does not support my JSON format; it crashes with a core dump.
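For reference, run_translation.py reads JSON Lines input where each line is an object with a `translation` dict keyed by language code (as in the WMT datasets the script targets), rather than flat `text`/`summary` fields. The file layout below is a hedged sketch; the sentences and the choice of `en`/`bn` keys are made up for illustration.

```python
import json

# Each row becomes one line in the .json file passed via --train_file.
rows = [
    {"translation": {"en": "Hello.", "bn": "নমস্কার।"}},
    {"translation": {"en": "Good morning.", "bn": "সুপ্রভাত।"}},
]

# JSON Lines: one self-contained JSON object per line, no outer array.
jsonl = "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

# Every line must parse on its own:
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

A file that is one big JSON array (or that mixes other top-level fields) will not load the same way, which is one common cause of the loader failing on custom data.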

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rahul765 commented 2 years ago

```
    with self.tokenizer.as_target_tokenizer():
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 215, in as_target_tokenizer
    self.set_tgt_lang_special_tokens(self.tgt_lang)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 240, in set_tgt_lang_special_tokens
    prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
```