Hi @Aniruddha-JU

Right now run_summarization.py does not support fine-tuning mBART for summarization: the proper language tokens for mBART-50 need to be set. For now, you could easily adapt the script by setting the correct language tokens yourself, as is done in the translation example. The only difference here is that the source and target language will be the same (see the sketch below).

Also, could you please post the full stack trace? The error seems unrelated to mBART.
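A minimal sketch of that adaptation (not the official script), assuming Bengali (`bn_IN`) as both source and target language; the article and summary strings are placeholders:

```python
# Minimal sketch: configure the mBART-50 tokenizer for summarization,
# where the source and target language are the same (here Bengali, "bn_IN").
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50",
    src_lang="bn_IN",  # language of the input documents
    tgt_lang="bn_IN",  # same language, since this is summarization
)

article = "..."  # placeholder input document
summary = "..."  # placeholder reference summary

model_inputs = tokenizer(article, return_tensors="pt")
# Targets must be tokenized with the target-language special tokens.
with tokenizer.as_target_tokenizer():
    labels = tokenizer(summary, return_tensors="pt").input_ids

# At generation time, force the decoder to start with the target language code.
generated_ids = model.generate(
    **model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"]
)
```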
```
All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
  0%|          | 0/3 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "run_summarization.py", line 596, in
```
@patil-suraj For translation, the JSON format is not supported; the run crashes with a core dump.
```
    with self.tokenizer.as_target_tokenizer():
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 215, in as_target_tokenizer
    self.set_tgt_lang_special_tokens(self.tgt_lang)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 240, in set_tgt_lang_special_tokens
    prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
  File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
```
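Judging from this trace, `tgt_lang` was never set (the summarization script takes no language arguments), so the tokenizer builds its target prefix tokens from `None` and `convert_ids_to_tokens` fails on `int(None)`. A minimal sketch that should reproduce the failure, under that assumption:

```python
# Hypothetical reproduction: load the mBART-50 tokenizer without a tgt_lang
# and enter target-tokenization mode; the target language code resolves to
# None, so building the prefix tokens raises the TypeError above.
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
print(tokenizer.tgt_lang)  # None

with tokenizer.as_target_tokenizer():  # TypeError: int() argument ... 'NoneType'
    tokenizer("some target text")
```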
transformers version: 4.5.0
Who can help: @patil-suraj @LysandreJik
Models: mbart
I am running the run_summarization.py script with the command below:

```
python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train --do_eval --do_predict \
    --test_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --train_file /home/aniruddha/mbart/mbart_json/bentrain_mbart.json \
    --validation_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --text_column text --summary_column summary \
    --output_dir mbart50_bengali-summarization \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir true \
    --source_prefix "summarize: " \
    --predict_with_generate yes
```
My dataset is in the JSON format below (one object per line); I am doing this for the Bengali language:

```json
{"text": "I'm sitting here in a boring room. It's just another rainy Sunday afternoon. I'm wasting my time I got nothing to do. I'm hanging around I'm waiting for you. But nothing ever happens. And I wonder", "summary": "I'm sitting in a room where I'm waiting for something to happen"}
```
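For reference, a minimal sketch of producing that layout (the file name and example texts are placeholders):

```python
# Minimal sketch: write one {"text": ..., "summary": ...} JSON object per
# line, the layout the example scripts load via the "json" dataset loader.
import json

examples = [
    {
        "text": "I'm sitting here in a boring room. It's just another rainy Sunday afternoon.",
        "summary": "I'm sitting in a room where I'm waiting for something to happen",
    },
]

with open("bentrain_mbart.json", "w", encoding="utf-8") as f:
    for example in examples:
        # ensure_ascii=False keeps Bengali characters readable in the file.
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```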
File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens index = int(index) TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'