huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Issues finetuning MBART 50 many to many #10835

Closed · tuhinjubcse closed this issue 3 years ago

tuhinjubcse commented 3 years ago

I am trying to finetune MBART50-many-to-many

python ./transformers/examples/seq2seq/run_translation.py \
    --model_name_or_path facebook/mbart-large-50-many-to-many-mmt \
    --do_train \
    --do_eval \
    --source_lang ru_RU \
    --target_lang en_XX \
    --train_file ./corpus_v2/train.json \
    --validation_file ./corpus_v2/valid.json \
    --output_dir /local/nlpswordfish/tuhin/mbart50/tst-translation \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 51373 \
    --max_val_samples 6424 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 8 \
    --save_strategy epoch \
    --evaluation_strategy epoch

Even though I explicitly pass the source language as ru_RU and the target language as en_XX, I get an error; see my log below. I tried printing the source and target languages:


 Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN', 'af_ZA', 'az_AZ', 'bn_IN', 'fa_IR', 'he_IL', 'hr_HR', 'id_ID', 'ka_GE', 'km_KH', 'mk_MK', 'ml_IN', 'mn_MN', 'mr_IN', 'pl_PL', 'ps_AF', 'pt_XX', 'sv_SE', 'sw_KE', 'ta_IN', 'te_IN', 'th_TH', 'tl_XX', 'uk_UA', 'ur_PK', 'xh_ZA', 'gl_ES', 'sl_SI'] to the additional_special_tokens key of the tokenizer
 Src lang is  en_XX
 ids [250004]
 ids [2]
 loading weights file https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt/resolve/main/pytorch_model.bin from cache at /home/tuhin.chakr/.cache/huggingface/transformers/e33fcda1a71396b8475e16e2fe1458cfa62c6013f8cb3787d6aa4364ec5251c6.d802a5ca7720894045dd2c9dcee6069d27aa92fbbe33f52b44d479538dc3ccc3
 All model checkpoint weights were used when initializing MBartForConditionalGeneration.

 All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50-many-to-many-mmt.
 If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
 Tgt lang is  None
 self.prefix_tokens is [None]
 ids [None]
 Traceback (most recent call last):
   File "./transformers/examples/seq2seq/run_translation.py", line 564, in <module
     main()
   File "./transformers/examples/seq2seq/run_translation.py", line 403, in main
     train_dataset = train_dataset.map(
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
     update_data = does_function_return_dict(test_inputs, test_indices)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
     function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
   File "./transformers/examples/seq2seq/run_translation.py", line 384, in preprocess_function
     with tokenizer.as_target_tokenizer():
   File "/home/tuhin.chakr/yes/lib/python3.8/contextlib.py", line 113, in __enter__
     return next(self.gen)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 242, in as_target_tokenizer
     self.set_tgt_lang_special_tokens(self.tgt_lang)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 269, in set_tgt_lang_special_tokens
     prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 287, in convert_ids_to_tokens
     index = int(index)
 TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Also, as far as I understand, fine-tuning the many-to-many model requires some separate processing based on the paper, which seems to be missing?


What should the data format be? Additionally, will you release a many-to-one model as well, even though many-to-one is a subset of many-to-many?
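
For context, I am assuming the data files should be JSON lines where each record carries a translation dict keyed by the bare language codes (my guess from the example script, not confirmed), roughly:

{"translation": {"ru": "<russian sentence>", "en": "<english translation>"}}
{"translation": {"ru": "<another russian sentence>", "en": "<its english translation>"}}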

@patrickvonplaten, @patil-suraj

tuhinjubcse commented 3 years ago

@patil-suraj any help is appreciated

tuhinjubcse commented 3 years ago

@patrickvonplaten

tuhinjubcse commented 3 years ago

Can anyone look into this @patil-suraj @patrickvonplaten

patil-suraj commented 3 years ago

Hi @tuhinjubcse , sorry to reply only now, I've been a bit busy with the sprint and other projects so couldn't really allocate any time for this. I will get back to you by tomorrow.

Also, please don't tag people who are not related to this model; it might disturb them unnecessarily.

Thank you for your patience.

tuhinjubcse commented 3 years ago

Thank you. It would be good to know how to fine-tune a many-to-many model with more than one language pair in train and validation, like fairseq multilingual:

https://github.com/pytorch/fairseq/tree/master/examples/multilingual

patil-suraj commented 3 years ago

Okay, one issue at a time.

I'm taking a look at the error that you posted above.

Also, the many-to-one model was not released when we ported this model to Transformers; it seems to have been released recently. I will convert and push it by tomorrow.

And regarding multilingual fine-tuning, I will try to write a notebook about it. What we need to do here is: say we are fine-tuning on two language pairs; in that case we need to concatenate the two datasets, or, if the two language pairs don't have the same number of examples, add some sort of sampler that draws examples from each dataset in proportion to its size. And when processing each language pair, set the appropriate src_lang and tgt_lang tokens. The processing part is explained in the docs; a rough sketch is below.
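
A minimal sketch of that idea with the 🤗 Datasets library, assuming two hypothetical JSON-lines files in the run_translation.py format (file names, column keys, and max_length are assumptions, not the final notebook):

from datasets import load_dataset, concatenate_datasets
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

def make_preprocess_fn(src_lang, tgt_lang, src_key, tgt_key):
    # Builds a map function that tokenizes one language pair with the right language codes.
    def preprocess(examples):
        tokenizer.src_lang = src_lang
        tokenizer.tgt_lang = tgt_lang
        inputs = [ex[src_key] for ex in examples["translation"]]
        targets = [ex[tgt_key] for ex in examples["translation"]]
        model_inputs = tokenizer(inputs, max_length=128, truncation=True)
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    return preprocess

# Hypothetical files: one ru->en set and one es->en set.
ru_en = load_dataset("json", data_files="train.ru_en.json")["train"]
es_en = load_dataset("json", data_files="train.es_en.json")["train"]

ru_en = ru_en.map(make_preprocess_fn("ru_RU", "en_XX", "ru", "en"), batched=True, remove_columns=["translation"])
es_en = es_en.map(make_preprocess_fn("es_XX", "en_XX", "es", "en"), batched=True, remove_columns=["translation"])

# Concatenate (or sample proportionally if the pairs differ a lot in size) and shuffle.
train_dataset = concatenate_datasets([ru_en, es_en]).shuffle(seed=42)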

tuhinjubcse commented 3 years ago

It would be really helpful if you could put together a notebook, or even a README, that documents how to do that, just so that it's clear.

tuhinjubcse commented 3 years ago

Thanks so much for your response; looking forward to using it.

patil-suraj commented 3 years ago

The many-to-one checkpoint is now available on the Hub: https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt

tuhinjubcse commented 3 years ago

Thanks for releasing this. Looking forward to the instructions for many-to-one fine-tuning, as that is what this model will be super useful for.

tuhinjubcse commented 3 years ago

Any updates on how to run many-to-one? Can we pass --source_lang ru_RU,es_XX as a ','-separated string? Sorry, I am not sure if that support is available yet. It would be really helpful if you could help here. The EMNLP arXiv deadline is super close, on 17th April :) I know you are busy, but this would be a huge favor.

patil-suraj commented 3 years ago

Multilingual fine-tuning won't be included in the example script; the goal of the examples is to keep them simple and let users extend them for custom training. I'm working on the notebook, but can probably share it on Monday.

As I said in the comment above, for multilingual fine-tuning in the simplest case you would just need to process the two datasets by setting the correct src_lang and tgt_lang tokens; the rest of the training is similar to traditional fine-tuning.

Feel free to post the question on the forum as well, someone there might have better ideas for this.

tuhinjubcse commented 3 years ago

Thank you so much. If you post the notebook here by Monday, that would solve my problem. I am trying to do it on my own as well.

patil-suraj commented 3 years ago

Hi @tuhinjubcse

We just merged #11170, which now allows fine-tuning mBART-50 on a single language pair using the run_translation.py script. This should resolve the issue that you posted in the first comment.
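
For anyone else hitting the original TypeError: the crux is that the tokenizer's tgt_lang was never set, so as_target_tokenizer() tried to convert None to a token id. A minimal standalone sketch of the intended setup (the sentence strings are placeholders):

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="ru_RU",  # must be set explicitly
    tgt_lang="en_XX",  # if left as None, as_target_tokenizer() fails as in the traceback above
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

inputs = tokenizer("<russian sentence>", return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer("<english translation>", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # standard seq2seq training loss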

tuhinjubcse commented 3 years ago

Thanks so much

tuhinjubcse commented 3 years ago

Suraj, I got multilingual training to work; however, while decoding I get this error. My added-token dictionary is:

{"uk_UA": 250049, "mk_MK": 250036, "mn_MN": 250038, "id_ID": 250033, "he_IL": 250031, "sl_SI": 250053, "pt_XX": 250042, "hr_HR": 250032, "th_TH": 250047, "tl_XX": 250048, "pl_PL": 250040, "ka_GE": 250034, "ta_IN": 250045, "km_KH": 250035, "te_IN": 250046, "xh_ZA": 250051, "sv_SE": 250043, "sw_KE": 250044, "ps_AF": 250041, "bn_IN": 250029, "ml_IN": 250037, "az_AZ": 250027, "af_ZA": 250028, "gl_ES": 250052, "ur_PK": 250050, "mr_IN": 250039, "fa_IR": 250030}

File "translate.py", line 26, in tokenizer = MBart50Tokenizer.from_pretrained(path) File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1704, in from_pretrained return cls._from_pretrained( File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1810, in _from_pretrained assert index == len(tokenizer), ( AssertionError: Non-consecutive added token 'bn_IN' found. Should have index 250054 but has index 250029 in saved vocabulary.

The error comes from MBart50Tokenizer:

model = MBartForConditionalGeneration.from_pretrained(path)
model.eval()
model.to('cuda')
tokenizer = MBart50Tokenizer.from_pretrained(path)

It works fine with MBartTokenizer

I can use MBartTokenizer for the languages common to mBART-25 and mBART-50 with my many-to-one model, but for languages like pt_XX I can't.

patil-suraj commented 3 years ago

HI @tuhinjubcse

Glad you got it working.

And this seems like a bug, I will take a look. How many new tokens did you add?

patil-suraj commented 3 years ago

I tried adding tokens using the add_tokens and add_special_tokens methods, saved the tokenizer, and loaded it again, and I didn't observe this issue.

Here's what I did

from transformers import MBart50Tokenizer

tok = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")

# Add one token via add_tokens and one via additional_special_tokens.
tok.add_tokens(["MY_XX"], special_tokens=True)
tok.add_special_tokens({"additional_special_tokens": ["MY2_XX"]})

tok.save_pretrained("./tmp")

tok = MBart50Tokenizer.from_pretrained("./tmp")
tok.convert_tokens_to_ids("MY_XX")   # 250054
tok.convert_tokens_to_ids("MY2_XX")  # 250055

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patil-suraj commented 3 years ago

Unstale

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.