Closed: tuhinjubcse closed this issue 3 years ago
@patil-suraj any help is appreciated
@patrickvonplaten
Can anyone look into this, @patil-suraj @patrickvonplaten?
Hi @tuhinjubcse, sorry to reply only now; I've been a bit busy with the sprint and other projects, so I couldn't really allocate any time for this. I will get back to you by tomorrow.
Also please don't tag people who are not related to this model, it might disturb them unnecessarily.
Thank you for your patience.
Thank you, it would be good to know how to fine-tune a many-to-many model with more than one language pair in train and validation, like fairseq multilingual:
https://github.com/pytorch/fairseq/tree/master/examples/multilingual
Okay, one issue at a time
I'm taking a look at the error that you posted above.
Also, the many-to-one model was not released when we ported this model to Transformers; it seems to have been released recently. I will convert and push it by tomorrow.
And regarding multilingual fine-tuning, I will try to write a notebook about it. What we need to do here is, say we are fine-tuning on two language pairs: in that case we need to concatenate the two datasets, or, if the two language pairs don't have the same number of examples, add some sort of sampler that draws examples from each dataset in proportion to its size. And when processing each language pair, set the appropriate src_lang and tgt_lang tokens. The processing part is explained in the docs.
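A minimal sketch of that per-pair processing, assuming MBart50TokenizerFast as described in the docs; the example sentences are placeholders, and on newer transformers versions text_target= can be used instead of the as_target_tokenizer() context:

from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="ru_RU", tgt_lang="en_XX"
)
# The source is encoded normally; the target is encoded inside the
# target-tokenizer context so the labels carry the target-language token.
batch = tokenizer("пример предложения", return_tensors="pt")  # placeholder Russian text
with tokenizer.as_target_tokenizer():
    batch["labels"] = tokenizer("an example sentence", return_tensors="pt").input_ids
# For the second language pair, set tokenizer.src_lang = "es_XX" (and tgt_lang
# if it differs) and encode that pair's examples the same way.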
It would be really helpful if you could put together a notebook that documents how to do that, or even a README, just so that it's clear.
Thanks so much for your response, and looking forward to using it.
The many-to-one checkpoint is now available on the Hub: https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt
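A minimal inference sketch with that checkpoint, following the usage shown on the model card (the Hindi input is just example text):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")

# The many-to-one model always translates into English, so only the source
# language code needs to be set before encoding.
tokenizer.src_lang = "hi_IN"
encoded = tokenizer("संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है", return_tensors="pt")
generated = model.generate(**encoded)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))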
Thanks for releasing this. Looking forward to the instructions for many-to-one fine-tuning, as that is what this model will be super useful for.
Any updates on how to run many-to-one? Can we pass --source_lang ru_RU,es_XX as a ','-separated string? Sorry, I am not sure if that support is available yet. It would be really helpful if you could help here. The EMNLP arXiv deadline is super close, on 17th April :) I know you are busy, but this would be a huge favor.
Multilingual fine-tuning won't be included in the example script; the goal of the examples is to keep them simple and let the user extend them for custom training. I'm working on the notebook, but can probably share that on Monday.
As I said in the above comment, for multilingual fine-tuning, in the simplest case you would just need to process the two datasets by setting the correct src_lang and tgt_lang tokens; the rest of the training will be similar to traditional fine-tuning.
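A rough end-to-end sketch of that simplest case, assuming the datasets library is used; the toy data, the "src"/"tgt" column names, and the proportional sampling below are illustrative placeholders rather than a definitive recipe:

from datasets import Dataset, concatenate_datasets, interleave_datasets
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

def preprocess(pair_dataset, src_lang, tgt_lang):
    # Set this pair's language codes before tokenizing, as described above.
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    def tokenize(example):
        model_inputs = tokenizer(example["src"], truncation=True, max_length=128)
        with tokenizer.as_target_tokenizer():
            model_inputs["labels"] = tokenizer(example["tgt"], truncation=True, max_length=128)["input_ids"]
        return model_inputs
    return pair_dataset.map(tokenize, remove_columns=pair_dataset.column_names)

# Toy stand-ins for the two language-pair datasets.
ru_en = preprocess(Dataset.from_dict({"src": ["пример"], "tgt": ["example"]}), "ru_RU", "en_XX")
es_en = preprocess(Dataset.from_dict({"src": ["ejemplo"], "tgt": ["example"]}), "es_XX", "en_XX")

# Simplest case: concatenate the processed pairs into one training set.
train_dataset = concatenate_datasets([ru_en, es_en])

# If the pairs are unbalanced, sample from them instead, e.g. in proportion
# to their sizes (other sampling schemes are possible).
sizes = [len(ru_en), len(es_en)]
train_dataset = interleave_datasets([ru_en, es_en], probabilities=[s / sum(sizes) for s in sizes])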
Feel free to post the question on the forum as well; someone there might have better ideas for this.
Thank you so much. If you post the notebook here by Monday, that would solve my problem. I am trying to do it on my own as well.
Hi @tuhinjubcse
We just merged #11170, which now allows fine-tuning mBART-50 on a single language pair using the run_translation.py script. This should resolve the issue that you posted in the first comment.
Thanks so much
Suraj, I got multilingual training to work; however, while decoding I get this error. My added-token dictionary is:
{"uk_UA": 250049, "mk_MK": 250036, "mn_MN": 250038, "id_ID": 250033, "he_IL": 250031, "sl_SI": 250053, "pt_XX": 250042, "hr_HR": 250032, "th_TH": 250047, "tl_XX": 250048, "pl_PL": 250040, "ka_GE": 250034, "ta_IN": 250045, "km_KH": 250035, "te_IN": 250046, "xh_ZA": 250051, "sv_SE": 250043, "sw_KE": 250044, "ps_AF": 250041, "bn_IN": 250029, "ml_IN": 250037, "az_AZ": 250027, "af_ZA": 250028, "gl_ES": 250052, "ur_PK": 250050, "mr_IN": 250039, "fa_IR": 250030}
File "translate.py", line 26, in
The error comes from MBart50Tokenizer:
model = MBartForConditionalGeneration.from_pretrained(path)
model.eval()
model.to('cuda')
tokenizer = MBart50Tokenizer.from_pretrained(path)
It works fine with MBartTokenizer
I can use MBartTokenizer for the languages common to mBART-25 and mBART-50 for my many-to-one model, but for languages like pt_XX I can't.
Hi @tuhinjubcse
Glad you got it working.
And this seems like a bug; I will take a look. How many new tokens did you add?
I tried adding tokens using the add_tokens and add_special_tokens methods, saved and reloaded the tokenizer, and didn't observe this issue.
Here's what I did:
from transformers import MBart50Tokenizer

tok = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")
# Register one new language code as an added special token via add_tokens ...
tok.add_tokens(["MY_XX"], special_tokens=True)
# ... and another via the additional_special_tokens entry of add_special_tokens.
tok.add_special_tokens({"additional_special_tokens": ["MY2_XX"]})
tok.save_pretrained("./tmp")
# Reload the saved tokenizer and check that the new tokens kept their ids.
tok = MBart50Tokenizer.from_pretrained("./tmp")
tok.convert_tokens_to_ids("MY_XX")   # 250054
tok.convert_tokens_to_ids("MY2_XX")  # 250055
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
transformers version: latest
I am trying to fine-tune mBART-50 many-to-many.
Even though I explicitly pass the source language as ru_RU and the target as en_XX, I get an error; see my log. I tried printing the source and target languages.
Also, as far as I understand, many-to-many fine-tuning requires some separate processing based on the paper, which seems to be missing?
What should the data format be? Additionally, will you release a many-to-one model as well, even though many-to-one is a subset of many-to-many?
@patrickvonplaten, @patil-suraj