huggingface / transformers


Translation takes too long - from fine-tuned mbart-large-50 model #13458

Closed aloka2209 closed 2 years ago

aloka2209 commented 3 years ago

I have fine-tuned facebook/mbart-large-50 for the Si-En language pair. When I try to translate 1950 sentences either (1) as one full batch or (2) with batch size = 16, the process still crashes.

I then passed 16 lines per batch (i.e. as src_lines), and it takes considerable time.

Could you help me reduce the translation time? My code is as follows.

Highly appreciate your help.

However, with the fairseq fine-tuned checkpoint the entire file can be translated in 2 minutes on the same machine.

tokenizer = MBart50TokenizerFast.from_pretrained("mbart50-ft-si-en-run4", src_lang="si_LK", tgt_lang="en_XX")
model_inputs = tokenizer(src_lines, padding=True, truncation=True, max_length=100, return_tensors="pt")

generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

trans_lines = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)  # crashes

LysandreJik commented 3 years ago

Hello! Could you provide a reproducible code example so that we may take a look?

aloka2209 commented 3 years ago

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# mbart50-ft-si-en-run4 is the fine-tuned model directory

tokenizer = MBart50TokenizerFast.from_pretrained("mbart50-ft-si-en-run4", src_lang="si_LK", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("mbart50-ft-si-en-run4/checkpoint-21500")

# The file has 1950 lines

src_lines=[line.strip() for line in open('data/parallel-27.04.2021-tu.un.si-en-ta.si', 'r', encoding='utf8')]

model_inputs = tokenizer(src_lines, padding=True, truncation=True, max_length=100, return_tensors="pt")
generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

trans_lines=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

Please refer to the code I am trying above. It works when a subset of about 10 examples is given, i.e. len(src_lines) = 10. I am having trouble translating the full batch.

patil-suraj commented 3 years ago

Hi @aloka2209

Thank you for opening the issue. Could you please provide more details?

aloka2209 commented 3 years ago

Hi @patil-suraj

Thank you for your attention regarding this issue.

Using the same training, tuning, and testing sets, I have fine-tuned the mbart-large-50 pre-trained model for Si-En with both fairseq and Hugging Face.

Fairseq fine-tuning

The preprocessing, fine-tuning, and generation parameter settings follow the example on GitHub.

I used the GPU for fine-tuning, with fp16 and max-tokens 1024 as parameters. The generation command is as follows:

cat data-spm/test_si_en.bpe.si_LK \
  | fairseq-interactive $path_2_data \
    --path $model \
    --task translation_multi_simple_epoch \
    --lang-dict $lang_list \
    --lang-pairs $lang_pairs \
    --source-lang $source_lang \
    --target-lang $target_lang \
    --batch-size 32 \
    --remove-bpe 'sentencepiece' \
    --buffer-size 32 \
    --encoder-langtok 'src' \
    --decoder-langtok \
    'data/test_si_en.stdout.si_LK_en_XX'

To translate the 1950 Si lines, the following time was taken: GPU - 3 minutes.

Huggingface fine-tuning

From the model hub I obtained the facebook/mbart-large-50 model and used the following code to fine-tune it for Si-En. The same training, validation, and testing sets as above were used.

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50") tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="si_LK", tgt_lang="en_XX")

# initialized model parameters

args = Seq2SeqTrainingArguments(
    output_dir='mbart50-ft-si-en-run4',
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    save_total_limit=12,
    num_train_epochs=120,
    predict_with_generate=True,
    save_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=valid_data,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()
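
For completeness, the Trainer call above refers to a data_collator, train_data, valid_data, and compute_metrics that are not shown. A minimal sketch of how the collator and tokenized datasets are commonly set up is below; the column names ("si", "en") and the raw dataset objects are hypothetical placeholders, not taken from my actual script.

from transformers import DataCollatorForSeq2Seq

# Pads inputs and labels per batch for seq2seq training
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def preprocess(examples):
    # Tokenize the Sinhala source and English target sides
    # (column names "si" and "en" are placeholders)
    model_inputs = tokenizer(examples["si"], max_length=100, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["en"], max_length=100, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_data = raw_train.map(preprocess, batched=True)
# valid_data = raw_valid.map(preprocess, batched=True)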

There are a few things I need to confirm about the fine-tuning.

My generation command is as follows.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# mbart50-ft-si-en-run4 is the fine-tuned model directory

tokenizer = MBart50TokenizerFast.from_pretrained("mbart50-ft-si-en-run4", src_lang="si_LK", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("mbart50-ft-si-en-run4/checkpoint-21500")

model_inputs = tokenizer(src_lines, padding=True, truncation=True, max_length=100, return_tensors="pt")
generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

trans_lines=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

I have tried to pass the entire batch as above; the task then crashes.

Afterwards I tried trans_lines = tokenizer.batch_decode(generated_tokens, batch_size=32, skip_special_tokens=True), but this takes a long time, so I stopped it.

I would highly appreciate your help with improving the fine-tuning and/or generation. Many thanks.

patil-suraj commented 3 years ago

Hi,

Please report one issue at a time. Since the original issue is about generation, let's talk about that :)

Also, it would be nice if you could format the code using markdown code syntax; otherwise the code is hard to read.

Looking at the generation code snippet, it seems that the model is not on GPU, which could explain the slowdown. To put the model on GPU:

model = MBartForConditionalGeneration.from_pretrained("mbart50-ft-si-en-run4/checkpoint-21500").to("cuda")

Also, you are passing all ~2000 examples at once to the tokenizer and the model, which could again explain why it is slow. If you look at the fairseq command, it accepts a batch-size argument and does generation one batch at a time.

If you pass all 2000 examples at once it might OOM on the GPU. You could instead create a Dataset and a DataLoader and do generation for each batch.
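
A minimal sketch of that approach, assuming the tokenizer, model, and src_lines from the snippets above and an illustrative batch size of 32:

import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# The default collate function turns a batch of strings into a list of strings
loader = DataLoader(src_lines, batch_size=32, shuffle=False)

trans_lines = []
for batch in loader:
    inputs = tokenizer(batch, padding=True, truncation=True, max_length=100, return_tensors="pt").to(device)
    with torch.no_grad():
        generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    trans_lines.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))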

Hope this helps :)

aloka2209 commented 3 years ago

I have added the fine-tuning related questions to the forum and would appreciate your answers there. Sorry for putting everything here.

I have used the GPU as you suggested and now the translation is fast. Thank you!

One of my intentions is to run the translation in a CPU-only setting. So earlier I tried feeding the inputs in 16-line batches, and then changing the batch_size parameter during generation. Still, it was taking a long time.

i.e.

trans_lines=tokenizer.batch_decode(generated_tokens, batch_size=32, skip_special_tokens=True) 
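
By feeding the inputs in 16-line batches I mean roughly the following (a sketch of my attempt on CPU; since batch_decode only converts the already-generated token ids back to text, the time is dominated by the model.generate calls):

import torch

trans_lines = []
for i in range(0, len(src_lines), 16):
    # Tokenize and translate 16 source lines at a time on CPU
    batch = src_lines[i:i + 16]
    inputs = tokenizer(batch, padding=True, truncation=True, max_length=100, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    trans_lines.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))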

Are there any suggestions for how I can further cut down the translation time on CPU?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.