microsoft / fastseq

An efficient implementation of popular sequence models for text generation, summarization, and translation tasks. https://arxiv.org/pdf/2106.04718.pdf
MIT License

NMT models speed up abnormally relative to batch size #106

Open dearchill opened 3 years ago

dearchill commented 3 years ago

Hi, thanks for the great work. I just ran fairseq-generate on my test set (ZH-EN translation) with both FastSeq and Fairseq, and the speedup is quite abnormal compared with the numbers in the example link. My test set has 1,526 sentences of 5~150 Chinese characters each, and my experiments run on an NVIDIA Tesla T4. The translation model is the base transformer architecture in fairseq with 30 encoder layers. I tested with the following commands (beam size is the default 5, lenpen is 1, and I did not pass `--no-repeat-ngram-size` to fastseq):

for fairseq: `fairseq-generate ../data-bin --path model_avg.pt --remove-bpe --batch-size 128`

for fastseq: `fastseq-generate-for-fairseq ../data-bin --path model_avg.pt --remove-bpe --batch-size 128 --postprocess-workers 5`

My test results are as follows:

| Batch size | not assigned | 128 | 10 | 5 | 1 |
| --- | --- | --- | --- | --- | --- |
| fairseq-0.10.2 | 65.79 sentences/s | 63.18 sentences/s | 19.06 sentences/s | 11.79 sentences/s | 3.06 sentences/s |
| above + fastseq | 75.55 sentences/s | 74.28 sentences/s | 17.38 sentences/s | 11.47 sentences/s | 2.92 sentences/s |

I found that when the batch size is large (such as 128 and above), fastseq shows an obvious speedup (though not as much as 2x or more), but when the batch size is small (I tested this because the model will be deployed with small batches in our actual setup), fastseq shows no speedup at all, and is even slower. This phenomenon seems quite abnormal to me, so I am asking for your help. Looking forward to your reply.

yuyan2do commented 3 years ago

Hi dearchill, thanks for your question. If you are using the latest version, adding `--required-seq-len-multiple 8` will make both the baseline and the treatment faster. When the batch size and input length are small, the gain is usually smaller. If you provide a runnable case, we could look into it further.
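For reference, a minimal sketch of how the suggested flag would be applied, simply appending it to the commands from the original report (paths, model file, and batch size are the reporter's; nothing else is changed):

```bash
# Baseline (fairseq) with the suggested sequence-length padding multiple
fairseq-generate ../data-bin \
  --path model_avg.pt --remove-bpe --batch-size 128 \
  --required-seq-len-multiple 8

# Treatment (fastseq) with the same flag added
fastseq-generate-for-fairseq ../data-bin \
  --path model_avg.pt --remove-bpe --batch-size 128 --postprocess-workers 5 \
  --required-seq-len-multiple 8
```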

dearchill commented 3 years ago

> Hi dearchill, thanks for your question. If you are using the latest version, adding `--required-seq-len-multiple 8` will make both the baseline and the treatment faster. When the batch size and input length are small, the gain is usually smaller. If you provide a runnable case, we could look into it further.

Hi, I added the `--required-seq-len-multiple 8` argument and saw no gains, which is strange. I'll continue testing with other models and test sets to see the effect, and will post the results here if I have any findings.