mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Improve GPU utilization for "translate" tasks #785

Open eu9ene opened 4 months ago

eu9ene commented 4 months ago

Currently, GPU utilization is ~70%. We could try using a bigger batch size, but the optimal value also depends on the language.

GCP console for the translate-mono task:

[screenshot: GCP console GPU utilization for translate-mono, 2024-07-31]
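For reference, a rough sketch of what bumping the decoder's batch settings could look like, wrapped in a Python call for illustration (paths and values are placeholders, not tuned):

```python
import subprocess

# Illustrative only: larger mini-batch/maxi-batch values and a bigger workspace
# can improve GPU utilization, but the safe maximum depends on the language
# (sentence-length distribution) and on available GPU memory.
cmd = [
    "marian-decoder",
    "--models", "model.npz",        # hypothetical paths
    "--vocabs", "vocab.spm", "vocab.spm",
    "--devices", "0",
    "--beam-size", "4",
    "--mini-batch", "64",           # sentences per batch
    "--maxi-batch", "1000",         # batches pre-loaded for length sorting
    "--maxi-batch-sort", "src",     # sort by source length to reduce padding waste
    "--workspace", "12000",         # MB of GPU memory reserved for computation
]
subprocess.run(cmd, check=True)
```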
eu9ene commented 4 months ago

It appears to be even lower for translate-corpus: GCP console

[screenshot: GCP console GPU utilization for translate-corpus, 2024-07-31]

@gregtatum FYI

gregtatum commented 3 months ago

Is it possible to dynamically determine this value? Like run N translations, measure and adjust?
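Something along these lines, as a very rough sketch (the helpers are made up; the real pipeline would hook this into the translate task and sweep a few batch sizes):

```python
import statistics
import subprocess
import time

def gpu_utilization_percent() -> float:
    """Sample current GPU utilization via nvidia-smi (averaged over GPUs)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return statistics.mean(float(line) for line in out.splitlines() if line.strip())

def measure(run_translation, max_duration_s=60.0, interval_s=2.0) -> float:
    """Run one trial translation of a fixed sample of sentences and return the
    average GPU utilization observed while it runs. `run_translation` is a
    hypothetical callable that starts marian-decoder and returns its Popen."""
    samples = []
    start = time.time()
    proc = run_translation()
    while proc.poll() is None and time.time() - start < max_duration_s:
        samples.append(gpu_utilization_percent())
        time.sleep(interval_s)
    proc.wait()
    return statistics.mean(samples) if samples else 0.0
```

A caller could then sweep a few `--mini-batch` values and keep the one with the highest utilization (or, more directly, the highest sentences per second).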

ZJaume commented 1 month ago

I've noticed this too, and it has always been this way. I think the bottleneck is decoding. Generating n-best lists with beam size 8 seems to use the GPU much less than decoding without n-best at a beam size of around 4–6.

This won't increase GPU utilization, but I've been using `--fp16` during inference and training without any significant quality drop. I haven't compared n-best generation, though.
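To make the comparison concrete, these are roughly the two decoder configurations I mean (flag values illustrative):

```python
# Illustrative marian-decoder flag sets for the comparison above.
# --fp16 enables half-precision inference; --n-best emits the full n-best list.
nbest_beam8 = ["--beam-size", "8", "--n-best", "--fp16"]
plain_beam4 = ["--beam-size", "4", "--fp16"]
```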

ZJaume commented 1 month ago

Another option would be to compare against CTranslate2, which has faster inference than Marian.
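A minimal sketch of what that could look like with CTranslate2's Python API, assuming the teacher has already been converted with `ct2-marian-converter` (paths, batch settings, and the SentencePiece handling are placeholders, not the pipeline's actual code):

```python
import ctranslate2
import sentencepiece as spm

# Hypothetical paths: a teacher model converted to CTranslate2 format and the
# SentencePiece model used by the pipeline.
translator = ctranslate2.Translator("ct2_model_dir", device="cuda")
sp = spm.SentencePieceProcessor(model_file="vocab.spm")

sentences = ["This is a test.", "GPU utilization should be higher here."]
tokens = [sp.encode(s, out_type=str) for s in sentences]

# Keep beam_size the same as the Marian run so the comparison is fair;
# batching by tokens lets CTranslate2 pack batches efficiently on its own.
results = translator.translate_batch(
    tokens, beam_size=4, max_batch_size=2048, batch_type="tokens"
)
hypotheses = [sp.decode(r.hypotheses[0]) for r in results]
```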

eu9ene commented 1 month ago

Related to #165

gregtatum commented 1 week ago

Training uses dynamic batch sizes: it changes the batch size over time to find the best value, so there's not really a need to adjust it. It starts somewhat inefficient, but quickly dials in the number to be as efficient as it can.
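(For context, I believe the Marian option behind this is `--mini-batch-fit`, which sizes each mini-batch dynamically to fill the reserved workspace; a sketch of the relevant training flags, values illustrative:)

```python
# Illustrative marian training flags: with --mini-batch-fit, the mini-batch size
# is chosen dynamically to fill the --workspace memory instead of being fixed.
train_flags = ["--mini-batch-fit", "--workspace", "12000", "--maxi-batch", "1000"]
```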

Translate tasks, however, do not use dynamic batch sizing. I played with them in #931 and, by adjusting the batching behavior, got them to be about as efficient as training. I think this ~70% is just the cap on Marian's ability to utilize the GPUs. CTranslate2 was able to get ~96% utilization and was much faster given the same beam size.

It'll take a bit more time to get COMET scores for CTranslate2 so we can cross-compare. CTranslate2 doesn't support ensemble decoding, so we'll have to compare against Marian decoding with a single teacher.
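A sketch of how that cross-comparison could be scored with the unbabel-comet package (the model name and data handling here are assumptions, not the pipeline's actual evaluation code):

```python
from comet import download_model, load_from_checkpoint

# Hypothetical inputs: the same eval set decoded once by Marian (single teacher)
# and once by CTranslate2, plus the source and reference sides.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def comet_score(src_lines, hyp_lines, ref_lines) -> float:
    data = [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(src_lines, hyp_lines, ref_lines)
    ]
    return model.predict(data, batch_size=32, gpus=1).system_score

# A negligible difference between comet_score(src, marian_out, ref) and
# comet_score(src, ct2_out, ref) would support switching decoders.
```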