huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature request] Performance and accuracy benchmarks #61

Open brianyu-nexusflowai opened 7 months ago

brianyu-nexusflowai commented 7 months ago

Hi Huggingface Nanotron team!

Could I request some tooling around nanotron that measures how fast it is compared to other LM training frameworks, e.g. FSDP, DeepSpeed, and Megatron-LM? It would be great to have performance metrics under different training workloads, e.g. Llama 2 7/13/34/70B × sequence length 2048/4096/8192 × global batch size 128/4096. The metrics I'm interested in are seconds/step, peak GPU memory usage, and communication time.
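For seconds/step and peak memory, I'd expect something roughly like the following in plain PyTorch (just a sketch of the measurement I have in mind; `model`, `optimizer`, and `dataloader` are placeholders for whatever the framework under test provides):

```python
# Rough harness for seconds/step and peak GPU memory in plain PyTorch.
# `model`, `optimizer`, and `dataloader` are placeholders, not nanotron APIs.
import time
import torch

torch.cuda.reset_peak_memory_stats()
step_times = []
for batch in dataloader:
    torch.cuda.synchronize()  # drain previously queued kernels before timing
    t0 = time.perf_counter()

    loss = model(batch)       # placeholder forward pass returning a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()  # wait for this step's kernels to finish
    step_times.append(time.perf_counter() - t0)

print(f"seconds/step: {sum(step_times) / len(step_times):.3f}")
print(f"peak GPU mem: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```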

Additionally, could I request some end-to-end tests that finetune an LM on a dataset and evaluate its downstream performance on a difficult task? An example is finetuning Llama 2 7B on the Open-Platypus dataset and evaluating it on the OpenLLM leaderboard benchmarks. Ideally these e2e tests would also ship as a script that could be run as a sanity check on any new nanotron Docker setup to reproduce the reported numbers.
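As a rough picture of the kind of sanity script I mean (everything here is hypothetical: the entry points, paths, and threshold are placeholders, not real nanotron commands):

```python
# Hypothetical e2e smoke test: finetune.py, evaluate.py, and the accuracy
# floor are placeholders illustrating the shape of the check, nothing more.
import json
import subprocess

def test_finetune_and_eval():
    # 1) Finetune Llama 2 7B on Open-Platypus (placeholder command).
    subprocess.run(
        ["python", "finetune.py", "--model", "llama-2-7b",
         "--dataset", "Open-Platypus", "--output", "ckpt/"],
        check=True,
    )
    # 2) Evaluate the checkpoint on a downstream benchmark (placeholder command).
    subprocess.run(
        ["python", "evaluate.py", "--checkpoint", "ckpt/",
         "--output", "results.json"],
        check=True,
    )
    # 3) Fail the test if accuracy regressed below an expected floor.
    with open("results.json") as f:
        results = json.load(f)
    assert results["accuracy"] >= 0.40  # placeholder threshold
```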

I know this is a lot to ask, especially since I'm not in a position to contribute personally. Thank you so much!

Cheers, Brian

NouamaneTazi commented 7 months ago

Hello @brianyu-nexusflowai! Thanks for your interest. We can try to run some of these benchmarks for you. How would you measure communication time?

brianyu-nexusflowai commented 7 months ago

Hi Nouamane!

Thanks for the response. Maybe something similar to DeepSpeed's metric for the time taken by all_gather/all_reduce, e.g. https://github.com/microsoft/DeepSpeed/blob/0a10bd427e035cbd185c2d44346996e8c1a0b42d/deepspeed/runtime/engine.py#L1996-L2002. I'm not sure which communication primitives nanotron uses, but it would be great to have some measure of how long these operations take!
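Concretely, I'm imagining a thin timing wrapper around each collective, something like this (a rough sketch using CUDA events; `timed_all_reduce` is a made-up helper name, not an existing nanotron or DeepSpeed API):

```python
# Sketch of timing a single collective with CUDA events, similar in spirit
# to DeepSpeed's comms logging. `timed_all_reduce` is a hypothetical helper.
import torch
import torch.distributed as dist

def timed_all_reduce(tensor: torch.Tensor, group=None) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.all_reduce(tensor, group=group)
    end.record()
    end.synchronize()                # block until the collective's kernels finish
    return start.elapsed_time(end)   # elapsed time in milliseconds
```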

Cheers, Brian