optimum-benchmark is in constant change; you can find the configs that were used in https://github.com/huggingface/optimum-benchmark/tree/0.0.1/examples/training-llamas
Same thing for inference: there are many good examples, but maintaining them at the pace the ecosystem moves is time consuming, so we removed them for the time being.
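If it helps, those examples are hydra configs, so running one should look roughly like the sketch below (assuming the 0.0.1 tag and a working PyTorch/ROCm install; the config name is a placeholder, pick one of the actual yaml files in the folder):

```bash
# Rough sketch, not an exact recipe: check out the 0.0.1 tag and launch one of
# the training-llamas configs with the hydra-based CLI.
git clone --branch 0.0.1 https://github.com/huggingface/optimum-benchmark.git
cd optimum-benchmark
pip install -e .

# Replace <config-name> with one of the yaml files in examples/training-llamas
optimum-benchmark --config-dir examples/training-llamas --config-name <config-name>
```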
(By the way, those training examples use peft, which means LoRA-style fine-tuning.)

Thanks for the prompt response 😄 I totally understand the need for quick development. Did you try any large-scale training on AMD? I don't know if that's the goal of optimum-benchmark, but it would still be cool to know. I'm asking because I'm looking for a suitable codebase to benchmark some training on AMD (not LoRA).
@staghado sorry for the late response, I haven't been working on optimum-benchmark lately; you can check the new work in https://huggingface.co/blog/huggingface-amd-mi300
The goal of optimum-benchmark is to let you easily get metrics like training throughput and memory consumption, and to check whether a given training setup is possible at all, quickly and without having to set up the data + training pipeline. You can also compare different configurations and find the one that your machine can handle, or that best matches the topology of your machines (e.g. which tp/dp degree to use).
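Since everything goes through hydra, comparing configurations mostly means overriding fields from the command line (or sweeping them with hydra's `--multirun`); something along these lines, where the override keys are only illustrative and the exact field names are those in the example yamls:

```bash
# Illustrative sketch: sweep a setting (here a batch size, as a placeholder key)
# across several values with hydra's --multirun and compare the reported
# throughput/memory of each run. Not the exact schema.
optimum-benchmark --config-dir examples/training-llamas \
  --config-name <config-name> \
  --multirun \
  benchmark.training_arguments.per_device_train_batch_size=1,2,4
```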
The training benchmark link no longer works: https://huggingface.co/blog/huggingface-and-optimum-amd
How can one test training throughput on AMD these days? Also, can you provide details about the experiments in the figure below: what context length, is it LoRA, how can you have ddp=2 with 1x MI250, ...