kernelmachine opened this issue 1 year ago
Just a few data points from OpenLM with default hparams: we get ~2.5K tokens/sec/GPU on 256 A100s for OpenLM-7B, ~9.5K tokens/sec/GPU on 128 A100s for OpenLM-1B, and ~11.5K tokens/sec/GPU on 32 A100s for the 1B model. On a single node, the 7B model gets ~2,700 tokens/sec/GPU.
This could also be useful: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/benchmarking
@achalddave is making fantastic progress here. When we're done, let's add the benchmarking results to the repository so others can compare numbers and check whether they locally get the expected performance.
See #29 for improvements. We are now roughly matching the MosaicML numbers on one node (~4,000 tok/s/GPU for a 7B model with batch size 16, sequence length 2048, on 8 A100s). Will close this once we verify convergence on a large run.
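For reference, a tok/s/GPU figure follows directly from per-GPU batch size, sequence length, GPU count, and wall-clock step time. A minimal sketch of that arithmetic (the step time of 8.192 s below is a back-of-the-envelope assumption chosen to reproduce the ~4,000 tok/s/GPU figure, not a measured OpenLM value):

```python
def tokens_per_sec_per_gpu(per_gpu_batch: int, seq_len: int,
                           n_gpus: int, step_time_s: float) -> float:
    """Training throughput in tokens/sec/GPU for one optimizer step.

    per_gpu_batch: micro-batch size on each GPU (e.g. 16)
    seq_len:       sequence length in tokens (e.g. 2048)
    n_gpus:        number of data-parallel GPUs
    step_time_s:   wall-clock seconds per training step
    """
    # Total tokens processed across all GPUs in one step.
    tokens_per_step = per_gpu_batch * seq_len * n_gpus
    # Normalize by time and by GPU count.
    return tokens_per_step / step_time_s / n_gpus

# Example with the 1-node settings above; 8.192 s/step is hypothetical.
# 16 * 2048 = 32768 tokens per GPU per step -> 32768 / 8.192 = 4000.
print(tokens_per_sec_per_gpu(16, 2048, 8, 8.192))
```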
It would be great to benchmark tokens/sec of OpenLM, comparing against other libraries like Mosaic, Metaseq, etc.