Closed jaywonchung closed 1 year ago
@jaywonchung You can check the models folder for details. See the scaling_config section of each model file for how many workers, how many GPUs per worker, and which GPU types were used for each model; the max batch size is in the model files as well. The performance numbers are updated in real time with each query made by users. You can deploy Aviary yourself to reproduce the results; the configuration in this repository is exactly what we use for the website.
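For illustration, the kind of information described above (workers, GPUs per worker, GPU type, max batch size) could be represented and read as in the following minimal Python sketch. The field names here are assumptions for illustration, not the exact Aviary schema; check the actual model files for the real keys.

```python
# Hypothetical sketch of the scaling_config fields described above.
# Field names (num_workers, num_gpus_per_worker, etc.) are illustrative
# assumptions, not the exact Aviary model-file schema.
model_config = {
    "scaling_config": {
        "num_workers": 2,          # number of workers serving the model
        "num_gpus_per_worker": 1,  # GPUs allocated to each worker
        "gpu_type": "A10",         # accelerator type used
    },
    "max_batch_size": 6,           # largest batch the server will form
}

def total_gpus(config: dict) -> int:
    """Total GPUs for a deployment = workers x GPUs per worker."""
    sc = config["scaling_config"]
    return sc["num_workers"] * sc["num_gpus_per_worker"]

print(total_gpus(model_config))  # -> 2
```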
For the Llama-based models (which we are not sharing due to the need for delta weights), we used two A10s, DeepSpeed tensor parallelism, and a batch size of 6.
That's really nice. Thank you for your detailed answer!
Thanks for putting the leaderboard up. Could you comment on how the performance numbers on the leaderboard at https://aviary.anyscale.com/ were generated?
For instance, what GPU was used? For larger models, was distributed inference used? Were they run with batch size 1? Can we run the same benchmark using the code in this repository?
Thanks a lot.