Closed · hugoabonizio closed this 6 months ago
This PR updates the MariTalk Local docs and adds a new benchmark tool for the local server.
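The tool measures per-request throughput (tokens/s) at several concurrency levels and reports mean/median/std over repeated runs. A minimal sketch of that measurement loop is below; `fake_generate` stands in for the actual call to the local MariTalk server, and all names here are illustrative, not the PR's actual implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median, stdev

def fake_generate(prompt_tokens: int, max_tokens: int) -> int:
    """Stand-in for a request to the local server; returns the generated token count."""
    time.sleep(0.01)  # simulate generation latency
    return max_tokens

def run_benchmark(concurrency: int, n_repeats: int, prompt_size: int, max_tokens: int) -> dict:
    """Per-request generated tokens/s at a given concurrency level."""
    tps_samples = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            results = list(pool.map(
                lambda _: fake_generate(prompt_size, max_tokens),
                range(concurrency),
            ))
        elapsed = time.perf_counter() - start
        # Per-request throughput: tokens generated by one request / wall time.
        tps_samples.append(mean(results) / elapsed)
    return {
        "mean": mean(tps_samples),
        "median": median(tps_samples),
        "std": stdev(tps_samples) if n_repeats > 1 else 0.0,
    }

stats = run_benchmark(concurrency=2, n_repeats=3, prompt_size=550, max_tokens=150)
print(sorted(stats))  # → ['mean', 'median', 'std']
```

As concurrency grows, per-request tokens/s drops (requests share the GPU) while aggregate throughput rises, which is the pattern visible in the tables below.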
Some results:
**MariTalk-small on 1xA100 40GB**

- Total tokens: 167.2 tokens/s
- Generated tokens: 42.2 tokens/s

```console
$ python benchmark.py --concurrency 1,2,4,8 --n-repeats 5 --prompt-size 550 --max-tokens 150
            generated_tps             total_tps
             mean median  std    mean  median   std
concurrency
1            42.2   42.2  0.3   167.1   167.2   0.7
2            24.2   24.2  1.0   101.4   101.4   0.9
4            13.0   13.2  1.0    56.5    57.1   2.5
8             7.1    7.2  0.6    30.8    30.9   0.6

            System tokens
            median    std
concurrency
1            167.2    0.7
2            202.9    0.1
4            230.4   10.3
8            245.8   11.9
```

![benchmark-small-1xa100](https://github.com/maritaca-ai/maritalk-api/assets/1206395/7acfb1c6-b2a2-40e4-b6e7-2ed08201819d)

**MariTalk-small on 2xA100 40GB**

- Total tokens: 213.5 tokens/s
- Generated tokens: 54.0 tokens/s

```console
$ python benchmark.py --concurrency 1,2,4,8 --n-repeats 5 --prompt-size 550 --max-tokens 150
            generated_tps             total_tps
             mean median  std    mean  median   std
concurrency
1            54.0   53.6  0.8   213.5   208.1  12.3
2            33.2   33.1  1.4   135.6   135.6   1.2
4            20.1   20.6  1.3    85.2    85.6   1.2
8            11.1   11.2  0.9    48.1    48.3   0.9

            System tokens
            median    std
concurrency
1            208.1   12.3
2            271.3    0.3
4            340.8    0.8
8            384.7    1.0
```

![benchmark-small-2xa100](https://github.com/maritaca-ai/maritalk-api/assets/1206395/524a0b74-7998-4f24-928d-61ae803b98eb)

**MariTalk-medium on 2xA100 40GB**

- Total tokens: 79.3 tokens/s
- Generated tokens: 18.6 tokens/s

```console
$ python benchmark.py --concurrency 1,2,4,8 --n-repeats 5 --prompt-size 550 --max-tokens 150 --tokenizer maritaca-ai/maritalk-tokenizer-large
            generated_tps             total_tps
             mean median  std    mean  median   std
concurrency
1            18.6   18.6  0.3    79.3    78.9   1.0
2            10.4   10.5  0.4    44.9    45.0   0.6
4             5.8    5.8  0.2    25.5    25.5   0.2
8             3.1    3.1  0.2    13.6    13.7   0.2

            System tokens
            median    std
concurrency
1             78.9    1.0
2             90.1    1.1
4            101.9    0.1
8            108.9    0.2
```

![benchmark-medium-2xa100](https://github.com/maritaca-ai/maritalk-api/assets/1206395/a379f94b-4472-4eeb-b166-d262bf853a1c)

**Sabiá-2 Small (GPU 1xA10 24GB)**

- Total tokens: 89.8 tokens/s
- Generated tokens: 21.3 tokens/s

```console
$ python benchmark.py --concurrency 1,2,4,8 --n-repeats 5 --prompt-size 550 --max-tokens 150
            generated_tps             total_tps
             mean median  std    mean  median   std
concurrency
1            21.3   21.3  0.2    89.8    88.6   2.2
2            11.3   11.2  0.4    47.7    48.0   0.9
4             5.8    5.9  0.3    24.4    24.5   0.3
8             2.9    2.9  0.2    12.2    12.2   0.2

            System tokens
            median    std
concurrency
1             88.6    2.2
2             96.4    1.9
4             97.6    0.2
8             97.5    0.2
```

![benchmark-small-1xa10](https://github.com/maritaca-ai/maritalk-api/assets/1206395/14ddbf13-b978-46e5-8cfc-2ebeea20f9c1)

edit: Updated benchmark values using 550 input tokens and 150 output tokens to make it easier to compare with LLMPerf results.
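As a sanity check on the "System tokens" figures: aggregate server throughput is roughly the per-request total tokens/s multiplied by the concurrency level. The snippet below verifies this against the median values from the MariTalk-small 1xA100 run above (values copied from those tables; this relationship is an observation, not something the benchmark tool computes this way):

```python
# System (aggregate) throughput ≈ per-request total tokens/s × concurrency.
# Median values from the MariTalk-small 1xA100 40GB tables.
per_request_total_tps = {1: 167.2, 2: 101.4, 4: 57.1, 8: 30.9}
reported_system_tps = {1: 167.2, 2: 202.9, 4: 230.4, 8: 245.8}

for c, tps in per_request_total_tps.items():
    estimate = c * tps
    # Allow a small tolerance: the reported figures are medians over 5 repeats.
    assert abs(estimate - reported_system_tps[c]) / reported_system_tps[c] < 0.01
    print(c, round(estimate, 1))
```

The estimates agree with the reported system throughput to within 1%, e.g. at concurrency 2 the estimate is 2 × 101.4 ≈ 202.8 tokens/s against the reported 202.9.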