This PR adds a script to benchmark MariTalk Local. These are the results on a 4xL4 machine with a 1,000-token prompt and a request to generate 500 tokens:
`generated_tps` is tokens/s counting only output tokens, `total_tps` is tokens/s counting both input and output tokens, and `queue_time` is the time spent waiting for a GPU to become available (~0 here, because there are 4 GPUs and at most 4 concurrent clients).
The variation in `total_tps` is due to the model generating fewer than 500 tokens in some runs.
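
As a rough illustration of how these metrics relate, here is a minimal sketch. It is not the benchmark script from this PR: the `RequestTiming` class, the `throughput` function, and the choice to divide by processing time (excluding queue time) are assumptions made for the example, and the numbers in the usage lines are invented, not measured.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record; field names are illustrative."""
    prompt_tokens: int        # tokens in the input prompt
    generated_tokens: int     # tokens actually produced by the model
    queue_time_s: float       # time spent waiting for a free GPU
    processing_time_s: float  # time spent processing the request


def throughput(t: RequestTiming) -> dict:
    """Compute the three reported metrics for one request.

    Assumption: both rates are measured over processing time only,
    so queue time is reported separately rather than folded in.
    """
    generated_tps = t.generated_tokens / t.processing_time_s
    total_tps = (t.prompt_tokens + t.generated_tokens) / t.processing_time_s
    return {
        "generated_tps": generated_tps,
        "total_tps": total_tps,
        "queue_time": t.queue_time_s,
    }


# Made-up timings: same prompt length, different output lengths.
full = throughput(RequestTiming(1000, 500, 0.0, 10.0))
short = throughput(RequestTiming(1000, 400, 0.0, 8.0))
print(full["generated_tps"], full["total_tps"])
print(short["generated_tps"], short["total_tps"])
```

In this sketch both requests have the same `generated_tps`, but `total_tps` differs because the fixed 1,000 prompt tokens are divided by a different processing time when the model stops before 500 tokens.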