Add MariTalk Local benchmark

This PR adds a script to benchmark MariTalk Local. These are the results on a 4xL4 machine with a 1,000-token prompt and a request to generate 500 tokens:

            generated_tps            total_tps            queue_time          
                     mean       std       mean        std       mean       std
num_workers                                                                   
1               11.551775  0.148856  40.143781  10.920284   0.003000  0.000000
2               11.705084  0.241270  52.942721  42.269233   0.003900  0.001197
3               11.598977  0.167309  40.937140  19.785347   0.003467  0.001060
4               11.615300  0.174494  39.509272  17.519749   0.003700  0.000923

            generated_tps            total_tps             queue_time           
                     mean       std       mean         std       mean        std
num_workers                                                                     
1               11.502334  0.115886  33.730512    0.111345   0.003000   0.000000
2               11.602588  0.169303  33.926801    0.294484   0.003000   0.000000
3               11.611135  0.129473  78.183589  157.911516   0.003067   0.000258
4               11.619885  0.151930  37.738218   17.180896   0.003300   0.000923
5               11.540448  0.119938  42.551261   58.596441   7.244720  16.590633
6               11.575337  0.150861  34.154112   25.505039  12.434400  19.287535
7               11.560758  0.177267  39.628847   76.009304  17.815457  22.015438
8               11.576773  0.195011  59.968264  192.249997  18.973325  21.385482

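generated_tps is the throughput in tokens/s counting only output tokens, total_tps is the throughput in tokens/s counting input and output tokens, and queue_time is the time spent waiting for a GPU to become available. With up to 4 workers queue_time is ~0 because the machine has 4 GPUs; in the second table it grows once there are more concurrent clients than GPUs.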
The variation in total_tps is due to the model sometimes generating fewer than 500 tokens.
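
As a rough illustration of why an early stop inflates the spread of total_tps but not of generated_tps, here is a minimal sketch of how the two rates could be computed and summarized per worker count. It is not the benchmark script from this PR: the function name `request_metrics`, the sample values, and the assumption that both rates are divided by the same elapsed time are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def request_metrics(prompt_tokens, generated_tokens, elapsed_s):
    """Return (generated_tps, total_tps) for one request.

    Assumption: both rates use the same elapsed time for the request.
    """
    generated_tps = generated_tokens / elapsed_s
    total_tps = (prompt_tokens + generated_tokens) / elapsed_s
    return generated_tps, total_tps


# Hypothetical samples: (num_workers, prompt_tokens, generated_tokens, elapsed_s).
# The third request stops early at 250 tokens instead of 500.
samples = [
    (1, 1000, 500, 43.0),
    (1, 1000, 500, 43.5),
    (1, 1000, 250, 21.5),
]

# Group the per-request rates by number of workers.
by_workers = defaultdict(list)
for num_workers, prompt, generated, elapsed in samples:
    by_workers[num_workers].append(request_metrics(prompt, generated, elapsed))

# Summarize each group as (mean, sample standard deviation).
summary = {}
for num_workers, rows in sorted(by_workers.items()):
    gen = [r[0] for r in rows]
    tot = [r[1] for r in rows]
    summary[num_workers] = {
        "generated_tps": (mean(gen), stdev(gen)),
        "total_tps": (mean(tot), stdev(tot)),
    }

for num_workers, stats in summary.items():
    print(num_workers, stats)
```

In this sketch the request that stops early keeps roughly the same generated_tps, but the fixed 1,000-token prompt is divided by a much shorter elapsed time, so its total_tps jumps and the standard deviation of total_tps grows.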
