This PR fixes/updates the inference benchmarking analysis scripts to support [fastgen, vllm, aml] backends. The scripts are generalized to support models beyond just Llama, which was previously hardcoded. A number of bugs and formatting issues are also resolved. The scripts that were fixed/updated are:
- plot_effective_throughput.py
- plot_latency_percentile.py
- plot_repl_scale.py
- plot_th_lat.py
- plot_tp_sizes.py
Example plots for the scripts (plot images attached to the PR):
- plot_effective_throughput.py
- plot_latency_percentile.py
- plot_repl_scale.py
- plot_th_lat.py
- plot_tp_sizes.py
NOTE: the resulting data should not be used to draw any conclusions. The GPUs, tp_size, etc. differ across the data points; these plots simply demonstrate plot generation.
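As a rough illustration of the kind of statistic plot_latency_percentile.py visualizes, the sketch below computes latency percentiles from a list of per-request latencies. This is a hypothetical, self-contained example, not the actual script's logic; the sample data and the chosen percentiles are placeholders.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in [0, 100])."""
    ordered = sorted(values)
    # Nearest-rank method: take the ceil(p/100 * n)-th smallest value,
    # clamped so p=0 still maps to the first element.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies (seconds) for one backend run.
latencies = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 3.1, 1.2, 1.5]
summary = {f"P{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}
print(summary)  # → {'P50': 1.1, 'P90': 2.4, 'P95': 3.1, 'P99': 3.1}
```

A plotting script would typically compute such summaries per backend and per model, then draw them side by side for comparison.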