Add a random seed to make the benchmarking reproducible.
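As a minimal sketch of what seeding buys here, the snippet below draws synthetic prompt lengths from a seeded generator so that two runs with the same seed see the same workload. The function name `sample_prompt_lengths` and its `low`/`high` bounds are hypothetical and are not taken from the benchmark script.

```python
import random


def sample_prompt_lengths(num_requests, seed, low=16, high=512):
    """Draw prompt lengths from a private, seeded RNG (hypothetical helper)."""
    # A dedicated Random instance avoids touching the global RNG state.
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_requests)]


# Same seed, same workload; this is what makes repeated runs comparable.
run_a = sample_prompt_lengths(8, seed=0)
run_b = sample_prompt_lengths(8, seed=0)
```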
Currently, the benchmark computes decoding throughput as `decoding_step / end_to_end_latency`. However, one decoding step may generate multiple tokens, so this ratio can undercount the tokens actually produced. This PR computes the decoding throughput after obtaining the correct number of output tokens.
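The difference between the two metrics can be illustrated with a small sketch. The `RequestResult` record and both function names are assumptions made for illustration; they are not the benchmark's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical per-request record, not the benchmark's real type.
    decoding_steps: int   # number of decoding iterations
    output_tokens: int    # tokens actually generated


def step_based_throughput(results, end_to_end_latency_s):
    """Old metric: decoding steps per second."""
    if end_to_end_latency_s <= 0:
        raise ValueError("latency must be positive")
    return sum(r.decoding_steps for r in results) / end_to_end_latency_s


def token_based_throughput(results, end_to_end_latency_s):
    """New metric: generated output tokens per second."""
    if end_to_end_latency_s <= 0:
        raise ValueError("latency must be positive")
    return sum(r.output_tokens for r in results) / end_to_end_latency_s


# Two requests where some steps emitted more than one token.
results = [RequestResult(decoding_steps=10, output_tokens=25),
           RequestResult(decoding_steps=8, output_tokens=15)]
old = step_based_throughput(results, 2.0)   # 18 steps / 2 s
new = token_based_throughput(results, 2.0)  # 40 tokens / 2 s
```

When every step yields exactly one token the two values coincide; they diverge only when a step produces several tokens.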
cc @rickyyx