lukehsiao opened 6 months ago

Suppose we want to load-test an API which uses server-sent events (SSE). Is it possible to measure the time-to-first-byte using Goose?
Can you provide some examples of how you’re using SSE and what metrics you’d want to measure? What technologies are you using?
Hmmm, I can try to get more specific if you need, but one example would be load testing an API like ChatGPT, which uses SSE so that you can start to see the response streaming back as it is generated, rather than staring at a blank page for a long time before the entire response is complete.
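For context, an SSE response is a single long-lived HTTP response with `Content-Type: text/event-stream` whose body arrives incrementally as events. For a ChatGPT-style API, the wire format looks roughly like this (the payloads are illustrative):

```text
HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"token": "Hello"}

data: {"token": " world"}

data: [DONE]
```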
In these types of use cases, time-to-first-token (essentially time-to-first-byte) is the interesting metric, as it represents the latency between submitting a query and the moment the user begins to receive a response. This metric is often what dictates how responsive a streaming LLM API feels to a user.
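Concretely, the measurement itself is small. Here is a minimal sketch using reqwest directly, with a placeholder URL:

```rust
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let start = Instant::now();

    // The response resolves as soon as the headers arrive; the body
    // then streams in as the server generates it.
    let mut response = reqwest::get("https://example.com/v1/chat/stream").await?;

    // The elapsed time at the first body chunk is the time-to-first-byte.
    if let Some(_chunk) = response.chunk().await? {
        println!("time-to-first-byte: {:?}", start.elapsed());
    }

    Ok(())
}
```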
The "Important Metrics for LLM Serving" section of https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices says more about this:
> Our team uses four key metrics for LLM serving:
>
> - Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
So, the question then is: can `goose` be used to load-test and measure time-to-first-byte as a proxy for time-to-first-token? Could I use `goose` to try to reproduce some of the results in this Databricks blog post?
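To make that concrete, here is a rough sketch of the kind of transaction I would hope to write. The `/v1/chat/stream` path is made up, and the `println!` is a stand-in, since as far as I can tell Goose doesn't currently have a way to record a custom timing like this in its metrics:

```rust
use std::time::Instant;

use goose::prelude::*;

// Rough sketch: "/v1/chat/stream" is a made-up endpoint path.
async fn sse_time_to_first_token(user: &mut GooseUser) -> TransactionResult {
    let start = Instant::now();

    // Goose times the request itself as usual; here we additionally
    // time the arrival of the first body chunk.
    let goose = user.get("/v1/chat/stream").await?;

    if let Ok(mut response) = goose.response {
        // Read just the first chunk of the event stream; the elapsed
        // time at this point is the time-to-first-byte, a proxy for
        // time-to-first-token.
        if let Ok(Some(_chunk)) = response.chunk().await {
            println!("time-to-first-byte: {:?}", start.elapsed());
        }
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), GooseError> {
    GooseAttack::initialize()?
        .register_scenario(
            scenario!("SseTimeToFirstToken")
                .register_transaction(transaction!(sse_time_to_first_token)),
        )
        .execute()
        .await?;
    Ok(())
}
```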
Does that help clarify?
That’s very helpful, yes. I’ll find some time to test and see what can be done. I expect it will take some code changes/additions to be useful.