rlancemartin opened this issue 1 year ago
@joehoover any ideas on what may be happening?
Wow, 195.8 sec is massive! @joehoover could this be a cold start problem? cc @bfirsh @mattt for visibility.
Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.
@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?
Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
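In case it's useful, one way to pull those IDs is the Replicate Python client. This is only a sketch based on the current client docs (it assumes `replicate.predictions.list()` is available in the client version you have installed and that `REPLICATE_API_TOKEN` is set in the environment):

```python
import replicate

# Sketch: list recent predictions on the account and print enough detail to
# spot the slow one (long gap between created_at and completed_at).
for prediction in replicate.predictions.list():
    print(
        prediction.id,           # prediction ID to share
        prediction.version,      # model version ID to share
        prediction.status,
        prediction.created_at,
        prediction.completed_at,
        prediction.metrics,      # e.g. predict_time, to separate queue/boot time from inference
    )
```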
We are using the Replicate integration with LangChain:
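A minimal sketch of that wiring, for context (the model slug/version hash is a placeholder, not the exact Vicuna-13b version we benchmarked; depending on the langchain version, generation parameters go in `input` or `model_kwargs`):

```python
from langchain.llms import Replicate

# Placeholder model slug/version; the actual version used in the benchmark
# is not shown here.
llm = Replicate(
    model="replicate/vicuna-13b:<version-hash>",
    input={"temperature": 0.75, "max_length": 500},
)

print(llm("What is the state capital of California?"))
```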
We are benchmarking question-answering latency with the LangChain auto-evaluator app: https://autoevaluator.langchain.com/playground
I run several inference calls and measure the latency of each:
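Concretely, the measurement is just wall-clock time around each call. A rough sketch (the questions and the `llm` object stand in for what the auto-evaluator app actually runs):

```python
import time

from langchain.llms import Replicate

# Same Replicate-backed LLM as above (placeholder version hash).
llm = Replicate(model="replicate/vicuna-13b:<version-hash>")

questions = [
    "<question 1>",
    "<question 2>",
    "<question 3>",
]

for q in questions:
    start = time.monotonic()
    llm(q)  # single inference call against the Replicate endpoint
    print(f"{time.monotonic() - start:.1f} sec for: {q}")
```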
We see very high inference latency (e.g., 195 sec) for the initial call, but subsequent calls are much faster (< 10 sec).
This is consistent across runs.
For example, another run today:
With additional logging, I confirmed that latency is indeed from calling the endpoint.
Why is this?
This hurts the latency assessment of Vicuna-13b relative to other models.