replicate / cog-vicuna-13b

A template to run Vicuna-13B in Cog
https://replicate.com/replicate/llama-7b
Apache License 2.0

High latency on the first inference call #7

Open rlancemartin opened 1 year ago

rlancemartin commented 1 year ago

We are using the Replicate integration with LangChain:

from langchain.llms import Replicate

llm = Replicate(model="replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
                input={"temperature": 0.75, "max_length": 3000, "top_p": 0.25})

We are benchmarking question-answering latency with the LangChain auto-evaluator app: https://autoevaluator.langchain.com/playground

I run several inference calls and measure the latency of each.
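Roughly, each measurement is just wall-clock time around the LangChain call; a minimal sketch of the idea (illustrative questions, not the auto-evaluator's actual harness):

import time

# Illustrative timing loop around the llm defined above; the auto-evaluator
# app does the equivalent for each question-answering call.
questions = [
    "What is LangChain?",                      # hypothetical prompts
    "How does Vicuna-13B differ from LLaMA?",
]

for q in questions:
    start = time.perf_counter()
    _ = llm(q)                                 # synchronous call to the Replicate endpoint
    print(f"{time.perf_counter() - start:.1f} sec  {q}")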

We see very high inference latency (e.g., 195 sec) for the initial call.

But subsequent calls are much faster (< 10 sec).

This is consistent across runs.

For example, another run today showed the same pattern.

With additional logging, I confirmed that latency is indeed from calling the endpoint.

Why is this?

It hurts the latency assessment of Vicuna-13b relative to other models:

[screenshot: latency comparison of Vicuna-13b vs. other models]

rlancemartin commented 1 year ago

@joehoover any ideas on what may be happening?

dankolesnikov commented 1 year ago

Wow, 195.8 sec is massive! @joehoover could this be a cold start problem? cc @bfirsh @mattt for visibility.

joehoover commented 1 year ago

Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.

@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?

Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
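For reference, one way to capture a prediction ID is to create the prediction through the replicate Python client directly; a rough sketch, assuming the client's predictions.create / wait() interface and a hypothetical prompt:

import replicate

# Rough sketch: create a prediction directly so its ID can be captured for debugging.
# The version hash is the same one used in the LangChain config above; the prompt is hypothetical.
model = replicate.models.get("replicate/vicuna-13b")
version = model.versions.get("e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e")

prediction = replicate.predictions.create(
    version=version,
    input={"prompt": "What is LangChain?", "temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)
prediction.wait()            # block until the prediction finishes
print(prediction.id)         # prediction ID to share when reporting a slow response
print(prediction.metrics)    # timing info (e.g. predict_time), when available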