huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

The "/health" is so slow when generating extra-long text。 #2348

Open coderchem opened 3 months ago

coderchem commented 3 months ago

System Info

tgi 2.0.2

Information

Tasks

Reproduction

```rust
/// GRPC health check
#[instrument(skip(self))]
pub async fn health(&mut self) -> Result<HealthResponse> {
    // Send a health request to every shard client...
    let futures: Vec<_> = self
        .clients
        .iter_mut()
        .map(|client| client.health())
        .collect();
    // ...wait for all of them, then return only the last response
    join_all(futures).await.pop().unwrap()
}
```

```rust
/// Returns a client connected to the given url
pub async fn connect(uri: Uri) -> Result<Self> {
    let channel = Channel::builder(uri).connect().await?;

    Ok(Self {
        stub: TextGenerationServiceClient::new(channel),
    })
}

/// Returns a client connected to the given unix socket
pub async fn connect_uds(path: String) -> Result<Self> {
    let channel = Channel::from_shared("http://[::]:50051".to_string())
        .unwrap()
        .connect_with_connector(tower::service_fn(move |_: Uri| {
            tokio::net::UnixStream::connect(path.clone())
        }))
        .await?;

    Ok(Self {
        stub: TextGenerationServiceClient::new(channel),
    })
}
```

This code path returns very slowly when going through gRPC, especially while generating from a very long prompt, e.g. a 125k-token context. I am using llama3-8B. A call to /health can take more than 10 s. This seriously affects normal usage.
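A minimal sketch of how this can be observed, assuming TGI is listening on `http://localhost:3000` with the standard `/generate` and `/health` routes; the prompt length, `max_new_tokens`, and dependency versions here are only placeholders, not the exact setup from the report:

```rust
// Assumed Cargo.toml deps: tokio = { version = "1", features = ["full"] },
// reqwest = { version = "0.12", features = ["json"] }, serde_json = "1"
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:3000"; // assumed TGI address
    let client = reqwest::Client::new();

    // Start a generation with a very long prompt in the background
    // (a stand-in for the ~125k-token context mentioned above).
    let long_prompt = "word ".repeat(50_000);
    let gen_client = client.clone();
    let gen_base = base.to_string();
    let generation = tokio::spawn(async move {
        gen_client
            .post(format!("{gen_base}/generate"))
            .json(&serde_json::json!({
                "inputs": long_prompt,
                "parameters": { "max_new_tokens": 512 }
            }))
            .send()
            .await
    });

    // While the generation is in flight, time /health repeatedly.
    for _ in 0..10 {
        let start = Instant::now();
        let status = client.get(format!("{base}/health")).send().await?.status();
        println!("/health -> {status} in {:?}", start.elapsed());
        tokio::time::sleep(Duration::from_secs(1)).await;
    }

    let _ = generation.await?;
    Ok(())
}
```

With a long generation running, the printed /health latencies climb well above what they are on an idle server.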

Expected behavior

As described in the title: /health should stay fast even while a long generation is in progress.

ErikKaum commented 3 months ago

Hi @coderchem 👋

Thanks for opening the issue!

I'm not 100% sure I understand the exact problem, but do I understand correctly that the /health endpoint becomes slow while an inference with a long text generation is in progress?