Open coderchem opened 3 months ago
System Info
tgi 2.0.2
Reproduction
```rust
/// GRPC health check
#[instrument(skip(self))]
pub async fn health(&mut self) -> Result<HealthResponse> {
    // ...
}

/// Returns a client connected to the given url
pub async fn connect(uri: Uri) -> Result<Self> {
    let channel = Channel::builder(uri).connect().await?;
    // ...
}
```
When this code path handles the gRPC call, the response comes back very slowly, especially while a very long input is being processed, e.g. a ~125k-token context. I am using llama3-8B. A /health call then takes more than 10 seconds, which seriously affects normal use.
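For concreteness, here is a minimal sketch of how this latency could be measured from the outside. It is an illustration, not code from this repo: the base URL and port, the prompt placeholder, and the request body shape are assumptions (TGI's router exposes /generate and /health over HTTP). It requires the tokio, reqwest (with the json feature), and serde_json crates.

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // Assumption: a local TGI router listening on port 3000.
    let base = "http://localhost:3000";

    // Start one long-context generation in the background.
    let generation = tokio::spawn({
        let client = client.clone();
        let url = format!("{base}/generate");
        async move {
            client
                .post(url)
                .json(&serde_json::json!({
                    // Placeholder: substitute a ~125k-token prompt here.
                    "inputs": "<very long prompt>",
                    "parameters": { "max_new_tokens": 512 }
                }))
                .send()
                .await
        }
    });

    // While the generation is in flight, time repeated /health probes.
    for _ in 0..10 {
        let start = Instant::now();
        let status = client.get(format!("{base}/health")).send().await?.status();
        println!("/health -> {status} after {:?}", start.elapsed());
        tokio::time::sleep(Duration::from_secs(1)).await;
    }

    let _ = generation.await;
    Ok(())
}
```

If the /health timings climb while the long request runs rather than staying flat, that matches the behavior described above.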
Expected behavior
As stated in the title: the /health endpoint should respond quickly even while a long-context generation request is being processed.

Hi @coderchem 👋
Thanks for opening the issue!
I'm not 100% sure I understand the exact problem, but do I understand correctly that the /health endpoint becomes slower when there's an inference going on with a long text generation?