opea-project / GenAIInfra

Containerization and cloud native suite for OPEA

LLM service unable to reach TGI service #126

Open igordcard opened 6 days ago

igordcard commented 6 days ago

Sometimes (always, as far as I can tell in my environment), chatqna-llm can't connect to chatqna-tgi, which prevents the llm pod from becoming ready. Everything else looks fine in both the llm and tgi services.

kubectl get pods:

chatqna-599d54cf5d-bfqw2                   1/1     Running     0                 7m36s
chatqna-embedding-usvc-7dc66bb8fb-n7pr5    1/1     Running     0                 7m36s
chatqna-llm-uservice-589477686b-lhtdb      0/1     Running     0                 7m36s
chatqna-redis-vector-db-6b8d5445f5-dqzjq   1/1     Running     0                 7m36s
chatqna-reranking-usvc-dc4cd8777-nlh2d     1/1     Running     0                 7m36s
chatqna-retriever-usvc-64cd465f58-b9zdw    1/1     Running     3 (6m56s ago)     7m36s
chatqna-tei-5c89cb855f-qm5sm               1/1     Running     0                 7m36s
chatqna-teirerank-6f8cb58db9-t7f4q         1/1     Running     0                 7m36s
chatqna-tgi-b5768b68d-qxn96                1/1     Running     0                 7m36s
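
One thing worth checking in this state is whether the chatqna-tgi Service resolves and actually has endpoints. A quick sanity check (a sketch; service name taken from the probe output further below, exact output will vary):

# Confirm the Service exists and has a ClusterIP:
kubectl get svc chatqna-tgi

# Confirm it has ready endpoints backing it:
kubectl get endpoints chatqna-tgi

If the Service had no endpoints the connection failure would be expected, but the tgi pod above is Running and 1/1 Ready, so that seems unlikely.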

When we check the logs of chatqna-llm-uservice, everything looks fine:

kubectl logs chatqna-llm-uservice-589477686b-lhtdb:

/usr/local/lib/python3.11/site-packages/pydantic/_internal/_fields.py:149: UserWarning: Field "model_name_or_path" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[2024-06-24 19:50:36,324] [    INFO] - CORS is enabled.
[2024-06-24 19:50:36,325] [    INFO] - Setting up HTTP server
[2024-06-24 19:50:36,325] [    INFO] - Uvicorn server setup on port 9000
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
[2024-06-24 19:50:36,338] [    INFO] - HTTP server setup successful
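
To double-check that the HTTP server really is serving, the cluster network can be bypassed entirely with a port-forward (a sketch; the /v1/health_check path is an assumption based on GenAIComps conventions):

kubectl port-forward pod/chatqna-llm-uservice-589477686b-lhtdb 9000:9000
# In another terminal:
curl http://localhost:9000/v1/health_check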

but the pod's events tell a different story:

kubectl describe pod chatqna-llm-uservice-589477686b-lhtdb:

  Warning  Unhealthy  11m (x2 over 11m)  kubelet  Startup probe failed:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to chatqna-tgi port 80 after 4 ms: Couldn't connect to server
  Warning  Unhealthy  3m3s (x62 over 8m8s)  kubelet  Startup probe failed: command "curl http://chatqna-tgi" timed out
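
For completeness, the exact probe the chart configured and the full event history can be pulled like this (a sketch; the jsonpath assumes a single container in the pod):

# Show the startup probe definition the Helm chart rendered:
kubectl get pod chatqna-llm-uservice-589477686b-lhtdb \
  -o jsonpath='{.spec.containers[0].startupProbe}'

# List all events for this pod, probe failures included:
kubectl get events --field-selector involvedObject.name=chatqna-llm-uservice-589477686b-lhtdb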

Furthermore, in this state we can't actually talk to ChatQnA at all: anything sent to :8888/v1/chatqna simply returns Internal Server Error.
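
For reference, this is the kind of request that fails (the payload shape follows the upstream ChatQnA examples; ${host_ip} is a placeholder for wherever port 8888 is exposed):

curl http://${host_ip}:8888/v1/chatqna \
  -H 'Content-Type: application/json' \
  -d '{"messages": "What is the revenue of Nike in 2023?"}'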

The way I've found to fix it is simply to exec into the llm pod and manually curl the chatqna-tgi service, which seems to somehow unblock the networking path to TGI:

kubectl exec -it chatqna-llm-uservice-589477686b-bskzn -- bash
curl chatqna-tgi
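
(The interactive shell isn't strictly needed; the same poke works as a one-liner, e.g. kubectl exec chatqna-llm-uservice-589477686b-bskzn -- curl -s http://chatqna-tgi.)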

And now everything works, including talking to ChatQnA.

kubectl get pods:

chatqna-599d54cf5d-bfqw2                   1/1     Running     0               15m
chatqna-embedding-usvc-7dc66bb8fb-n7pr5    1/1     Running     0               15m
chatqna-llm-uservice-589477686b-lhtdb      1/1     Running     1 (4m57s ago)   15m
chatqna-redis-vector-db-6b8d5445f5-dqzjq   1/1     Running     0               15m
chatqna-reranking-usvc-dc4cd8777-nlh2d     1/1     Running     0               15m
chatqna-retriever-usvc-64cd465f58-b9zdw    1/1     Running     3 (14m ago)     15m
chatqna-tei-5c89cb855f-qm5sm               1/1     Running     0               15m
chatqna-teirerank-6f8cb58db9-t7f4q         1/1     Running     0               15m
chatqna-tgi-b5768b68d-qxn96                1/1     Running     0               15m