opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA, Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

[Bug] ChatQnA not working on stock Kubernetes cluster #725

Closed: arun-gupta closed this issue 2 months ago

arun-gupta commented 2 months ago

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-SPR

Installation method

Deploy method

Running nodes

Single Node

What's the version?

latest tag per the Helm chart at https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/kubernetes/manifests/xeon/chatqna.yaml.

Description

I am deploying ChatQnA on Kubernetes following the instructions at https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA/kubernetes/manifests. The tei, teirerank, and tgi pods are stuck in the ContainerCreating status and never fully start (see the raw log below).

@mkbhanda

Reproduce steps

Here are the exact steps: https://gist.github.com/arun-gupta/fd3793baadc9feb4c3883c80b9481161

Raw log

ec2-user:~/environment:$ kubectl get all 
NAME                                           READY   STATUS              RESTARTS   AGE
pod/chatqna-79d8c5ffff-m2fb9                   1/1     Running             0          8m29s
pod/chatqna-data-prep-77dcc665f4-gjj7t         1/1     Running             0          8m30s
pod/chatqna-embedding-usvc-55d4dc8f67-6qrln    1/1     Running             0          8m30s
pod/chatqna-llm-uservice-66cc67785-vkpc9       1/1     Running             0          8m30s
pod/chatqna-redis-vector-db-5dcd98f579-x7k9q   1/1     Running             0          8m30s
pod/chatqna-reranking-usvc-759bf96c5c-fl6f8    1/1     Running             0          8m30s
pod/chatqna-retriever-usvc-86f8dfbfb6-pfktk    1/1     Running             0          8m30s
pod/chatqna-tei-565488dd9-p4cj7                0/1     ContainerCreating   0          8m30s
pod/chatqna-teirerank-6c9854cfdf-mmgqh         0/1     ContainerCreating   0          8m30s
pod/chatqna-tgi-587b54f5ff-fcfqn               0/1     ContainerCreating   0          8m29s

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/chatqna                   ClusterIP   172.20.167.162   <none>        8888/TCP            8m30s
service/chatqna-data-prep         ClusterIP   172.20.240.173   <none>        6007/TCP            8m30s
service/chatqna-embedding-usvc    ClusterIP   172.20.194.245   <none>        6000/TCP            8m30s
service/chatqna-llm-uservice      ClusterIP   172.20.70.157    <none>        9000/TCP            8m30s
service/chatqna-redis-vector-db   ClusterIP   172.20.165.213   <none>        6379/TCP,8001/TCP   8m30s
service/chatqna-reranking-usvc    ClusterIP   172.20.112.188   <none>        8000/TCP            8m30s
service/chatqna-retriever-usvc    ClusterIP   172.20.204.167   <none>        7000/TCP            8m30s
service/chatqna-tei               ClusterIP   172.20.116.54    <none>        80/TCP              8m30s
service/chatqna-teirerank         ClusterIP   172.20.22.103    <none>        80/TCP              8m30s
service/chatqna-tgi               ClusterIP   172.20.10.36     <none>        80/TCP              8m30s
service/kubernetes                ClusterIP   172.20.0.1       <none>        443/TCP             24m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/chatqna                   1/1     1            1           8m30s
deployment.apps/chatqna-data-prep         1/1     1            1           8m30s
deployment.apps/chatqna-embedding-usvc    1/1     1            1           8m30s
deployment.apps/chatqna-llm-uservice      1/1     1            1           8m30s
deployment.apps/chatqna-redis-vector-db   1/1     1            1           8m30s
deployment.apps/chatqna-reranking-usvc    1/1     1            1           8m30s
deployment.apps/chatqna-retriever-usvc    1/1     1            1           8m30s
deployment.apps/chatqna-tei               0/1     1            0           8m30s
deployment.apps/chatqna-teirerank         0/1     1            0           8m30s
deployment.apps/chatqna-tgi               0/1     1            0           8m30s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/chatqna-79d8c5ffff                   1         1         1       8m29s
replicaset.apps/chatqna-data-prep-77dcc665f4         1         1         1       8m30s
replicaset.apps/chatqna-embedding-usvc-55d4dc8f67    1         1         1       8m30s
replicaset.apps/chatqna-llm-uservice-66cc67785       1         1         1       8m30s
replicaset.apps/chatqna-redis-vector-db-5dcd98f579   1         1         1       8m30s
replicaset.apps/chatqna-reranking-usvc-759bf96c5c    1         1         1       8m30s
replicaset.apps/chatqna-retriever-usvc-86f8dfbfb6    1         1         1       8m30s
replicaset.apps/chatqna-tei-565488dd9                1         1         0       8m30s
replicaset.apps/chatqna-teirerank-6c9854cfdf         1         1         0       8m30s
replicaset.apps/chatqna-tgi-587b54f5ff               1         1         0       8m29s
ec2-user:~/environment:$ kubectl logs pod/chatqna-tei-565488dd9-p4cj7
Error from server (BadRequest): container "tei" in pod "chatqna-tei-565488dd9-p4cj7" is waiting to start: ContainerCreating
ec2-user:~/environment:$ kubectl logs pod/chatqna-teirerank-6c9854cfdf-mmgqh
Error from server (BadRequest): container "teirerank" in pod "chatqna-teirerank-6c9854cfdf-mmgqh" is waiting to start: ContainerCreating
ec2-user:~/environment:$ kubectl logs pod/chatqna-tgi-587b54f5ff-fcfqn
Error from server (BadRequest): container "tgi" in pod "chatqna-tgi-587b54f5ff-fcfqn" is waiting to start: ContainerCreating

yongfengdu commented 2 months ago

Could you also provide the output of kubectl describe for one of the stuck pods? For example: kubectl describe pod chatqna-tei-565488dd9-p4cj7
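
Since kubectl logs returns nothing while a container is still in ContainerCreating, the describe output and the pod events are where the cause usually shows up. A minimal sketch for collecting them for every stuck pod at once follows; the helper names stuck_pods and describe_stuck are hypothetical and not part of the OPEA repositories, and the filter assumes the default kubectl get pods column layout (NAME, READY, STATUS, ...).

```shell
# Print the names of pods whose STATUS column is neither Running nor Completed.
# Reads `kubectl get pods` output (including the header row) on stdin.
stuck_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Describe each stuck pod and list its events.
# Requires kubectl and access to the cluster, so it is only defined here, not run.
describe_stuck() {
  kubectl get pods | stuck_pods | while read -r pod; do
    echo "=== $pod ==="
    kubectl describe pod "$pod"
    kubectl get events --field-selector involvedObject.name="$pod"
  done
}
```

Running describe_stuck in the namespace where ChatQnA is deployed should print one section per stuck pod; mount failures typically appear in the Events section at the end of each describe output.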

Without more logs, my first guess is the volume path, which is easy to forget to set or modify. Make sure the directory /mnt/opea-models exists on the node where the ChatQnA workload is running; it is used to store the cached models. Otherwise, modify chatqna.yaml so that model-volume points to a directory that does exist on the node.
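
As a rough illustration of the directory step, the sketch below creates the model cache directory if it is missing. The function name ensure_model_dir is hypothetical; the default path is the one named in this comment. It has to run on the node that hosts the workload, and on most nodes writing under /mnt needs root.

```shell
# Create the model cache directory if it does not exist yet.
# $1: directory path (defaults to /mnt/opea-models, the path used by chatqna.yaml).
ensure_model_dir() {
  model_dir="${1:-/mnt/opea-models}"
  if [ -d "$model_dir" ]; then
    echo "exists: $model_dir"
  else
    mkdir -p "$model_dir" && echo "created: $model_dir"
  fi
}
```

On a single-node cluster this would be called once on that node (for example from a root shell) before applying the manifest.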

arun-gupta commented 2 months ago

I shut down the cluster and will recreate it for you.

Creating a directory /mnt/opea-models on a specific node does not seem like a k8s-native approach. In a multi-node cluster this would be tricky. Can this be done using a PVC instead?

yongfengdu commented 2 months ago

Yes, PVC is already supported when deploying with the Helm charts: https://github.com/opea-project/GenAIInfra/tree/main/helm-charts#using-persistent-volume
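
For readers unfamiliar with PVCs, a generic claim looks roughly like the sketch below. This is not the template used by the OPEA Helm charts; the function name pvc_yaml, the claim name opea-models-pvc, and the 50Gi size are all placeholders chosen for illustration. It relies on the cluster's default StorageClass, and sharing one claim across several nodes would need a storage backend that supports ReadWriteMany.

```shell
# Print a generic PersistentVolumeClaim manifest to stdout.
# $1: claim name (hypothetical default), $2: requested size (hypothetical default).
pvc_yaml() {
  pvc_name="${1:-opea-models-pvc}"
  pvc_size="${2:-50Gi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${pvc_name}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${pvc_size}
EOF
}
```

The output can be reviewed first and then piped to kubectl apply -f - to create the claim.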

The manifest-based deployment is not as flexible, and we want to provide manifests that need as few configuration changes as possible (we assume a PVC would require additional setup). Maybe the best approach for the manifests is not to set model-volume at all and to have the model downloaded at container startup. We can remove the model-volume dependency if you think that is better.

arun-gupta commented 2 months ago

Anything that requires customization outside of the Helm charts will add to developer friction and should be minimized.

Either way, the /mnt/opea-models step is not documented. I'd recommend removing it, though that will add to the container startup time.

yongfengdu commented 2 months ago

This /mnt/opea-models path issue has been fixed by #745. By default, tgi/tei now use a temporary volume to download and save models.
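
To make the difference concrete, the sketch below prints the two volume styles discussed in this thread. It is an illustration only: the exact stanzas in chatqna.yaml before and after #745 are not shown here, so the hostPath type and the use of emptyDir for the temporary volume are assumptions, and the function name model_volume_yaml is hypothetical. Only the volume name model-volume and the path /mnt/opea-models come from the discussion above.

```shell
# Print an example `volumes:` stanza for the model cache.
# $1: "hostpath" (node-local directory that must already exist)
#     or "emptydir" (temporary volume that lives only as long as the pod).
model_volume_yaml() {
  case "$1" in
    hostpath)
      cat <<'EOF'
volumes:
  - name: model-volume
    hostPath:
      path: /mnt/opea-models
      type: Directory
EOF
      ;;
    emptydir)
      cat <<'EOF'
volumes:
  - name: model-volume
    emptyDir: {}
EOF
      ;;
    *)
      echo "usage: model_volume_yaml hostpath|emptydir" >&2
      return 1
      ;;
  esac
}
```

With a hostPath of type Directory, a missing directory keeps the pod from starting, which matches the symptom reported here. An emptyDir needs no node preparation, but the models are downloaded again whenever the pod is recreated.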