langfuse / langfuse-k8s

Community-maintained Kubernetes config and Helm chart for Langfuse
https://langfuse.com
MIT License

all pods restart with latest chart version #30

Open didlawowo opened 2 days ago

didlawowo commented 2 days ago

I deployed with Helm; the previous version was working like a charm, but after upgrading I'm seeing restarts and errors:

Name:             langfuse-556d667545-z2gsx
Namespace:        mlops
Priority:         0
Service Account:  langfuse
Node:             rtx/192.168.1.29
Start Time:       Wed, 23 Oct 2024 12:19:19 +0200
Labels:           app.kubernetes.io/instance=langfuse
                  app.kubernetes.io/name=langfuse
                  kuik.enix.io/managed=true
                  pod-template-hash=556d667545
Annotations:      kuik.enix.io/rewrite-images: true
Status:           Running
IP:               10.0.3.146
IPs:
  IP:           10.0.3.146
Controlled By:  ReplicaSet/langfuse-556d667545
Containers:
  langfuse:
    Container ID:   containerd://4ce8023b1b941f97bfd814dbb7a2ef85a312bdb4161a03b410faddb757a33ae3
    Image:          ghcr.io/langfuse/langfuse:2
    Image ID:       ghcr.io/langfuse/langfuse@sha256:bd6b98db2706a16529ef0b59b618463bef1cc0be2b60cfa776273a06595977a0
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 23 Oct 2024 19:02:19 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Wed, 23 Oct 2024 18:59:04 +0200
      Finished:     Wed, 23 Oct 2024 18:59:31 +0200
    Ready:          True
    Restart Count:  8
    Liveness:       http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/api/public/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      NODE_ENV:                      production
      HOSTNAME:                      0.0.0.0
      PORT:                          3000
      DATABASE_USERNAME:             postgres
      DATABASE_PASSWORD:             <set to the key 'postgres-password' in secret 'langfuse-postgresql'>  Optional: false
      DATABASE_HOST:                 langfuse-postgresql
      DATABASE_NAME:                 postgres_langfuse
      NEXTAUTH_URL:                  https://langfuse.dc-tech.work
      NEXTAUTH_SECRET:               <set to the key 'nextauth-secret' in secret 'langfuse-nextauth'>  Optional: false
      SALT:                          changeme
      TELEMETRY_ENABLED:             true
      NEXT_PUBLIC_SIGN_UP_DISABLED:  false
      ENABLE_EXPERIMENTAL_FEATURES:  false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bsqvg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  kube-api-access-bsqvg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulling    20m (x4 over 6h44m)     kubelet  Pulling image "ghcr.io/langfuse/langfuse:2"
  Normal   Pulled     20m                     kubelet  Successfully pulled image "ghcr.io/langfuse/langfuse:2" in 3.234s (3.247s including waiting). Image size: 218575921 bytes.
  Normal   Created    20m (x4 over 6h43m)     kubelet  Created container langfuse
  Normal   Started    20m (x4 over 6h43m)     kubelet  Started container langfuse
  Warning  Unhealthy  20m (x2 over 6h43m)     kubelet  Liveness probe failed: Get "http://10.0.3.146:3000/": dial tcp 10.0.3.146:3000: connect: connection refused
  Normal   Killing    20m (x2 over 21m)       kubelet  Container langfuse failed liveness probe, will be restarted
  Warning  Unhealthy  11m (x6 over 21m)       kubelet  Liveness probe failed: Get "http://10.0.3.146:3000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  11m (x9 over 21m)       kubelet  Readiness probe failed: Get "http://10.0.3.146:3000/api/public/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  6m52s (x40 over 6h43m)  kubelet  Readiness probe failed: Get "http://10.0.3.146:3000/api/public/health": dial tcp 10.0.3.146:3000: connect: connection refused
  Warning  BackOff    108s (x14 over 6h43m)   kubelet  Back-off restarting failed container langfuse in pod langfuse-556d667545-z2gsx_mlops(50343193-a7c6-4f42-8738-85177df5abef)
dosubot[bot] commented 2 days ago

Hey there, @didlawowo! I'm here to help with any bugs, questions, or contributions you have. Let's tackle this issue together!

The pod restarts and errors you're experiencing with the latest version of the Langfuse Helm chart appear to be caused by failing liveness and readiness probes. The events show the probes timing out or being refused on port 3000, so the kubelet marks the pod unhealthy and restarts it. The last state's exit code 143 (128 + SIGTERM) is consistent with the kubelet terminating the container after the liveness probe failed.

Here are a few steps you can take to troubleshoot and resolve the issue:

  1. Check Service Availability: Ensure that the Langfuse service is actually running and accessible on the expected port (3000). You can try accessing the service directly from within the cluster using a tool like curl to verify connectivity.

  2. Increase Probe Timeouts: The current timeout for both liveness and readiness probes is set to 1 second. This might be too short if the service takes longer to respond. Consider increasing the timeout to give the service more time to respond.

  3. Review Recent Changes: If the previous version was working fine, review any changes made in the latest version of the Helm chart that might affect the service's startup time or resource usage. There might be changes that require additional configuration or resources.

  4. Check Resource Limits: Ensure that the pod has sufficient CPU and memory resources allocated. Insufficient resources can lead to slow startup times and probe failures.

  5. Database Connectivity: Since Langfuse relies on a Postgres database, ensure that the database is accessible and properly configured. Any issues with the database connectivity can affect the readiness of the application.

  6. Logs and Metrics: Check the application logs and any available metrics for more detailed error messages or performance issues that might be causing the probes to fail.

By addressing these areas, you should be able to identify and resolve the cause of the pod restarts.
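For step 2, if the chart exposes probe overrides in values.yaml (an assumption; check the chart's documented values and key names before using this), a sketch of relaxed probe settings could look like the following. The paths and port name mirror the probes in the describe output above; the timings are illustrative, not recommendations:

```yaml
# Hypothetical values.yaml fragment -- key names depend on the chart's actual schema
langfuse:
  livenessProbe:
    httpGet:
      path: /
      port: http
    initialDelaySeconds: 30   # give the app and DB migrations time to start
    timeoutSeconds: 5         # was 1s, likely too tight under load
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /api/public/health
      port: http
    initialDelaySeconds: 10
    timeoutSeconds: 5
```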

To continue talking to Dosu, mention @dosu.

marcklingen commented 2 days ago

can you share logs from the container instance?
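A sketch of how those logs could be pulled (pod name and namespace taken from the describe output above; `--previous` returns output from the container instance that crashed, which is usually where the actual error is):

```shell
# Logs from the currently running container
kubectl logs -n mlops langfuse-556d667545-z2gsx

# Logs from the previous (terminated) container instance
kubectl logs -n mlops langfuse-556d667545-z2gsx --previous
```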

Steffen911 commented 18 hours ago

@didlawowo Could you also share the resources that you've allocated to the pod? How much traffic does it usually receive over a minute?
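Worth noting: the describe output above shows `QoS Class: BestEffort`, i.e. no requests or limits are set on the pod. If the chart supports a standard `resources` block in values.yaml (an assumption; verify against the chart), a placeholder example would be (figures are illustrative, not tuned recommendations):

```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 1Gi
```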