jaegertracing / helm-charts

Helm Charts for Jaeger backend
Apache License 2.0

[Bug]: Unstable Jaeger Deployment with Cassandra; Cassandra STS is failing #555

Open yitzhtal opened 8 months ago

yitzhtal commented 8 months ago

What happened?

The Cassandra StatefulSet is not stable and its pods keep crashing.

Steps to reproduce

  1. Install the OTEL SDK on an application.
  2. Install the latest Jaeger Helm chart (1.0.0).

Expected behavior

Jaeger is available, with all pods running stably.

Relevant log output

INFO  [main] 2024-02-28 09:58:00,582 QueryProcessor.java:163 - Preloaded 0 prepared statements
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:657 - Cassandra version: 3.11.6
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:658 - Thrift API version: 20.1.0
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:659 - CQL supported versions: 3.4.4 (default: 3.4.4)
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:661 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO  [main] 2024-02-28 09:58:00,599 IndexSummaryManager.java:87 - Initializing index summary manager with a memory pool size of 99 MB and a resize interval of 60 minutes
INFO  [main] 2024-02-28 09:58:00,604 MessagingService.java:750 - Starting Messaging Service on /10.50.26.33:7000 (eth0)
INFO  [main] 2024-02-28 09:58:00,619 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
INFO  [HANDSHAKE-jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49] 2024-02-28 09:58:00,628 OutboundTcpConnection.java:561 - Handshaking version with jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49
INFO  [ScheduledTasks:1] 2024-02-28 09:58:03,885 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757)
ERROR [main] 2024-02-28 09:58:06,635 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 HintsService.java:209 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.50.26.33] 2024-02-28 09:58:06,638 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,759 HintsService.java:209 - Paused hints dispatch

Screenshot

Screenshot 2024-02-28 at 11 58 37

Additional context

Running Jaeger on a dedicated namespace on EKS.

Jaeger backend version

1.53.0

SDK

OpenTelemetry SDK.

Pipeline

No response

Storage backend

Cassandra

Operating system

Linux

Deployment model

Kubernetes

Deployment configs

provisionDataStore:
  cassandra: true
  elasticsearch: false
  kafka: false
agent:
  enabled: false
query:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - jaeger-ui-solutions.internal.lightrun.com
  config: |-
    {
      "dependencies": {
        "dagMaxNumServices": 200,
        "menuEnabled": true
      },
      "archiveEnabled": true,
      "tracking": {
        "gaID": "UA-000000-2",
        "trackErrors": true
      }
    }
cassandra:
  resources:
    requests:
      memory: 10Gi
      cpu: 6
    limits:
      memory: 16Gi
      cpu: 10
collector:
  service:
    otlp:
      grpc:
        name: otlp-grpc
        port: 4317
      http:
        name: otlp-http
        port: 4318

Vivekgaddigi commented 8 months ago

Try the latest version 1.0.2

yitzhtal commented 8 months ago

I upgraded to 1.0.2 and used a node selector to schedule onto more stable nodes (not spot instances). It works now. I'll see whether it stays stable and update here.

Vivekgaddigi commented 8 months ago

Close the issue if it's sorted.

yitzhtal commented 7 months ago

I still can't seem to make Jaeger stable. I got these errors:

ERROR [main] 2024-04-11 08:29:47,486 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.50.13.161). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
    at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:177) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1530) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1024) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:718) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 HintsService.java:209 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.50.10.10] 2024-04-11 08:29:47,489 MessagingService.java:1346 - MessagingService has terminated the accept() thread
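
For anyone triaging similar crash loops, here is a minimal Python sketch that scans Cassandra log text for the two startup failures quoted in this thread and reports which one occurred. The labels and the function name `classify_startup_errors` are hypothetical, chosen only for illustration; the matched message strings are taken from the logs in this issue, and other Cassandra versions may word them differently.

```python
import re

# Startup failures quoted in this thread. The dictionary keys are
# illustrative labels, not names used by Cassandra itself.
PATTERNS = {
    "concurrent_bootstrap": re.compile(
        r"Other bootstrapping/leaving/moving nodes detected"
    ),
    "strict_source_down": re.compile(
        r"A node required to move the data consistently is down \((?P<peer>[^)]+)\)"
    ),
}


def classify_startup_errors(log_text):
    """Return unique (label, detail) pairs in order of first appearance.

    detail is the peer address for "strict_source_down" and an empty
    string for patterns that capture nothing.
    """
    found = []
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match is None:
                continue
            detail = match.groupdict().get("peer", "")
            if (label, detail) not in found:
                found.append((label, detail))
    return found
```

Feeding it the saved output of a crashing pod's log separates the two cases: the first log in this issue would be labelled `concurrent_bootstrap`, and the later one `strict_source_down` together with the address of the peer reported as down.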

robertwenquan commented 3 weeks ago

Looks similar. I ran into this with the 3.0.10 chart: one of the pods keeps crashing.

jaeger-cassandra-0                  1/1     Running            0                  13d   10.0.3.24     c21   <none>           <none>
jaeger-cassandra-1                  0/1     CrashLoopBackOff   6 (2m7s ago)       12m   10.0.10.216   c34   <none>           <none>
jaeger-cassandra-2                  1/1     Running            0                  46d   10.0.0.47     p11   <none>           <none>
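
A rough Python sketch for picking out the failing pod from a listing like the one above. It assumes the first three whitespace-separated columns are NAME, READY, and STATUS, as they appear in the pasted output; the function name `unhealthy_pods` is hypothetical.

```python
def unhealthy_pods(listing):
    """Return (name, ready, status) for pods that are not Running or not fully ready.

    Assumes each row starts with NAME, READY ("ready/total") and STATUS
    columns, matching the listing pasted in this comment.
    """
    results = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and an optional header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_total = ready.partition("/")
        if status != "Running" or ready_now != ready_total:
            results.append((name, ready, status))
    return results
```

Run against the three rows above, this would flag only `jaeger-cassandra-1`, the pod in `CrashLoopBackOff`.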