jaegertracing / jaeger-operator

Jaeger Operator for Kubernetes simplifies deploying and running Jaeger on Kubernetes.
https://www.jaegertracing.io/docs/latest/operator/
Apache License 2.0

Spark dependencies job failing (and constantly being re-created) [cassandra] #1163

Open · rrichardson opened this issue 4 years ago

rrichardson commented 4 years ago

I have the Jaeger Operator running (quite well, I might add) on a k8s cluster with a small, 3-node Cassandra cluster.
The initial setup works just fine. It is handling the load of 100% sampling in our cluster.

However, for some reason, the operator seems to think it needs to keep re-running the spark-dependencies job, and it keeps failing. I don't know if it has ever succeeded, but I regularly delete the job just to silence the alarms it generates.
The error log file is attached.
jaeger.log

Here is my Jaeger manifest.


apiVersion: v1
items:
- apiVersion: jaegertracing.io/v1
  kind: Jaeger
  metadata:
    annotations:
    creationTimestamp: "2020-08-08T15:40:54Z"
    generation: 3
    labels:
      jaegertracing.io/operated-by: jaeger.jaeger-operator
    managedFields:
    - manager: kubectl
      operation: Update
      time: "2020-08-08T15:40:54Z"
    - apiVersion: jaegertracing.io/v1
      fieldsType: FieldsV1
      manager: jaeger-operator
      operation: Update
      time: "2020-08-08T15:41:03Z"
    name: core
    namespace: jaeger
    resourceVersion: "24044973"
    selfLink: /apis/jaegertracing.io/v1/namespaces/jaeger/jaegers/core
    uid: c93dbbef-e917-4713-becc-362acb1227b1
  spec:
    agent:
      config: {}
      options: {}
      resources: {}
    allInOne:
      config: {}
      options: {}
      resources: {}
    collector:
      config: {}
      options: {}
      resources: {}
    ingester:
      config: {}
      options: {}
      resources: {}
    ingress:
      enabled: false
      openshift: {}
      options: {}
      resources: {}
      security: none
    query:
      options: {}
      resources: {}
    resources: {}
    sampling:
      options:
        default_strategy:
          param: 100
          type: probabilistic
    storage:
      cassandraCreateSchema:
        datacenter: jaeger-us-east-1
        enabled: true
        mode: test
      dependencies:
        enabled: true
        resources: {}
        schedule: 55 23 * * *
      elasticsearch:
        nodeCount: 3
        redundancyPolicy: SingleRedundancy
        resources:
          limits:
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi
        storage: {}
      esIndexCleaner:
        numberOfDays: 7
        resources: {}
        schedule: 55 23 * * *
      esRollover:
        resources: {}
        schedule: 0 0 * * *
      options:
        cassandra:
          servers: cassandra
      type: cassandra
    strategy: allinone
    ui:
      options:
        menu:
        - items:
          - label: Documentation
            url: https://www.jaegertracing.io/docs/1.18
          label: About
  status:
    phase: Running
    version: 1.18.1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
jpkrohling commented 4 years ago

@rubenvp8510 are you able to look into this?

rubenvp8510 commented 4 years ago

I wasn't able to reproduce this issue.

The Spark job error indicates that, for some reason, Cassandra is closing the connection. It could be that the Cassandra cluster is in a bad state (which I doubt, given that Jaeger is working properly), or it may just be a misconfiguration. It would be interesting to know the reason. Could you attach your Cassandra logs? That could help figure out what is happening.
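
For reference, something like this should collect those logs, assuming the three Cassandra nodes run as pods named cassandra-0 through cassandra-2 in a cassandra namespace (hypothetical names, adjust to your deployment):

```sh
# Hypothetical pod and namespace names; adjust to match your Cassandra deployment.
kubectl logs cassandra-0 -n cassandra > cass-0.log
kubectl logs cassandra-1 -n cassandra > cass-1.log
kubectl logs cassandra-2 -n cassandra > cass-2.log
cat cass-*.log > cass.log
```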

jpkrohling commented 4 years ago

> However, for some reason, the operator seems to think it needs to keep re-running the spark-dependencies job

@rubenvp8510: do we have any logic in the code that would trigger this? What happens during the reconciliation of the Job object once the initial setup has been made?

@rrichardson, logs would really be helpful to determine why the jobs are failing.

rubenvp8510 commented 4 years ago

I don't think the operator has any logic for running the spark-dependencies job; it only creates the CronJob, which is executed according to its cron schedule.
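
To see what is actually there, the CronJob and the Jobs it spawns can be inspected directly; a quick sketch (the exact resource names depend on the Jaeger instance name, so check the listing first):

```sh
# List the CronJob the operator created, with its schedule and last run time
kubectl get cronjobs -n jaeger

# List the Jobs spawned from it, then inspect the CronJob found above
kubectl get jobs -n jaeger
kubectl describe cronjob <spark-dependencies-cronjob-name> -n jaeger
```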

jpkrohling commented 4 years ago

Not for running, but for (re)creating.

rubenvp8510 commented 4 years ago

Well, if you delete the CronJob definition, it may be re-created on the next reconciliation. The only way to disable it completely is to update the Jaeger CR, setting storage.dependencies.enabled to false.
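
Against the manifest posted above, that would look roughly like this (only the relevant fields shown):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: core
  namespace: jaeger
spec:
  storage:
    type: cassandra
    dependencies:
      enabled: false   # stop the operator from (re)creating the spark-dependencies CronJob
```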

jpkrohling commented 4 years ago

Sorry, I should have been clearer earlier. Basically, I consider it an error in the Jaeger Operator if it leaves the cluster in a state where a job keeps failing over and over. It might be how we provision things, or it might be an underlying error condition that we are not accounting for.

rrichardson commented 4 years ago

Here are the concatenated logs of the three Cassandra nodes.
I can't find anything indicating that the job even attempted to connect and failed.

cass.log

jpkrohling commented 4 years ago

@rrichardson, how long does the job run before it fails? Could it be that the connection is being closed due to some timeout? I suppose you don't see the dependency graph in the Jaeger UI, as the job is failing, but could you please double-check that this is indeed the case?
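
Something along these lines should show the run duration and the failure output (the pod name is a placeholder; look it up first):

```sh
# Completions and duration of the dependency jobs
kubectl get jobs -n jaeger -o wide

# Find the failing pod and read its logs
kubectl get pods -n jaeger | grep spark-dependencies
kubectl logs <spark-dependencies-pod-name> -n jaeger
```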

rubenvp8510 commented 4 years ago

From the Cassandra logs I see some GC pauses, so it would not surprise me if this is caused by a timeout. Is there a way to increase the timeout in the job? Maybe we can try that and see whether it fixes the issue.
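
If the root cause really is GC pauses pushing requests past the server-side limits, one thing to try is raising the request timeouts in cassandra.yaml; a sketch with Cassandra 3.x option names and purely illustrative values (note this does not change any client-side timeout inside the Spark job itself):

```yaml
# cassandra.yaml (Cassandra 3.x option names; values are illustrative)
read_request_timeout_in_ms: 20000    # default 5000
range_request_timeout_in_ms: 30000   # default 10000
write_request_timeout_in_ms: 10000   # default 2000
```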

kholisrag commented 2 months ago

I encountered this when using Elasticsearch: the operator created the jaeger-spark-dependencies job. Is there a way to disable it completely?
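
As noted earlier in the thread, setting storage.dependencies.enabled to false in the Jaeger CR should stop the operator from managing the CronJob; a hedged sketch using the instance and namespace names from this issue (adjust to yours, and verify the CronJob name before deleting it):

```sh
# Tell the operator not to (re)create the spark-dependencies CronJob
kubectl patch jaeger core -n jaeger --type=merge \
  -p '{"spec":{"storage":{"dependencies":{"enabled":false}}}}'

# If the existing CronJob is not cleaned up on the next reconciliation, remove it manually
kubectl get cronjobs -n jaeger
kubectl delete cronjob <spark-dependencies-cronjob-name> -n jaeger
```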