rrichardson opened this issue 4 years ago
@rubenvp8510 are you able to look into this?
I wasn't able to reproduce this issue.
The spark job error indicates that Cassandra is closing the connection for some reason. It could be that the Cassandra cluster is in a bad state (which I doubt, given that Jaeger is working properly), or it may just be a misconfiguration issue. It would be interesting to know the reason. Could you attach your Cassandra logs? That could help figure out what is happening.
However, for some reason, the operator seems to think it needs to keep re-running the spark-dependencies job
@rubenvp8510: do we have any logic in the code that would trigger this? What happens during reconciliation of the job object once the initial setup has been made?
@rrichardson, logs would really be helpful to determine why the jobs are failing.
I don't think the operator has any logic for running the spark-dependencies job; it only creates the CronJob, which is executed according to its cron schedule.
Not for running, but for (re)creating.
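For context, the CronJob in question is roughly of the following shape. This is only an illustrative sketch, not the operator's actual output: the name, schedule, contact point, and keyspace values are assumptions, and the image is the stock jaegertracing/spark-dependencies job.

```yaml
apiVersion: batch/v1beta1                   # CronJob API group in use at the time of this issue
kind: CronJob
metadata:
  name: my-jaeger-spark-dependencies       # illustrative name
spec:
  schedule: "55 23 * * *"                   # illustrative; the operator sets its own default
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: my-jaeger-spark-dependencies
              image: jaegertracing/spark-dependencies
              env:
                - name: STORAGE
                  value: cassandra
                - name: CASSANDRA_CONTACT_POINTS
                  value: cassandra:9042     # illustrative contact point
                - name: CASSANDRA_KEYSPACE
                  value: jaeger_v1_dc1      # illustrative keyspace
```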
Well, if you delete the CronJob definition, it may be re-created on the next reconciliation. The only way to disable it completely is to update the Jaeger CR, setting dependencies.enabled to false.
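For example, a minimal Jaeger CR sketch with the dependencies job disabled might look like this (assuming Cassandra storage; the resource name is illustrative, and the exact field path is worth verifying against your operator version):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger            # illustrative name
spec:
  storage:
    type: cassandra
    dependencies:
      enabled: false         # stops the operator from (re)creating the spark-dependencies CronJob
```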
Sorry, I should have been clearer earlier. Basically, I consider it an error in the Jaeger Operator if it leaves the cluster in a state where a job keeps failing over and over. It might be how we provision things, or it might be an underlying error condition that we are not accounting for.
Here are the concatenated logs of the 3 Cassandra nodes.
I can't find anything indicating that the job even attempted and failed to connect.
@rrichardson, how long does the job run before it fails? Could it be that the connection is being closed due to some timeout? I suppose you don't see the dependency graph in the Jaeger UI, as the job is failing, but could you please double-check that this is indeed the case?
From the Cassandra logs, I see some GC pauses, so it would not surprise me if this is caused by a timeout. Is there a way to increase the timeout in the job? Maybe we can try that to see if it fixes the issue.
I encounter this when I use Elasticsearch as well, since it also creates the jaeger-spark-dependencies job. Is there a way to disable it completely?
I have the Jaeger Operator running (quite well, I might add) on a k8s cluster with a small, 3-node Cassandra cluster.
Initial setup works just fine. It is handling the load of 100% sampling in our cluster.
However, for some reason, the operator seems to think it needs to keep re-running the spark-dependencies job, and it keeps failing. I don't know if it's ever succeeded, but I regularly delete the job just to shut up the alarms that it generates.
The error log file is attached.
jaeger.log
Here is my Jaeger manifest.