jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Add Cassandra in healthcheck #633

Open · ledor473 opened this issue 6 years ago

ledor473 commented 6 years ago

We are observing a strange issue where neither Query nor Collector reconnects to the Cassandra cluster, and both log "gocql: no hosts available in the pool".

The problem seems to lie either with GoCQL or with our setup, but I think the Query and Collector health checks should surface it as a problem.

As of now, killing the pods was enough to fix the issue.

chandresh-pancholi commented 4 years ago

@jpkrohling I would like to pick this up. Could you please give some pointers?

jpkrohling commented 4 years ago

@chandresh-pancholi take a look at the healthcheck package and get familiar with how it works. After that, you'll need to find a way to propagate the information from the storage level up to the health check handler. I think we might have discussed this in the past, so do a quick search for "health check" in our issues, or in the PRs that introduced or changed the health check.

I guess the change itself might be easy, but figuring out the appropriate way to do it might be tricky :-)
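For anyone picking this up, here is a minimal sketch in Go of the propagation idea described above. All names here (`StatusReporter`, `hcState`, `cassandraWriter`) are invented for illustration and are not Jaeger's actual healthcheck API:

```go
package main

import "sync/atomic"

// StatusReporter is a hypothetical hook that the health check side could
// expose to the storage layer; the names are invented, not Jaeger's
// actual healthcheck package API.
type StatusReporter interface {
	SetStorageOK(ok bool)
}

// hcState is a trivial reporter backed by an atomic flag; the HTTP health
// handler (not shown) would read this flag on every probe.
type hcState struct{ ok atomic.Bool }

func (h *hcState) SetStorageOK(ok bool) { h.ok.Store(ok) }

// cassandraWriter sketches the storage side: each write reports its outcome
// upward, so a run of "no hosts available" errors becomes visible to the
// health check without the two packages importing each other directly.
type cassandraWriter struct {
	reporter StatusReporter
}

func (c *cassandraWriter) WriteSpan(span []byte) error {
	err := c.doWrite(span) // the real gocql query would live here
	c.reporter.SetStorageOK(err == nil)
	return err
}

func (c *cassandraWriter) doWrite([]byte) error { return nil } // stub

func main() {
	hc := &hcState{}
	w := &cassandraWriter{reporter: hc}
	_ = w.WriteSpan([]byte("span"))
}
```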

yurishkuro commented 4 years ago

I'd like to understand whether this is really the behavior we want. If the collector starts returning errors from the health check, the orchestration system may well restart it. Killing a service because its downstream is unavailable sounds counterintuitive to me.

jpkrohling commented 4 years ago

The health check HTTP status code can indicate that the service is degraded, and the response content can include details about what's wrong. As long as the HTTP status code isn't in the 4xx/5xx class, the orchestration engine shouldn't kill instances.
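As an illustration of this approach, a handler could keep the probe status in the 2xx range while describing the degradation in the body. This is a sketch under assumed names; the JSON shape and the `/health` path are invented, not an existing Jaeger format:

```go
package main

import (
	"fmt"
	"net/http"
)

// checkStorage is a stand-in for a real probe, e.g. a cheap query against
// the Cassandra session; here it always reports the failure from the issue.
func checkStorage() error {
	return fmt.Errorf("gocql: no hosts available in the pool")
}

// healthHandler keeps the status code out of the 4xx/5xx range so the
// orchestrator treats the probe as passing, while the body tells monitoring
// tools what is degraded. The JSON shape is invented for illustration.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	if err := checkStorage(); err != nil {
		// 203 Non-Authoritative Information: still "success" to a
		// Kubernetes HTTP probe, but distinguishable from a plain 200.
		w.WriteHeader(http.StatusNonAuthoritativeInfo)
		fmt.Fprintf(w, `{"status":"degraded","details":[%q]}`, err.Error())
		return
	}
	fmt.Fprint(w, `{"status":"ok"}`)
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":14269", nil) // 14269 is the collector's admin port
}
```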

yurishkuro commented 4 years ago

But I think that was the original intention of the ticket. What is that "other" status code that could be used? What will react to it and how?

jpkrohling commented 4 years ago

Any 2xx/3xx status could be used, except perhaps 204 (No Content) and 200 (OK). Kubernetes treats anything greater than or equal to 200 and lower than 400 as success.

At least at the beginning, we shouldn't return anything that would cause instances to be automatically killed. The monitoring tools used by our users should be the ones making this decision, possibly by looking at the response content.
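For reference, this is my understanding of the success rule Kubernetes applies to HTTP probes (any status at or above 200 and below 400 passes), expressed as a small Go check:

```go
package main

import "fmt"

// probeSucceeded mirrors the rule described above for Kubernetes HTTP
// probes: any status greater than or equal to 200 and lower than 400
// counts as success.
func probeSucceeded(status int) bool {
	return status >= 200 && status < 400
}

func main() {
	for _, code := range []int{200, 203, 204, 307, 404, 503} {
		fmt.Printf("%d -> success=%v\n", code, probeSucceeded(code))
	}
}
```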

yurishkuro commented 4 years ago

If it's not a response that orchestration would do something about, then why bother surfacing this via the health check? I would rather steer people toward proper telemetry; there are a bunch of metrics about Cassandra usage, including write failures.

jpkrohling commented 4 years ago

The health checks aren't only for the orchestration layer. In @ledor473's specific case, we should track down what exactly is wrong and what we can do inside Jaeger (or the operator) to solve that problem automatically. Until that is fixed in the code, a monitoring solution (or script?) could exist somewhere, checking the HTTP status code and/or the payload and restarting the process when that specific problem occurs; see the sketch after this comment.

The operator could also return a richer status report as part of the `JaegerStatus` object, shown via `kubectl describe jaeger my-jaeger` (among others).
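A sketch of the kind of external watchdog mentioned above, assuming the hypothetical degraded-status payload from earlier; the endpoint URL, JSON fields, and restart command are all placeholders:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"time"
)

// healthBody matches the hypothetical degraded-status payload sketched
// earlier in this thread; the field names are invented for illustration.
type healthBody struct {
	Status  string   `json:"status"`
	Details []string `json:"details"`
}

// watch polls the health endpoint and restarts the service when the
// degraded state shows up. The URL and restart command are placeholders;
// on Kubernetes this role would fall to a sidecar or the operator rather
// than a shell-out.
func watch(url string) {
	for range time.Tick(30 * time.Second) {
		resp, err := http.Get(url)
		if err != nil {
			continue // endpoint unreachable; liveness probes cover that case
		}
		var body healthBody
		json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if body.Status == "degraded" {
			log.Printf("restarting, health reported: %v", body.Details)
			exec.Command("systemctl", "restart", "jaeger-collector").Run()
		}
	}
}

func main() { watch("http://localhost:14269/health") }
```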

yurishkuro commented 4 years ago

> what exactly is wrong

That's a tall order. In @ledor473's specific case, re-initializing the Cassandra sessions would probably have fixed the issue - that would be useful behavior to support. Aside from that, I don't see what other useful things we could detect that aren't already covered by the metrics we emit. Transforming those metrics into the details of the health check output doesn't strike me as particularly useful code to support.
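A rough sketch of what that re-initialization could look like around gocql. `gocql.ErrNoConnections` is the sentinel behind the "no hosts available in the pool" message, but the wrapper structure here is invented for illustration, not how Jaeger's cassandra storage is actually organized:

```go
package main

import (
	"errors"
	"log"
	"sync"

	"github.com/gocql/gocql"
)

// reconnectingSession wraps a gocql session and rebuilds it when the host
// pool comes back empty, instead of waiting for gocql to recover on its
// own. This is a sketch of the idea only.
type reconnectingSession struct {
	mu      sync.Mutex
	cluster *gocql.ClusterConfig
	session *gocql.Session
}

func (r *reconnectingSession) exec(stmt string, values ...interface{}) error {
	r.mu.Lock()
	s := r.session
	r.mu.Unlock()

	err := s.Query(stmt, values...).Exec()
	if errors.Is(err, gocql.ErrNoConnections) { // "gocql: no hosts available in the pool"
		log.Println("host pool empty, re-initializing Cassandra session")
		r.mu.Lock()
		defer r.mu.Unlock()
		if r.session == s { // another goroutine may have already reconnected
			s.Close()
			if fresh, cerr := r.cluster.CreateSession(); cerr == nil {
				r.session = fresh
			}
		}
	}
	return err
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // placeholder contact point
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	rs := &reconnectingSession{cluster: cluster, session: session}
	_ = rs.exec("SELECT now() FROM system.local")
}
```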

ledor473 commented 4 years ago

We are not using Cassandra for Jaeger anymore, so it's hard for me to provide any additional context.