ledor473 opened this issue 6 years ago
@jpkrohling I would like to pick this up. Could you please give some pointers?
@chandresh-pancholi take a look at the healthcheck package and get familiar with how it works. After that, you'll need to find a way to propagate the information from the storage level up to the health check handler. I think we might have discussed this in the past, so do a quick search for "health check" in our issues, or in the PRs that introduced/changed the health check.
I guess the change itself might be easy, but figuring out the appropriate way to do it might be tricky :-)
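For illustration, here is a minimal sketch (standard library only; the names StorageHealth, healthReportingWriter, and the simplified SpanWriter interface are made up and are not the actual healthcheck package API) of one way that storage state could be propagated up to a health handler: a writer wrapper records the outcome of each write in a shared flag that the health endpoint can read.

```go
// Hypothetical sketch: propagate storage write failures up to the health check.
// Names are illustrative, not Jaeger code.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// StorageHealth is shared between the storage layer and the health handler.
type StorageHealth struct{ ok atomic.Bool }

func (h *StorageHealth) SetOK(ok bool) { h.ok.Store(ok) }
func (h *StorageHealth) OK() bool      { return h.ok.Load() }

// SpanWriter is a simplified stand-in for the storage writer interface.
type SpanWriter interface {
	WriteSpan(span []byte) error
}

// healthReportingWriter wraps a real writer and reports failures upward.
type healthReportingWriter struct {
	next   SpanWriter
	health *StorageHealth
}

func (w *healthReportingWriter) WriteSpan(span []byte) error {
	err := w.next.WriteSpan(span)
	w.health.SetOK(err == nil) // the health handler reads health.OK()
	return err
}

// failingWriter simulates a Cassandra outage.
type failingWriter struct{}

func (failingWriter) WriteSpan([]byte) error {
	return errors.New("no hosts available in the pool")
}

func main() {
	health := &StorageHealth{}
	health.SetOK(true)

	writer := &healthReportingWriter{next: failingWriter{}, health: health}
	_ = writer.WriteSpan([]byte("span"))

	fmt.Println("storage ok?", health.OK()) // false -> surface this at /health
}
```

An atomic flag keeps the write path cheap; a small status struct with a timestamp and last error would work just as well if the handler should also report details.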
I'd like to understand if this is really the behavior we want. If the collector starts returning errors from the health check, then the orchestration system can quite possibly restart it. Killing a service because its downstream is unavailable sounds counterintuitive to me.
The health check HTTP status code can indicate that the service is degraded, and the response content can include details about what's wrong. As long as the HTTP status code isn't in the 4xx/5xx class, the orchestration engine shouldn't kill instances.
But I think that was the original intention of the ticket. What is that "other" status code that could be used? What will react to it and how?
Any 2xx/3xx can be used except perhaps for 204 (No Content) and 200 (OK). Kubernetes will accept anything greater than or equal to 200 and lower than 400 as a success.
At least at the beginning, we shouldn't be returning anything that would cause instances to be automatically killed. The monitoring tools used by our users should be the ones taking this decision, possibly by looking into the response's content.
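As a sketch of what that could look like (a hypothetical handler, not the existing Jaeger endpoint; the path, port, and payload fields are assumptions): return 200 when everything is fine, and a different sub-400 code such as 203 plus a JSON body when the service is up but its storage is not, so Kubernetes treats the probe as passing while external monitoring can inspect the payload.

```go
// Hypothetical /health handler: stay below 400 so liveness probes pass,
// but use a non-200 code plus a JSON payload to signal "up but degraded".
package main

import (
	"encoding/json"
	"net/http"
)

// storageOK is a stand-in for the propagated storage state.
var storageOK = func() bool { return false }

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	body := map[string]string{"status": "ok"}
	code := http.StatusOK // 200: fully healthy

	if !storageOK() {
		body["status"] = "degraded"
		body["detail"] = "storage unreachable" // e.g. "no hosts available in the pool"
		code = http.StatusNonAuthoritativeInfo // 203: still a "success" for the probe
	}

	w.WriteHeader(code)
	json.NewEncoder(w).Encode(body)
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":8080", nil) // port and path are illustrative
}
```

Whether 203, some other 2xx/3xx code, or plain 200 with a detailed body is the right choice is exactly the open question in this thread; the only hard constraint stated above is staying below 400.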
If it's not a response that orchestration would do something about, then why bother surfacing this via the health check? I would rather steer people to use proper telemetry; there are a bunch of metrics about Cassandra usage, including failures to write.
The health checks aren't only for the orchestration layer. In @ledor473's specific case, we should track what exactly is wrong and what we can do inside Jaeger (or the operator) to solve that problem automatically. While this is not fixed in the code, a monitoring solution (or script?) can exist somewhere, checking the HTTP status code and/or the payload and restarting the process when that specific problem is happening.
The operator could also return a richer status report as part of the JaegerStatus object, shown via kubectl describe jaeger my-jaeger (among others).
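The "monitoring solution (or script?)" mentioned above could be as small as the following sketch, which assumes the degraded payload from the earlier example; the endpoint, field names, and restart action are all assumptions, not existing Jaeger behavior.

```go
// Hypothetical watchdog: poll the health endpoint, look at both the status
// code and the JSON payload, and trigger a restart when storage is degraded.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type healthPayload struct {
	Status string `json:"status"`
	Detail string `json:"detail"`
}

func main() {
	const endpoint = "http://localhost:8080/health" // illustrative

	for range time.Tick(30 * time.Second) {
		resp, err := http.Get(endpoint)
		if err != nil {
			log.Printf("health endpoint unreachable: %v", err)
			continue
		}

		var payload healthPayload
		_ = json.NewDecoder(resp.Body).Decode(&payload)
		resp.Body.Close()

		if resp.StatusCode >= 400 || payload.Status == "degraded" {
			log.Printf("unhealthy (code=%d, detail=%q), restarting collector",
				resp.StatusCode, payload.Detail)
			// The actual restart action is environment-specific:
			// e.g. delete the pod, or signal a process supervisor.
		}
	}
}
```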
what exactly is wrong
That's a tall order. In @ledor473's specific case, re-initializing Cassandra sessions would've probably fixed the issue - that would be useful behavior to support. Aside from that, I don't see what other useful things we are able to detect that are not already included in the metrics we emit. Transforming those metrics into the details of the healthcheck output doesn't strike me as particularly useful code to support.
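For the specific failure mode reported here, a hedged sketch of what "re-initializing Cassandra sessions" could mean with gocql: the wrapper type and its Exec method are made up for illustration, while gocql.ErrNoConnections (the "gocql: no hosts available in the pool" sentinel), CreateSession, Query/Exec, and Close are regular gocql API.

```go
// Hypothetical wrapper: when a query fails because the connection pool is
// empty, close the session and create a new one instead of staying broken.
package main

import (
	"errors"
	"log"
	"sync"

	"github.com/gocql/gocql"
)

type reconnectingSession struct {
	mu      sync.Mutex
	cluster *gocql.ClusterConfig
	session *gocql.Session
}

func newReconnectingSession(hosts ...string) (*reconnectingSession, error) {
	cluster := gocql.NewCluster(hosts...)
	session, err := cluster.CreateSession()
	if err != nil {
		return nil, err
	}
	return &reconnectingSession{cluster: cluster, session: session}, nil
}

// Exec runs a statement and re-initializes the session once if the pool is empty.
func (r *reconnectingSession) Exec(stmt string, values ...interface{}) error {
	r.mu.Lock()
	session := r.session
	r.mu.Unlock()

	err := session.Query(stmt, values...).Exec()
	if !errors.Is(err, gocql.ErrNoConnections) {
		return err // nil, or some unrelated failure
	}

	log.Println("no hosts in pool, re-creating Cassandra session")
	r.mu.Lock()
	defer r.mu.Unlock()
	r.session.Close()
	newSession, cerr := r.cluster.CreateSession()
	if cerr != nil {
		return cerr // still down; this is what the health check should surface
	}
	r.session = newSession
	return r.session.Query(stmt, values...).Exec()
}

func main() {
	s, err := newReconnectingSession("cassandra-1", "cassandra-2") // illustrative hosts
	if err != nil {
		log.Fatal(err)
	}
	if err := s.Exec("INSERT INTO traces.spans (id) VALUES (?)", "abc"); err != nil { // illustrative statement
		log.Println("write failed:", err)
	}
}
```

Whether gocql surfaces pool exhaustion as this exact sentinel in every code path is worth verifying; if it doesn't, matching on the error message would be the fallback.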
We are not using Cassandra for Jaeger anymore, so it's hard for me to provide any additional context.
We are observing a strange issue where both the Query and the Collector do not reconnect to the Cassandra cluster, and both log "gocql: no hosts available in the pool".
The problem seems to be either with GoCQL or with our setup, but I think the Query and Collector health checks should reflect that as a problem.
As of now, killing the pods was enough to fix the issue.