jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Jaeger operational observability #1054

Open vprithvi opened 6 years ago

vprithvi commented 6 years ago

Requirement - what kind of business use case are you trying to solve?

Verifying whether all components of Jaeger have been deployed successfully without needing to run special applications, etc.

I would like confirmation of Jaeger agent and client health, version, reachability, and its ability to send spans.

Problem - what in Jaeger blocks you from solving the requirement?

Proposal - what do you suggest to solve the problem or improve the existing situation?

Have a page in the UI that allows users to see the following:

This page is read-only, maintains no history, and only shows this information in real time.

Any open questions to address

Determine how to get this data effectively with minimal changes. One approach may be to have Jaeger agents send spans when they start up and shut down.

jpkrohling commented 6 years ago

I think the main source of information should be the collector. A new module (say, "admin") would then query all known collectors and aggregate this information when needed.

Ideally, this admin module would have an extra management endpoint, to get notified when new collectors are added/removed from the cluster.

pavolloffay commented 6 years ago

Is it related to https://github.com/jaegertracing/jaeger/issues/789 (Could we provide pre-made Grafana dashboards for Jaeger backend components?)?

yurishkuro commented 5 years ago

Isn't there a way to solve this with existing Observability tools instead of building a bespoke solution?

vprithvi commented 5 years ago

@yurishkuro What are you suggesting?

yurishkuro commented 5 years ago

Expose metrics so that all these signals can be observed via existing tools. Maybe provide base Grafana dashboards.
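One common way to expose alphanumeric attributes like version through metrics is an "info"-style series: a gauge fixed at 1 whose labels carry the attributes, which dashboards can then join on. A minimal sketch in Go, using only the standard library; the metric and label names here are illustrative, not Jaeger's actual metric names:

```go
package main

import "fmt"

// buildInfoMetric renders a Prometheus-style "info" series: a gauge pinned
// at 1 whose labels carry the alphanumeric attributes (component, version)
// that cannot be stored as metric values. The name "jaeger_build_info" is
// an assumption for illustration.
func buildInfoMetric(component, version string) string {
	return fmt.Sprintf("jaeger_build_info{component=%q,version=%q} 1\n", component, version)
}

func main() {
	// In a real component this line would be part of the /metrics output
	// on the admin port; here we just print it.
	fmt.Print(buildInfoMetric("agent", "1.8.2"))
}
```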

vprithvi commented 5 years ago

I agree certain things can be served by Grafana, but I'm not convinced it is a good solution.

Let's take the problem of determining which hosts jaeger-agents run on and what versions these agents are on. What metrics would we emit to surface this?

Note that version numbers, host names, and IPv6 addresses are alphanumeric, and might not be easily stored by most metric backends. I'm not sure that storing them as tags is a good idea either, because tags usually don't have the same kind of lifecycle management as the metrics themselves.

yurishkuro commented 5 years ago

Note that version numbers, host names, and ipv6 are alphanumeric, and might not be easily stored by most metric backends.

These attributes are often provided by metrics collection pipeline. E.g. if Prometheus discovers a running service and scrapes its metrics, it already knows version, host, etc. of the process and can include them as tags without our code trying to figure out what they are (which is harder and sometimes impossible).
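As a concrete illustration of the pipeline attaching these attributes: Prometheus's Kubernetes service discovery exposes pod metadata as `__meta_*` labels, which relabeling rules can copy onto every scraped series. A sketch, assuming the pods carry `app` and `version` labels (those label names are an assumption):

```yaml
scrape_configs:
  - job_name: jaeger-agent
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled app=jaeger-agent (label name is an example).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: jaeger-agent
        action: keep
      # Attach host and version from discovery metadata, so the process
      # itself never has to report them.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: host
      - source_labels: [__meta_kubernetes_pod_label_version]
        target_label: version
```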

My view is that we need to provide minimal required observability that is under control of the components themselves, and not venture into trying to capture runtime environments.

vprithvi commented 5 years ago

These attributes are often provided by metrics collection pipeline.

I'm not sure - re-reading the ticket, my intent was to determine whether all components are wired up correctly and are connected.

In this context, I believe the question we are answering is whether there is a jaeger-agent with a compatible version range connected to jaeger-collector, and which host/etc that it is running on.

I'm opposed to solely relying on metrics because of the following:

  1. Tracing, an observability tool, shouldn't depend on another observability tool to report on its configuration and deployment. This also makes verifying things locally more involved, because people need to set up a metrics reporter, collector, and UI correctly just to check it.
  2. The metrics pipeline makes assumptions: what happens for users not using Prometheus? Or for users who have so many hosts that they are unable to label them in Prometheus?

jpkrohling commented 5 years ago

my intent was to determine whether all components are wired up correctly and are connected

This is absolutely something we need, but I think @yurishkuro's question is valid. Depending on how users provision their Jaeger in production, they would have a service map and/or inventory defined somewhere already. As an alternative, I think @isaachier mentioned in another ticket a nice approach using a credit-based system, where clients would spend credits to check via HTTP whether a certain debug span was received by the agent. So, clients could emit this debug span upon bootstrap if a certain env var/config option is set.

About the metrics part, we expose the metrics in Prometheus format, but that doesn't mean that only Prometheus can read it: pretty much all metrics systems nowadays can read Prometheus "format" (OpenMetrics).

I think this request did not come from any end user, so, we are not sure we actually need this UI. I think @objectiser is working on a Grafana dashboard. Perhaps it would be better to wait for that task to complete and then assess what is missing.

vprithvi commented 5 years ago

Depending on how users provision their Jaeger in production, they would have a service map and/or inventory defined somewhere already.

This is very likely, but there is an assumption here that end users of tracing have access to this list, which might not always be the case. Additionally, in some organizations, the host jaeger-agent might be operated by a different group of people than those who operate jaeger-collector. (This might extend to metric systems)

Currently, debugging any connectivity issues is extremely painful even without cross organization boundaries. When there are organizational boundaries and limited access to hosts running jaeger components, it is very time consuming to figure out whether spans don't end up on collectors due to misconfiguration/connectivity of jaeger components.

Tools like Flink have a status page that shows connected components and their status, which really aids in debugging. I feel that this could be quite useful for us.

@yurishkuro At a minimum, I would like to capture the following:

Any objections?

vprithvi commented 5 years ago

@yurishkuro bump

yurishkuro commented 5 years ago

I don't have a fundamental objection to this, but I prefer not to duplicate the data already present in the metrics. We recently did troubleshooting of connectivity from another DC, and I'm not sure that this extra page would've helped over just looking at metrics (which the other team couldn't do because our internal binary does not allow switching to Prometheus).

vprithvi commented 5 years ago

I don't have a fundamental objection to this, but I prefer not to duplicate the data already present in the metrics.

I also prefer not to duplicate metrics, but I feel that some duplication can have a lot of benefits. In fact, we are already in the process of doing this in https://github.com/jaegertracing/jaeger/pull/1465 where we are duplicating uptime and start time metrics.

I feel that answering the question of which jaeger-agents have been configured and started up correctly, without a dependency on metrics, is a good introspection capability to have at the collector.