jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Add flag to control collector's serviceName cache #3295

Open Jaans opened 2 years ago

Jaans commented 2 years ago

Describe the bug

The deployment is an Elasticsearch backend with ILM-based rollover, populated by jaeger-collector, which receives gRPC-based tracing data from jaeger-agent.

The problem appears to be that jaeger-collector only ever sends the "service" index data to the backend store once. All "span" index data, by contrast, is sent to the backend.

Using Elasticsearch ILM index rollover (with short time frames) exposed this issue: "service" index data is deleted over time (as expected from ILM rollover), but jaeger-collector does not send "service" index data to the new rolled-over index (I guess it doesn't know that the index has rolled over and the old one has been deleted). This in turn breaks the Jaeger UI, because Search is primarily driven by the "Service" drop-down list.

By bouncing the jaeger-collector daemon/service, the "missing" services appear immediately in Jaeger UI.

To Reproduce

Steps to reproduce the behavior:

  1. Set up the Elasticsearch backend with ES+Kibana as per the current version and supporting deployment documentation.
  2. Deploy ILM and the Jaeger ES rollover init for ILM as per the current version and deployment documentation.
  3. Deploy jaeger-query for the UI that looks at the Elasticsearch backend.
  4. Deploy jaeger-collector to receive gRPC data from remote jaeger-agents and send it on to the Elasticsearch backend.
  5. Send tracing telemetry from a client application to jaeger-agent and observe the corresponding "services" and "spans" appearing in Elasticsearch and also in the Jaeger UI. This is stored in "jaeger-service-000001" and "jaeger-span-000001" respectively.
  6. ILM rollover results in new indexes "jaeger-service-000002" and "jaeger-span-000002", and the previous iteration is eventually deleted.
  7. More tracing telemetry from the client application successfully flows via jaeger-collector and populates "jaeger-span-000002", but "jaeger-service-000002" does not contain the services previously reported (at least that's what it looks like).
  8. The Jaeger UI "Service" drop-down list eventually becomes empty and no data can be shown. Queries using Kibana confirm that span data is present and still flowing in as the client application emits it.

If I stop the jaeger-collector daemon/service and start it again, the "services" appear almost immediately and are available for selection in the Jaeger UI.

It appears as if the collector retains some form of distinct service history.

This obviously doesn't happen with local all-in-one, but then again there is no scrubbing of the old data.

Expected behavior

I would expect the related "service" index data to be present so that the Jaeger UI allows me to view the spans in the currently "hot" indexes.

Screenshots

Initial population with the first set of "hot" indexes. All good with the services present: (two screenshots)

After rollover, where jaeger-service-000001 has been deleted, the 7 services are now gone (ignore the 1 "new" service in 000002, which was not previously emitted). Tracing data is still being sent for those same services, but only the "span" index shows them: (two screenshots)

Version (please complete the following information):

What troubleshooting steps did you try? Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

The tracing data flows all the way through successfully. It's just that jaeger-collector seems to send only new and distinct "service" index data (an optimisation, maybe?), but with index rollover that breaks the Jaeger UI because the services filter becomes empty, despite there being "span" index data.

pavolloffay commented 2 years ago

How often do you roll over to a new index? jaeger-collector caches service names for 12h: it stores a service name if it is not in the cache, and then does not store it again for the following 12h.

If the rollover is configured for, e.g., every 2h, then this cache might be the core of the problem. We could add a flag to control it.
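The interaction described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the collector's actual code: the class `ServiceNameWriter`, its method `on_span`, and the service name "checkout" are hypothetical, and the two Python sets merely stand in for the old and new service indexes. Only the 12h figure comes from this thread.

```python
from dataclasses import dataclass, field

CACHE_TTL_HOURS = 12.0  # cache duration quoted in this thread


@dataclass
class ServiceNameWriter:
    """Hypothetical model of a writer that deduplicates service-name writes."""

    ttl_hours: float = CACHE_TTL_HOURS
    # service name -> hour at which it was last written to storage
    last_written: dict = field(default_factory=dict)

    def on_span(self, service, now_hours, service_index):
        """Write the service name only on a cache miss or after expiry."""
        last = self.last_written.get(service)
        if last is not None and now_hours - last < self.ttl_hours:
            return False  # cache hit: the name is not sent again
        self.last_written[service] = now_hours
        service_index.add(service)
        return True


writer = ServiceNameWriter()
index_1 = set()  # stands in for the service index before rollover
index_2 = set()  # stands in for the service index after rollover

# Hour 0: first span for the service, so its name is written to index 1.
wrote_first = writer.on_span("checkout", 0.0, index_1)

# Hour 3: rollover has happened, but the cache entry is still fresh,
# so nothing is written to index 2.
wrote_after_rollover = writer.on_span("checkout", 3.0, index_2)
index_2_after_rollover = set(index_2)

# Hour 12: the cache entry has expired, so the name is written again.
wrote_after_expiry = writer.on_span("checkout", 12.0, index_2)
```

In this model a freshly constructed `ServiceNameWriter` starts with an empty cache, which is consistent with the observation that restarting the collector makes the services reappear right away.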

Jaans commented 2 years ago

@pavolloffay Thank you for your response. That would certainly explain the behaviour I'm seeing.

I'm not sure of the use case for others, but for us, we have our Elasticsearch configured such that only the jaeger-span-* indexes are rolled over (at 5GB intervals which turns out to be about every 3-4 hours).

We have however disabled ILM based rollover of the jaeger-service-* indexes because they have a very small amount of data really - the nature of our scenario I'm guessing. Disabling rollover for that index helps us avoid the above issue altogether, however, should we need to roll them over, we would need to restart at least one collector instance to store the services again.

Alternatively setting the interval for rollover of the jaeger-service-* index is a limited solution because even if we rolled the jaeger-service-* indexes every 2 or 7 days, we could still end up having to wait up to 12hrs before we have service data.
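To make the "up to 12hrs" figure concrete, here is a rough back-of-the-envelope sketch. It assumes that spans for the service keep arriving, that the collector rewrites the service name as soon as its cache entry expires, and that the name is unavailable from the moment the old index is deleted until that rewrite. The function name and its parameters are hypothetical, and all times are hours on an arbitrary clock.

```python
def hours_without_service(last_write_before_rollover_h, old_index_deleted_h, ttl_h=12.0):
    """Hours during which an active service is missing from the service index.

    The next write happens when the cache entry expires; any time between the
    deletion of the old index and that write is a gap in the service list.
    """
    next_write_h = last_write_before_rollover_h + ttl_h
    return max(0.0, next_write_h - old_index_deleted_h)


# Name written just before the old index is deleted: the full TTL is lost.
worst_case = hours_without_service(100.0, 100.0)

# Name written 5h before deletion: 7h remain until the cache entry expires.
partial = hours_without_service(95.0, 100.0)

# Cache entry expired (and the name was rewritten) before the deletion: no gap.
no_gap = hours_without_service(80.0, 100.0)
```

Under these assumptions the gap depends only on the cache duration and on when the old index is deleted, not on how often the index is rolled over, which is why a longer rollover interval alone does not remove the delay.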

Perhaps a command line option to set the cache duration is a better choice, allowing users to choose the delay/latency they are comfortable with?

That said, we currently face a more challenging problem with the index template being recreated / updated every time a jaeger-collector instance starts (or at least it looks like that is what's happening). The issue stems from us tweaking the index template with additional fields (based on what we need to search / report on specific to our scenario), but the problem is that the index template is lost because it is overwritten upon start of a jaeger-collector instance, which also results in subsequent rolled over indexes missing that tweaked index definition.

Assuming that the jaeger-collector instance actually does recreate / update the index templates upon start up, it would be super helpful to be able to set a command line parameter to disable this behaviour so that we can retain our index template modifications. Is this something you guys are open to?

Thanks again for an awesome tool and implementation!!! Jaans

pavolloffay commented 2 years ago

I am removing the bug label as this is not a bug but rather a new use case (e.g. using rollover in a way it wasn't designed).

> Assuming that the jaeger-collector instance actually does recreate / update the index templates upon start up, it would be super helpful to be able to set a command line parameter to disable this behaviour so that we can retain our index template modifications. Is this something you guys are open to?

      --es.create-index-templates                                Create index templates at application startup. Set to false when templates are installed manually. (default true)

Please refer to our docs for more information https://www.jaegertracing.io/docs/1.27/cli/

Jaans commented 2 years ago

Thank you @pavolloffay - that helps a lot! Despite trawling through the various ES CLI flags, I failed to find the --es.create-index-templates item, which will help us out big time.

For the moment we don't automatically roll over the jaeger-service-* index (it remains small for our use case), which is a sufficient workaround for not being able to align the cache expiry with rollovers.

Thanks again for your help!