jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
20.18k stars 2.4k forks source link

Support tail-based sampling from OTEL Collector #5867

Closed yurishkuro closed 1 week ago

yurishkuro commented 3 weeks ago

Jaeger v2 can now support tail based sampling that exists in OTEL Collector as an extension.

As a single task it's too large for good-first-issue, but can be done incrementally

mahadzaryab1 commented 3 weeks ago

@yurishkuro I'm interested in working on this but have never contributed to this repository before. Would you be able to guide me on how to tackle this issue?

yurishkuro commented 3 weeks ago

@mahadzaryab1 I would start with

mahadzaryab1 commented 3 weeks ago

sounds good! i'll give this a shot

mahadzaryab1 commented 3 weeks ago

@yurishkuro I got the first item done in https://github.com/jaegertracing/jaeger/pull/5878 and read the blog post. Are there any instructions/examples on how I can can create a sample configuration and docker-compose file. Thank you so much for your time and help!

yurishkuro commented 3 weeks ago

Sample configurations are provided in the blog post. Example of docker compose using Jaeger is in docker-compose/monitor/docker-compose-v2.yml (but you will need to build the Docker image locally so that the code recognizes tail sampling processor, because officially published image will not)

mahadzaryab1 commented 3 weeks ago

@yurishkuro thank you! as a follow-up, i'm trying to build the image locally by running make build from jaeger/docker-compose/monitor but am running into the following error. am i missing a step in between?

+ cd packages/jaeger-ui
./scripts/rebuild-ui.sh: line 35: cd: packages/jaeger-ui: No such file or directory
make[2]: *** [rebuild-ui] Error 1
make[1]: *** [jaeger-ui/packages/jaeger-ui/build/index.html] Error 2
make: *** [build] Error 2
mahadzaryab1 commented 3 weeks ago

@yurishkuro this is how i've modified the docker setup to reproduce tail based sampling - let me know if you have any feedback

yurishkuro commented 3 weeks ago

you need to do initialize & update git submodules in order to access UI dir

mahadzaryab1 commented 3 weeks ago

@yurishkuro awesome! thank you so much. I was able to get the setup going based on the configuration I linked above. Here's some sample output I'm seeing from the different policies that are being evaluated:

jaeger-1          | 2024-08-23T05:14:01.932Z    debug   sampling/string_tag_filter.go:95    Evaluting spans in string-tag filter    {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "string_attribute"}
jaeger-1          | 2024-08-23T05:14:01.932Z    debug   sampling/latency.go:34  Evaluating spans in latency filter  {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "latency"}
jaeger-1          | 2024-08-23T05:14:01.932Z    debug   sampling/probabilistic.go:46    Evaluating spans in probabilistic filter    {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "probabilistic"}
jaeger-1          | 2024-08-23T05:14:01.932Z    debug   sampling/status_code.go:54  Evaluating spans in status code filter  {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "status_code"}
yurishkuro commented 3 weeks ago

I suggest now to think about what e2e integration test could look like for this

mahadzaryab1 commented 3 weeks ago

@yurishkuro will do - thanks for all your help and guidance so far! should the sample configuration/docker compose/readme go in https://github.com/jaegertracing/jaeger/tree/main/examples?

yurishkuro commented 3 weeks ago

docker-compose/tail-sampling

mahadzaryab1 commented 2 weeks ago

@yurishkuro I completed the second task of creating a sample configuration. I'm looking for a bit of guidance on the e2e integration tests. I see that we have some integration tests but those mostly seem to be for the storage collectors. In this case, do we want to maybe test that the traces are being sampled according to the policies in our config? For example, we could test the 'filter-by-attribute' or the 'all-errors' policies by ensuring that those traces are getting captured and the rest are getting filtered out.

yurishkuro commented 2 weeks ago

You can use the tests we have as inspiration, but don't try to fit your test to them. How would you e2e test tail sampling if starting from scratch? Try explaining the whole setup and test process.

mahadzaryab1 commented 2 weeks ago

@yurishkuro I am envisioning something like this:

  1. Set up the jaeger-v2 collector with a tail-sampling processor with a string_attribute policy that matches on a particular tag
  2. Start the load balancing collector and jaeger-v2 collector from the docker-compose setup
  3. Send spans with various attributes to the otel-collector with load balancing
  4. Test that only the tags that match the ones listed in the policy are sampled/stored by the jaeger-v2 collector

Let me know what you think and if you have any feedback!

yurishkuro commented 2 weeks ago

SGTM. A couple thoughts:

BTW, in order to perform this test you do not need load generator running continuously, it's better if you just generate a fixed number of traces. You also need to make sure the storage is purged between A and B.

mahadzaryab1 commented 2 weeks ago

@yurishkuro Thanks for the follow-up. I'm going to try playing around with tracegen to begin with to see what kind of traces I can generate. I'm currently trying to replace mirosim with tracegen to the docker-compose.yml with the following configuration.

  tracegen:
    image: jaegertracing/jaeger-tracegen:latest
    environment:
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
    command: ["-duration", "10s", "-workers", "3", "-pause", "250ms"]
    depends_on:
      - jaeger

This doesn't seem to work and gives me the following error:

tracegen-1        | 2024/08/24 20:42:32 traces export: Post "http://jaeger:4318/v1/traces": dial tcp: lookup jaeger on 127.0.0.11:53: no such host
tracegen-1 exited with code 0

Do you know what I'm missing? Here is the full docker-compose set up if that helps.

yurishkuro commented 2 weeks ago

You're missing network setting on tracegen compose config

mahadzaryab1 commented 2 weeks ago

@yurishkuro thank you so much! As a follow up, I was wondering if you had any guidance (or an example if we're already doing this somewhere) on how I can query the spans that are stored? If we set up a policy on the service name, then we can do an A/B test with and without sampling as you suggested by querying the spans from store.

mahadzaryab1 commented 2 weeks ago

@yurishkuro I've been reading through the storage integration tests. If we add a filter on the service name in our tail sampling processor config, then can we do something similar to https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/integration/integration.go#L136 to get all the services from the traces we have collected (in memory) and ensure that we only have the ones we listed in our policy?

mahadzaryab1 commented 2 weeks ago

As well, do you have any thoughts on where the integration tests should live? It seems like the storage ones are in cmd/jaeger/internal/integration.

yurishkuro commented 2 weeks ago

to get all the services from the traces we have collected (in memory) and ensure that we only have the ones we listed in our policy?

Conceptually yes, but sampling works at the trace level (all spans or nothing), so still need a bit more thought how you'd test them

yurishkuro commented 2 weeks ago

The test needs a new workflow yaml file and a new shell script to orchestrate. Then depending on how you want to do the analysis you might need more - at this point I don't know what you have in mind. If you need go code then cmd/jaeger/internal/integration would be a good place.

mahadzaryab1 commented 2 weeks ago

@yurishkuro Would it not be enough to have a policy in the tail sampling processor that filters out a single service using a string attribute policy, say tracegen-02. We can then use tracegen to generate traces for some number of services, say 5. Then, if our tail sampling processor is working as expected, we would only store tracegen-02 and discard the rest.

yurishkuro commented 2 weeks ago

That's fine as a first version, but it's not quite a robust test because it doesn't prove that service02 was actually generated in the first place. An A/B test wouldn't have that problem.

mahadzaryab1 commented 2 weeks ago

@yurishkuro I see. How about something like?

  1. Start jaeger backend with batch processor
  2. Use tracegen to generate traces for 5 services
  3. Query the data store to see that traces are present for all 5 services
  4. Flush the data store
  5. Start jaeger backend with tail sampling processor that has a policy to only sample service.name=tracegen-02
  6. Query the data store to see that traces are present only for tracegen-02
mahadzaryab1 commented 2 weeks ago

@yurishkuro would you be able to help me with the setup of the test? i'm stuck on the following:

yurishkuro commented 2 weeks ago

It is easiest to query data store if you write Go code in cmd/jaeger/internal/integration, because that's exactly what the tests located there are doing - they are using an RPC implementation of SpanReader which proxies the requests to the query service running in the jaeger-v2 collector.

Load balancer is not necessary since you will only be running a single instance of the collector. The objective of the load balancer is to ensure that all spans for the same trace ID end up in the same instance of the collector that runs tail sampling logic.

mahadzaryab1 commented 2 weeks ago

@yurishkuro Got it! Thank you so much. A couple of follow-ups I had:

yurishkuro commented 2 weeks ago

Yes if you have to write code to read you might as well save whichever traces you need via Span Writer.

The test won't automatically run because those integration tests all require an environment variable to activate, otherwise they will all run at once and most of them fail because storage backend won't be available.

mahadzaryab1 commented 1 week ago

@yurishkuro I've completed the integration test and pushed it to https://github.com/jaegertracing/jaeger/pull/5878. Working on the final README task - should this go in jaeger/docker-compose/tail-sampling?

yurishkuro commented 1 week ago

yes please

mahadzaryab1 commented 1 week ago

@yurishkuro Thanks for all your help and guidance in helping me complete this issue. I learnt a lot about OpenTelemtry, Jaeger, and Distributed Tracing. I'm very excited about this project and would like to keep contributing to it. Do you have any recommendations for what I can pick up next?

yurishkuro commented 1 week ago

@mahadzaryab1 appreciate the help. The top priority is completing the work on Jaeger-v2, which you can see in the project board https://github.com/orgs/jaegertracing/projects/3/views/2