spandex-project / spandex_datadog

A Datadog adapter for the `spandex` library.
MIT License

Add rule_psr and limit_psr metrics to improve trace ingestion rate #45

Closed mrz closed 2 years ago

GregMefford commented 3 years ago

Hey! Thanks for taking the time to research this and make a pull request! ❤️ I started to look into these attributes a bit and it's not clear to me what they mean and how they should be used. From looking at some of the other first-party Datadog libraries, it seems like these would tell Datadog about the sampling rate and rate limiting that is being applied to the traces within the application, so that they can estimate metrics based on traces that they can assume we did not send them. Since Spandex doesn't implement percent sampling or rate-limiting, I think not including these metrics might behave the same as including them with static values of 1.0. Do you know of any documentation that I can refer to about this or a test case that I can set up to demonstrate the current vs. expected behavior?

mrz commented 3 years ago

Hey Greg,

the only "documentation" I have is the feedback we received when we contacted Datadog support to help us figure out this issue. This is the most relevant snippet on the topic:

We've heard back from our engineering team regarding this case and they confirmed that the best path forward would be to set the tags in the appropriate places.

_dd.rule_psr and _dd.limit_psr are what tell the backend that a root span has been processed by a tracer with a sampling rule for its service. They’re needed for a service to appear as configured in the ingestion page.

“Default” means that none of the traces that we received (for this service) had those metrics.
“Partially configured” means that some of the traces had those metrics.
“Configured” means that all the traces had those metrics.
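The three states described above can be sketched as a small classifier. This is only an illustration of the rule as support described it, not Datadog's backend logic: the function name `ingestion_status` and the dict-per-trace input shape are hypothetical, and it assumes a trace "has those metrics" only when both keys are present on its root span.

```python
from typing import Iterable, Mapping

RULE_PSR = "_dd.rule_psr"
LIMIT_PSR = "_dd.limit_psr"


def ingestion_status(root_span_metrics: Iterable[Mapping[str, float]]) -> str:
    """Classify a service from the metrics found on its root spans.

    Each item is the metrics mapping of one trace's root span (hypothetical shape).
    """
    # Assumption: a trace counts only if BOTH metrics are present on its root span.
    flags = [RULE_PSR in m and LIMIT_PSR in m for m in root_span_metrics]
    if not any(flags):
        return "Default"  # none of the traces had the metrics
    if all(flags):
        return "Configured"  # all of the traces had the metrics
    return "Partially configured"  # some, but not all, had them
```

For example, a service whose root spans carry only `_sampling_priority_v1` would fall into "Default" under this sketch.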

Therefore you would want to set _dd.rule_psr and _dd.limit_psr for the root spans to have the ingestion page report the correct configuration numbers.

_sampling_priority_v1 is what tells the Datadog agent to sample the span that’s coming in. This should be set on all of the spans if you want all of them to be ingested.

And as you said,

Since Spandex doesn't implement percent sampling or rate-limiting, I think not including these metrics might behave the same as including them with static values of 1.0

I kind of believe that this fix is just so that Datadog knows that Spandex is sending everything and is able to properly report this in the Ingestion Control page (APM -> Setup & Configuration -> Ingestion Control). Before this change, the service was reporting a very low ingestion rate. After the patch, nothing changed in terms of the actual spans/traces in the service, but in the Ingestion Control page we now have a 100% ingestion rate and a "Fully Configured" tracer configuration.
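To make the idea concrete, here is a rough sketch of the tagging described above. It is not the actual patch from this pull request (which lives in the Elixir adapter); the span shape (plain dicts with `parent_id` and `metrics`), the function name `annotate_trace`, and the priority value of 1 are all assumptions for illustration. The static 1.0 values follow the reasoning earlier in the thread that nothing is being sampled out or rate-limited.

```python
from typing import Any, Dict, List

Span = Dict[str, Any]


def annotate_trace(spans: List[Span]) -> List[Span]:
    """Add the sampling-related metrics to every span of one trace (in place)."""
    for span in spans:
        metrics = span.setdefault("metrics", {})
        # Set on every span so the agent keeps all of them.
        # The value 1 is an assumption; the thread does not state which value is used.
        metrics["_sampling_priority_v1"] = 1
        # Only the root span (no parent) carries the rate metrics.
        if span.get("parent_id") is None:
            metrics["_dd.rule_psr"] = 1.0  # static: no percent sampling applied
            metrics["_dd.limit_psr"] = 1.0  # static: no rate limiting applied
    return spans
```

With a two-span trace, only the span without a parent ends up with `_dd.rule_psr` and `_dd.limit_psr`, while both spans get `_sampling_priority_v1`.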

GregMefford commented 3 years ago

Ah! Thank you, that’s very helpful! I guess since we always set a sampling priority, DD is confused because we are telling them to sample all of them, but aren’t telling them that this represents 100% of the actual traces. 👍