open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.07k stars 2.37k forks source link

tailsamplingprocessor ratelimiting works incorrectly with loadbalancingexporter #1629

Open pkositsyn opened 3 years ago

pkositsyn commented 3 years ago

Describe the bug I have been thinking about ratelimiting on sampling and this seems to work not as intended right now. I did no tests, but this is a totally idiomatic issue.

Imagine you want to sample traces with span rate limit 1000 spans / s and usually you get 1 trace / s with 500 spans. Having one collector, you save the trace and everything is fine. Scaling up to 10 collectors, you have to make sure, that the trace is still sampled and total rate limit is the same for this kind of trace (scaling could be done due to whatsoever problems). Leaving the old 1000 spans / s limit will result in 10_000 spans / s limit among all collectors. Cutting the limit to 100 spans / s for each collector will result in sampling no traces. Thus, there is either my misunderstanding of the horizontal scaling as a way to increase all resources in an equal proportion or an actual problem, treating this case as a possible one.

Seems like this problem is unsolvable without at least knowledge about number of backends, because the behaviour of single collector should be different for one and many backend cases.

Still, could there be something like a config parameter for number of backends to apply different behaviour? Anyway, I assume the solution to be rather complicated and maybe heuristic for this approach without synchronization (as it is postulated in collector to avoid external communication)

What did you expect to see? I expect to see an explicit configuration of collector according to the needs. If it is expected to see max 1000 spans / s for some filter given many collector backends, there should be a clear way to approach this.

Setting 1000 / numCollectors rate limit is firstly not the same as it is shown in the example above and secondly involves manual calculation and setting new numbers in the whole tailsamplingprocessor config.

Note that there is no same issue about probability based sampling.

github-actions[bot] commented 2 years ago

This issue has been inactive for 60 days. It will be closed in 120 days if there is no activity.