Open codefromthecrypt opened 7 years ago
One of the solutions discussed could be to drop spans. The question is where to do it?
- on the Zipkin server side (with a rule system for example), this could fix the UI issue
By server, I think you mean at query time, right? One tradeoff of dropping at query time is that it assumes the only consumer of the API is the UI, which isn't the case, even though it is the primary consumer. Nested in the attached Google doc is a slight variation, which is to drop or simply collapse (make unrenderable) spans in the client-side JavaScript. This is another option to keep from overloading the UI, and it has the advantage of not requiring a data model change or dropping data.
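To make the "collapse instead of drop" idea concrete, here is a minimal sketch of collapsing at render time. It is written in Python for brevity rather than in the UI's JavaScript, and it is not existing Zipkin code: `collapse_children`, `max_children` and the `collapsed` field are hypothetical names. It only assumes spans are dicts that use the `traceId`, `id` and `parentId` keys.

```python
from collections import defaultdict


def collapse_children(spans, max_children=100):
    """Return a view of the trace with at most max_children spans per parent.

    Hypothetical helper: siblings beyond the limit are replaced by a single
    placeholder entry. The input list is not modified, so no data is dropped.
    Limitation: descendants of hidden spans are left in the output as-is.
    """
    by_parent = defaultdict(list)
    for span in spans:
        by_parent[span.get("parentId")].append(span)

    view = []
    for parent_id, children in by_parent.items():
        view.extend(children[:max_children])
        hidden = children[max_children:]
        if hidden:
            view.append({
                "traceId": hidden[0]["traceId"],
                "parentId": parent_id,
                "id": None,  # placeholder, not a real span
                "name": f"{len(hidden)} spans collapsed",
                "collapsed": len(hidden),  # hypothetical field for the renderer
            })
    return view
```

Because this only changes what gets rendered, other API consumers still see the full trace.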
- on the collector side, this reduces the load but also introduces complexity: how do we know this is a long trace?
To qualify what you've mentioned here, this is where you don't know how many spans will be created in the process (for example, broadcast messaging spans, which fork on receipt). There are scenarios that create a lot of spans in-process, and the local tracer could sample there w/o coordination.
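As a rough illustration of sampling in-process without coordination, the sketch below caps how many local spans one process records per trace. It is an assumption about how such a guard could look, not an API of any existing tracer: `LocalSpanBudget`, `should_record`, `finish_trace` and the default limit are all hypothetical.

```python
import threading


class LocalSpanBudget:
    """Hypothetical per-process guard that caps local spans per trace."""

    def __init__(self, max_local_spans_per_trace=1000):
        self._max = max_local_spans_per_trace
        self._counts = {}  # trace id -> local spans recorded so far
        self._lock = threading.Lock()

    def should_record(self, trace_id):
        """Return True while the trace is under budget, False afterwards."""
        with self._lock:
            count = self._counts.get(trace_id, 0)
            if count >= self._max:
                return False
            self._counts[trace_id] = count + 1
            return True

    def finish_trace(self, trace_id):
        """Forget the counter once the process is done with this trace."""
        with self._lock:
            self._counts.pop(trace_id, None)
```

A tracer would consult `should_record` before starting each local span. This only bounds spans created inside one process; it does not help with the broadcast case, where each consumer is a separate process that would need its own limit or some form of coordination.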
So one way to proceed from here could be to enumerate the different patterns and a strategy for each: for example, 10k spans due to local spans, or broadcast, or RPC, etc. I've created a Google doc here that might help:
https://docs.google.com/document/d/1XkFGflrQP4wF8vqv-veFDE-t-V5iyH5bh9VXRYaOROg/edit
You can also look here for some text about common tracing patterns; a summary of it might help in elaborating. https://drive.google.com/drive/u/0/folders/0B0tSnQT3uGdAUVVUcDA5d21rRWM
Traces that have on the order of thousands of spans can be problematic. They can choke the UI (not just ours) and increase the operating costs of a tracing system. There are a number of scenarios which can result in "the 10k span problem", such as broadcast messaging to an unbounded number of consumers, or buggy traced loops. Some workarounds are easier than others. For example, dropping local spans is easier than trying to coordinate message consumers so that they drop theirs.
This issue should clarify the major scenarios, known workarounds and remedies. Hopefully, it can result in at least documentation and, in the ideal case, in coding practices that defend against this.
Here are some breadcrumbs: