openzipkin / openzipkin.github.io

content for https://zipkin.io
https://zipkin.io
Apache License 2.0

Elaborate, document and propose remedies for "the 10k span problem" #80

Open codefromthecrypt opened 7 years ago

codefromthecrypt commented 7 years ago

Traces that have on the order of thousands of spans can be problematic. They can choke the UI (not just ours) and increase the operating costs of a tracing system. A number of scenarios can result in "the 10k span problem", such as broadcast messaging to an unbounded number of consumers, or buggy traced loops. Some workarounds are easier than others. For example, dropping local spans before they are reported is easier than coordinating message consumers so that they drop theirs.

This issue should clarify the major scenarios, known workarounds and remedies. Hopefully, it can result in at least documentation, and ideally in coding practices that defend against this.

Here are some breadcrumbs:

ImFlog commented 7 years ago

One of the solutions discussed could be to drop spans. The question is where to do it?

codefromthecrypt commented 7 years ago

One of the solutions discussed could be to drop spans. The question is where to do it?

  • on the Zipkin server side (with a rule system for example), this could fix the UI issue

By server, I think you mean at query time, right? One tradeoff of dropping at query time is that it assumes the only consumer of the API is the UI, which isn't the case, even though the UI is the primary consumer. Nested in the attached google doc is a slight variation, which is to drop or simply collapse (make unrenderable) spans in the client-side javascript. This is another option to help avoid overloading the UI, and it has the advantage of not requiring a data model change or dropping data.
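
To make the collapse idea concrete, here is a minimal sketch, not what the Zipkin UI does. It assumes spans are plain dicts with Zipkin v2-style `id`, `parentId` and `name` fields. It is written in Python only for brevity, since the real UI code would be JavaScript. The function name, the `max_per_group` threshold and the `collapsedCount` placeholder field are all hypothetical.

```python
from collections import defaultdict


def collapse_leaf_siblings(spans, max_per_group=3):
    """Return spans to render, folding large groups of look-alike leaf spans.

    Leaf spans sharing the same parent and name are grouped. Only the first
    max_per_group of each group are kept; the rest are replaced by a single
    placeholder row. The input list is not modified, so no data is dropped.
    """
    # Any span that is somebody's parent is always rendered, so that no
    # child ends up pointing at a hidden span.
    parent_ids = {s["parentId"] for s in spans if s.get("parentId")}

    rendered = []
    groups = defaultdict(list)
    for span in spans:
        if span["id"] in parent_ids:
            rendered.append(span)
        else:
            groups[(span.get("parentId"), span.get("name"))].append(span)

    for (parent_id, name), members in groups.items():
        rendered.extend(members[:max_per_group])
        hidden = len(members) - max_per_group
        if hidden > 0:
            # Hypothetical placeholder row: not a real span, just a marker
            # the UI could draw as "N more spans like this".
            rendered.append(
                {"parentId": parent_id, "name": name, "collapsedCount": hidden}
            )
    return rendered
```

For a root span with 10 identical children and `max_per_group=3`, this yields the root, three children and one placeholder with `collapsedCount` of 7. Because the stored trace is untouched, other API consumers still see every span.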

  • on the collector side, this reduces the load but also introduces complexity: how do we know this is a long trace?

To qualify what you've mentioned here: this applies when you can't know up front how many spans a trace will produce (for example, broadcast messaging spans, which fork on receipt). Other scenarios create a lot of spans in-process, and there the local tracer could sample without coordination.
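
As a rough illustration of the in-process case, a local tracer could keep a per-trace span budget and stop reporting once it is spent. This is a sketch under assumptions, not an existing Zipkin or Brave API: the `SpanBudget` class, its `should_report` method and both limits are hypothetical, and Python is used only to keep the example short.

```python
import threading
from collections import OrderedDict


class SpanBudget:
    """Hypothetical per-process guard: allow at most N reported spans per trace.

    Counts are kept in a bounded, least-recently-used map so the guard
    itself cannot grow without limit.
    """

    def __init__(self, max_spans_per_trace=1000, max_tracked_traces=10000):
        self.max_spans_per_trace = max_spans_per_trace
        self.max_tracked_traces = max_tracked_traces
        self._counts = OrderedDict()
        self._lock = threading.Lock()

    def should_report(self, trace_id):
        """Return True while the trace is within budget, False afterwards."""
        with self._lock:
            # Pop and re-insert so this trace becomes the most recently used.
            count = self._counts.pop(trace_id, 0)
            if len(self._counts) >= self.max_tracked_traces:
                # Evict the least recently used trace. Its budget resets
                # if it shows up again, which is the price of bounded memory.
                self._counts.popitem(last=False)
            self._counts[trace_id] = count + 1
            return count < self.max_spans_per_trace
```

A reporter would call `should_report(span_trace_id)` before queuing each local span. This needs no coordination with other processes, which is also why it does not help with the broadcast case: each consumer only sees its own small share of the trace.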

codefromthecrypt commented 7 years ago

So one way to proceed from here could be to enumerate the different patterns and a strategy for each. For example, 10k spans due to local spans, or broadcast, or RPC, etc. I've created a google doc here that might help:

https://docs.google.com/document/d/1XkFGflrQP4wF8vqv-veFDE-t-V5iyH5bh9VXRYaOROg/edit

You can also look here for some text about common tracing patterns; a summary of them might be helpful in elaborating. https://drive.google.com/drive/u/0/folders/0B0tSnQT3uGdAUVVUcDA5d21rRWM