openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

backpressure mechanism for zipkin v2 collector api #1741

Open codefromthecrypt opened 7 years ago

codefromthecrypt commented 7 years ago

We routinely hit scenarios where a spike in load overwhelms the storage tier. For example, the Elasticsearch pool is busy and we drop a bunch of spans. If we had a mechanism to know that a collection source (such as RMQ or Kinesis) is buffering, we could choose to push back instead of dropping these spans. Obviously, we should be careful not to put poisonous messages back, etc.
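
To make the push-back vs drop choice concrete, here is a minimal sketch of the decision a collector could make per message. Everything in it is hypothetical and not existing Zipkin code: the `CollectorBackpressure` class, the `Outcome` and `Action` enums, and the `deliveryAttempts`, `maxAttempts`, and `sourceBuffers` parameters are assumptions made for illustration.

```java
/** Hypothetical sketch of a push-back vs drop decision; not part of Zipkin. */
final class CollectorBackpressure {
  /** What happened when the collector tried to store the message. */
  enum Outcome { STORED, STORAGE_BUSY, MALFORMED }

  /** What the collector tells the source (for example RMQ or Kinesis) to do. */
  enum Action { ACK, REQUEUE, DROP }

  /**
   * @param deliveryAttempts how many times this message has been delivered so far
   * @param maxAttempts assumed cap on redelivery, so a message cannot loop forever
   * @param sourceBuffers whether the source can hold the message for later
   */
  static Action decide(
      Outcome outcome, int deliveryAttempts, int maxAttempts, boolean sourceBuffers) {
    switch (outcome) {
      case STORED:
        return Action.ACK;
      case MALFORMED:
        // A poisonous message will never succeed, so it is never put back.
        return Action.DROP;
      default:
        // STORAGE_BUSY: push back only if the source buffers and attempts remain.
        if (!sourceBuffers) return Action.DROP;
        return deliveryAttempts < maxAttempts ? Action.REQUEUE : Action.DROP;
    }
  }
}
```

The attempt cap is one possible guard against poisonous messages that fail for reasons the collector cannot classify up front.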

This issue is for discussing what is possible, especially what is easy to support. For example, I've heard from @xeraa that Elasticsearch Beats has a backoff algorithm used to avoid overwhelming Elasticsearch. Maybe we can look into that.

cc @devinsba @llinder @shakuzen @anuraaga

xeraa commented 7 years ago

Since you are using your own client, this won't be directly usable, but the bulk processor of the Elasticsearch Java client has a configurable BackoffPolicy.
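
For a client that cannot use the bulk processor, the general idea of a backoff policy could be approximated in plain Java. The sketch below is not the Elasticsearch `BackoffPolicy` API; the `ExponentialBackoff` class, its constructor parameters, and the `retry` helper are hypothetical names chosen for illustration.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical exponential backoff sketch; not the Elasticsearch BackoffPolicy. */
final class ExponentialBackoff {
  private final long initialDelayMillis;
  private final long maxDelayMillis;
  private final int maxRetries;

  ExponentialBackoff(long initialDelayMillis, long maxDelayMillis, int maxRetries) {
    this.initialDelayMillis = initialDelayMillis;
    this.maxDelayMillis = maxDelayMillis;
    this.maxRetries = maxRetries;
  }

  /** Delay before retry number {@code attempt} (zero-based): doubles up to the cap. */
  long delayMillis(int attempt) {
    long delay = initialDelayMillis;
    for (int i = 0; i < attempt && delay < maxDelayMillis; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxDelayMillis);
  }

  /**
   * Runs {@code write} until it returns true or retries are exhausted.
   * Returns false if every attempt failed or the thread was interrupted.
   */
  boolean retry(BooleanSupplier write) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      if (write.getAsBoolean()) return true;
      if (attempt == maxRetries) break; // no point sleeping after the last try
      try {
        Thread.sleep(delayMillis(attempt));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }
}
```

A caller would wrap the bulk write in `retry` and treat a `false` result as the signal to push back or drop.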

PS: The new high-level REST Java client has just been released with 5.6.0.

codefromthecrypt commented 6 years ago

As mentioned by @gianarb on the InfluxDB PR, we currently have flusher code in zipkin-reporter-java which attempts to rate limit with a buffer in front (for example, flush on 2000 spans or 1 second, whichever comes first). The collector2 impl is related: the buffer that allows for backpressure is the same one we can use to slow or hasten writes, decoupling them from the incoming stream or HTTP requests, up to a tolerable limit that we are OK dropping if the server dies.
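
A rough sketch of such a buffer, assuming a bounded queue whose `offer` result doubles as the backpressure signal, might look like the following. It is not the zipkin-reporter-java implementation; `BoundedSpanBuffer`, `offer`, and `drain` are hypothetical names, and spans are represented as already-encoded byte arrays for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical size-or-time flush buffer; not the zipkin-reporter-java code. */
final class BoundedSpanBuffer {
  private final BlockingQueue<byte[]> queue;

  BoundedSpanBuffer(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Returns false when the buffer is full, so the caller can push back or drop. */
  boolean offer(byte[] encodedSpan) {
    return queue.offer(encodedSpan);
  }

  /** Collects up to maxSpans, returning early once the timeout elapses. */
  List<byte[]> drain(int maxSpans, long timeout, TimeUnit unit) {
    List<byte[]> batch = new ArrayList<>();
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    try {
      while (batch.size() < maxSpans) {
        long remaining = deadline - System.nanoTime();
        if (remaining <= 0) break;
        byte[] next = queue.poll(remaining, TimeUnit.NANOSECONDS);
        if (next == null) break; // timed out waiting for more spans
        batch.add(next);
        // Take whatever else is already queued without blocking.
        queue.drainTo(batch, maxSpans - batch.size());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // return what was collected so far
    }
    return batch;
  }
}
```

With this shape, a flusher thread would loop on something like `drain(2000, 1, TimeUnit.SECONDS)` and write each non-empty batch to storage, while the capacity bounds how many spans are lost if the server dies.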