spandex-project / spandex

A platform agnostic tracing library
MIT License
333 stars 53 forks

Trace batching #37

Closed aspett closed 6 years ago

aspett commented 6 years ago

While testing distributed tracing, I noticed that traces were only getting sent once the threshold number of traces had been stored up. This means that if you have very low traffic for a period of time (i.e. the number of requests is less than batch_size), the traces won't get sent out.

I can understand that one solution to this is probably to tune that number to be lower (if not 1? eek), but one alternative I had in mind was some sort of timeout that triggers sending batches.

Keen on your thoughts around such an idea?
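To make the idea concrete, here is a rough sketch of a buffer that flushes either when `batch_size` is reached or when a timeout fires, whichever comes first. All module, option, and function names here are hypothetical; Spandex's actual sender internals may look quite different:

```elixir
defmodule TraceBuffer do
  @moduledoc """
  Hypothetical sketch: buffers traces and flushes either when
  `batch_size` is reached or when `flush_timeout_ms` elapses,
  whichever comes first. Not Spandex's actual implementation.
  """
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def store(trace), do: GenServer.cast(__MODULE__, {:store, trace})

  @impl true
  def init(opts) do
    state = %{
      traces: [],
      batch_size: Keyword.get(opts, :batch_size, 10),
      flush_timeout_ms: Keyword.get(opts, :flush_timeout_ms, 5_000),
      # The sender is injected so the buffering logic stays adapter agnostic.
      sender: Keyword.get(opts, :sender, fn _traces -> :ok end)
    }

    schedule_flush(state)
    {:ok, state}
  end

  @impl true
  def handle_cast({:store, trace}, state) do
    traces = [trace | state.traces]

    if length(traces) >= state.batch_size do
      # Batch threshold hit: send immediately, in arrival order.
      state.sender.(Enum.reverse(traces))
      {:noreply, %{state | traces: []}}
    else
      {:noreply, %{state | traces: traces}}
    end
  end

  @impl true
  def handle_info(:flush, state) do
    # Timeout fired: send whatever we have, even a partial batch.
    if state.traces != [], do: state.sender.(Enum.reverse(state.traces))
    schedule_flush(state)
    {:noreply, %{state | traces: []}}
  end

  defp schedule_flush(state) do
    Process.send_after(self(), :flush, state.flush_timeout_ms)
  end
end
```

The `Process.send_after/3` + `handle_info(:flush, ...)` pattern is the standard way to get periodic work out of a GenServer without an extra timer process.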

zachdaniel commented 6 years ago

I think that something like that could be implemented. Might be best to just send all spans at a configured interval, as a simple first implementation.

zachdaniel commented 6 years ago

I'd welcome a PR for that, as I've had similar concerns

zachdaniel commented 6 years ago

@aspett I've got some incoming work that I think is going to change quite a bit of code. I don't think that it will affect the trace sender really, just wanted to let you know :D

aspett commented 6 years ago

Sweet as. Thanks for the heads-up

aspett commented 6 years ago

One thing to take into consideration is that the agent currently expects to receive traces within 10 seconds: https://github.com/DataDog/datadog-trace-agent/blob/79c31f7cd1ebbb7edb3f24028c70b3724204eab6/cmd/trace-agent/agent.go#L216

zachdaniel commented 6 years ago

I had no idea. I'm actually going to be working on breaking out the datadog sender and span logic into a different repo (haven't started yet, I'm working on configs first). I guess we just haven't noticed because we hit our flush interval relatively quickly.

zachdaniel commented 6 years ago

I think native sampling as a strategy for not creating too much load will be the fix for that really. I still have some reading to do, but I think the idea is that you set your sample rate to something that just won't affect your system, and send any traces pretty much immediately. To do it right though, there will have to be rules/match patterns for traces, such that you could for instance say "never sample errors" and stuff like that. I think the config changes I'm working on might dovetail neatly into that though.
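The rules/match-patterns idea could look something like the following sketch. This is purely illustrative (the `Sampler` module, rule shape, and trace map are all made up for this example, not Spandex's API): rules are checked first, and only traces matching no rule fall through to the probabilistic sample rate.

```elixir
defmodule Sampler do
  @moduledoc """
  Illustrative rule-based sampler, not Spandex's actual API.
  Each rule is `{match_fun, action}` where action is
  `:keep`, `:drop`, or `{:rate, float}`.
  """

  def keep?(trace, rules, default_rate) do
    case Enum.find(rules, fn {match?, _action} -> match?.(trace) end) do
      {_match, :keep} -> true
      {_match, :drop} -> false
      {_match, {:rate, rate}} -> :rand.uniform() < rate
      # No rule matched: fall back to the global sample rate.
      nil -> :rand.uniform() < default_rate
    end
  end
end

# "Never sample out errors": a rule that always keeps error traces,
# regardless of how low the default rate is.
rules = [{fn trace -> trace.error? end, :keep}]
Sampler.keep?(%{error?: true}, rules, 0.01)
# => true
```

With a setup like this, a very low default rate keeps load down while the `:keep` rule guarantees error traces are always sent.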

aspett commented 6 years ago

Okay. What sort of timeline do you have for doing that sampling? I'm just about finished with a very simple implementation of having a max interval between pushes.

zachdaniel commented 6 years ago

It's hard to say. The sampling for me isn't as high a priority as a few other things. My order of events (at least for now) is:

1.) Changing how the library is configured
2.) Adapter agnostic trace/span structs
3.) An ETS storage option for trace data
4.) Breaking out the adapters
5.) Some updates to the trace API for an adapterless use case that is being worked on by someone else
6.) Sampling

I generally work pretty quickly, but I don't want to give timelines that I can't commit to as this is a side project. I will try and get to breaking out the adapters in the next couple of days. Once that is done, if I can't get to sampling fast enough for your needs, you could add it to that project separately by just forking the adapter.

zachdaniel commented 6 years ago

@aspett I think the best way to do this instead of configuring a max interval would be just to configure a flush interval, and flush all spans every 1-5 seconds or something along those lines. That should be a pretty simple implementation, and doesn't offer behavior it can't really guarantee, since it just says "every 5 seconds, I'll get rid of all spans I know about". Sampling will come soon :D
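From a user's point of view, the flush-interval approach would amount to one extra config knob alongside the existing batch size. A hypothetical config fragment (these option names are made up to illustrate the shape, not Spandex's actual config keys):

```elixir
# Illustrative only: flush whatever spans are buffered every 5 seconds,
# in addition to flushing whenever batch_size traces have accumulated.
# Key names are hypothetical, not Spandex's real configuration.
config :spandex, :datadog,
  batch_size: 10,
  flush_interval_ms: 5_000
```

Keeping the interval comfortably under the agent's 10-second window (mentioned above) would avoid traces arriving too late to be accepted.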

aspett commented 6 years ago

No problem. I think we'll be able to survive using a batch size of 1 until sampling comes :)

zachdaniel commented 6 years ago

Closing this in favor of sampling, which is on its way.