wavefrontHQ / wavefront-sdk-go

Wavefront Core Go SDK
Apache License 2.0
4 stars 21 forks source link

DirectSender.Flush() not performing a full flush #25

Open prydin opened 5 years ago

prydin commented 5 years ago

When DirectSender.Flush() is called, it calls Flush() on the underlying LineHandlers. LineHandler.Flush() only performs a flush of a single batch, leaving the rest of the buffer in memory. Typically, Flush() on something means that you'd block until all data is written to some backend storage. I suggest we change the code to call FlushAll() on the LineHandlers instead, since this would be more in line with typical semantics of a Flush() call.

If we are worried about flooding Wavefront with points, we may want to offer a FlushPartial() or FlushCurrentBatch(), but, in my opinion, it's confusing to have Flush() do something that's not inline with what programmers typically expect.

Even if we don't change the behavior of Flush() the senders need to offer a FlushAll(). This is crucial if you have a data collector that doesn't run continuously, e.g. something that's started from cron, does its job and then dies. Such a program needs a way to ensure that everything is written to the backend before it exits.

LukeWinikates commented 1 year ago

I would like to go ahead and make this change.

I am thinking:

LukeWinikates commented 1 year ago

additional thoughts from talking with @priyaselvaganesan and looking at the equivalent code in the Java SDK.

It looks like the flush() method in the Java SDK may also be only flushing batches, not performing a full flush.

We wonder about the safety of performing a full flush. What if the SDK is sending to a proxy that is already overwhelmed? Will the proxy throttle those requests? Will it create backpressure? Would we want flush to block indefinitely if the proxy is making us wait? Or would this not happen because the SDK buffer can only be so big anyway?

What if the flush takes a really long time? Should there be a timeout?

Thinking about renaming Flush to FlushBatch, with the current semantics, and possibly adding a full Flush option later (possibly with a context parameter so we can enforce a timeout).

LukeWinikates commented 12 months ago

Another thought:

I'm pretty sure Flush() will re-queue failed metrics that it thinks will be sendable in future batches. So if we naively made FlushAll()by repeatedly calling the current Flush() until the buffer was empty, I think we could get into an infinite loop.

Maybe failed lines need a retry limit. Maybe FlushAll() has a different implementation from Flush(), and doesn't re-queue metrics that failed to send.