honeybadger-io / honeybadger-ruby

Ruby gem for reporting errors to honeybadger.io
https://docs.honeybadger.io/lib/ruby/
MIT License

feat: implement simple debug backend endpoint for events #513

Closed halfbyte closed 9 months ago

halfbyte commented 11 months ago

This is currently missing a queue; it calls the backend directly from the agent.

Should I implement an events_worker within this PR or in the PR that adds the server backend?

Another question: would it make sense to hand in the timestamp from the agent as an actual DateTime, so that we could delegate the serialisation to the specific backend? It feels like that could be useful.
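To illustrate the idea of delegating timestamp serialisation: the agent would pass a real Time/DateTime object, and each backend would format it as it sees fit. This is a hypothetical sketch; the class and method names below are not the gem's actual API.

```ruby
require "time"

# Hypothetical backends, each owning its own timestamp format.
class DebugBackend
  # A debug backend might prefer a human-readable local representation.
  def serialize_timestamp(ts)
    ts.strftime("%Y-%m-%d %H:%M:%S %z")
  end
end

class ServerBackend
  # A server backend would likely want UTC ISO 8601 for the API.
  def serialize_timestamp(ts)
    ts.utc.iso8601(3)
  end
end
```

The agent then never has to know which string format a given backend expects.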

stympy commented 11 months ago

I'd go ahead and implement a worker in this PR.

joshuap commented 11 months ago

> I'd go ahead and implement a worker in this PR.

I was asking more about the batching. :)

stympy commented 11 months ago

> I was asking more about the batching. :)

Ah, yes, batching now would be fantastic. :)

halfbyte commented 11 months ago

> I was asking more about the batching. :)

> Ah, yes, batching now would be fantastic. :)

If I see this correctly, the notify API does not currently batch, at least in the Ruby implementation? Do we have examples of how other client libraries do the batching? What kind of algorithm should we apply here?

To me, the 2nd option sounds logical, but maybe you have something in mind already, or an example of a similar implementation?

stympy commented 11 months ago

I think batching by watching the batch size and a timeout is the way to go. I'm not sure what numbers would be best, but I guess something like 200 events or 30 seconds would work.

joshuap commented 11 months ago

> I think batching by watching the batch size and a timeout is the way to go. I'm not sure what numbers would be best, but I guess something like 200 events or 30 seconds would work.

Agreed, so option 2. If we want to get fancy we could have a backoff algorithm for the timeout (so it would send the first few events immediately, but then gradually slow down if there are a lot of events coming through). Let's save that optimization for later, though.
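The count-or-timeout decision discussed above (200 events or 30 seconds, both configurable) can be sketched as a small batch container. This is only the decision logic, not the gem's actual worker; all names here are illustrative.

```ruby
# Collects events and flushes when either a size or age threshold is hit.
class EventBatch
  def initialize(max_size: 200, timeout: 30, clock: -> { Time.now })
    @max_size = max_size
    @timeout = timeout
    @clock = clock       # injectable for testing
    @events = []
    @started_at = nil
  end

  # Returns the flushed batch (an Array) when a threshold is hit, else nil.
  def push(event)
    @started_at ||= @clock.call
    @events << event
    flush if full? || expired?
  end

  def flush
    batch, @events, @started_at = @events, [], nil
    batch
  end

  private

  def full?
    @events.size >= @max_size
  end

  def expired?
    @started_at && (@clock.call - @started_at) >= @timeout
  end
end
```

Note that `expired?` here is only checked on `push`, so a quiet period would leave events stranded; a separate timer is still needed to flush an idle batch, which is exactly the gap halfbyte runs into later in the thread.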

stympy commented 11 months ago

Oh, I should mention that we have a max payload size on the API of 5 MB, so total request size could also be a factor for deciding when to send a batch. :)

halfbyte commented 10 months ago

> Oh, I should mention that we have a max payload size on the API of 5 MB, so total request size could also be a factor for deciding when to send a batch. :)

@stympy Does that count for compressed payloads, or for the payload after decompression? If compression plays into this, should I just assume a certain compression factor (which works until a client starts posting MP3s through the events API)?

I also wonder if keeping track of this should actually be a concern of the backend. That would complicate matters, because suddenly we'd need two separate queues with different logic for when to actually commence the sending.

halfbyte commented 10 months ago

I started working on implementing batching, but it is far from finished and untested (I wanted to get the infrastructure in first), so I turned this back to WIP.

stympy commented 10 months ago

> Does that count for compressed payloads, or for the payload after decompression? If compression plays into this, should I just assume a certain compression factor (which works until a client starts posting MP3s through the events API)?

This max is checked before the payload is decompressed.

> I also wonder if keeping track of this should actually be a concern of the backend. That would complicate matters, because suddenly we'd need two separate queues with different logic for when to actually commence the sending.

I don't think you need to worry about this right now... I just wanted to ensure everyone was aware of the limit.
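Since the 5 MB limit is checked against the compressed payload, a worker could deflate the serialized batch and measure the result before sending, rather than guessing at a compression factor. A minimal sketch, assuming a JSON-serialized batch and a deflate-compressed request body (the constant and helper names are made up for illustration):

```ruby
require "json"
require "zlib"

# Assumed API limit from the discussion: 5 MB, applied pre-decompression.
MAX_COMPRESSED_PAYLOAD = 5 * 1024 * 1024

# Returns true if the compressed batch fits under the API limit.
def within_payload_limit?(events)
  body = Zlib.deflate(JSON.generate(events))
  body.bytesize <= MAX_COMPRESSED_PAYLOAD
end
```

Compressing once per flush just to measure is cheap relative to the HTTP request itself, and avoids the "MP3s through the events API" failure mode of a fixed compression-factor estimate.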

subzero10 commented 10 months ago

FYI, I made a first version of this in the JavaScript client. I think the 30-second timeout is too long, though (I put 50ms, but I guess that might be too short), considering we might come up with a live tail feature at some point.

joshuap commented 10 months ago

Hey all, what's next for this and #512? Feel free to request fresh reviews from me when either PR is ready for another look.

halfbyte commented 10 months ago

@joshuap I'll spend the next two days trying to get this reviewable. I'll also take a look at the discussions on #512 to see if there's something that needs to be changed there. I have about 3 days left in terms of budget; my hope is to get at least this one merged into #512.

halfbyte commented 10 months ago

@joshuap Could you please take a look at the EventsWorker? I am not super confident about my multithreading chops, and while this is mostly a duplication of the Worker class, I had to thread in (haha!) a bit of extra functionality to be able to flush the newly introduced send queue (for batching). I removed the code for checking the max payload size for now, as I didn't want to make it even more complicated.

I need to add some additional tests for batching etc.; so far the events_worker_spec more or less just checks the same things as the worker_spec.

halfbyte commented 10 months ago

Ah, I just realized an important part is missing here: sending events after a certain amount of time even if no new events come in. :-/

halfbyte commented 10 months ago

Phew, that took quite a bit of refactoring. Please, everyone, throw some eyeballs at this, as I am quite afraid that there are some deadlocks lurking in this odd piece of code.

I had to add yet another thread to the mix to do the timeout checking. I thought long and hard about possible alternatives, but I think this is the only one that does not remove the benefit of the queue as a scheduling mechanism.

I've replaced my thread variables with heavy usage of Mutex#synchronize, and had to place these carefully to make sure I don't run straight into deadlocks (for example, wrapping the whole send code led to a deadlock when throttling was involved).

If y'all feel that the basis of this is sound, I'll throw a bit of cosmetic refactoring at it, and then maybe we can merge this so that I can tackle the server interaction.
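The "extra thread for timeout checking" idea can be sketched with a Mutex plus ConditionVariable, which sidesteps the deadlock risk of holding a lock while sleeping: `ConditionVariable#wait` releases the mutex during the wait and re-acquires it on wakeup. This is a simplified stand-in, not the actual EventsWorker; the class name and callback shape are assumptions.

```ruby
# Periodically invokes a flush callback; stop() wakes the timer early
# instead of waiting out the full interval.
class FlushTimer
  def initialize(interval, &flush)
    @interval = interval
    @flush = flush
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @stopped = false
    @thread = Thread.new { run }
  end

  def stop
    @mutex.synchronize do
      @stopped = true
      @cond.signal # wake the timer so it exits promptly
    end
    @thread.join
  end

  private

  def run
    @mutex.synchronize do
      until @stopped
        # wait releases the mutex while sleeping, so stop() can acquire
        # it; the mutex is re-acquired before wait returns.
        @cond.wait(@mutex, @interval)
        @flush.call unless @stopped
      end
    end
  end
end
```

Because the timer only holds the mutex briefly between waits, a `stop` from another thread can never deadlock against the sleeping timer, which is the failure mode that makes hand-rolled sleep loops risky.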

stympy commented 10 months ago

This might be useful: https://github.com/ertygiq/msg-batcher

halfbyte commented 10 months ago

@stympy Thanks, I'll have a look. Not sure how much is transferable, but I'll read the code and see if I can learn anything from it for this (I assume you didn't propose using it directly).

stympy commented 10 months ago

> Not sure how much is transferable, but I'll read the code and see if I can learn anything from it for this (I assume you didn't propose using it directly)

I know Josh prefers to keep to zero dependencies, so yeah, unless it would simplify things dramatically to use it directly, I'd say just see what you can learn from it.

halfbyte commented 9 months ago

@joshuap Ah, thanks for catching that. I'll have a quick look to see if there's an obvious solution, but I would rather merge this so I can continue adding the HTTP transport next (which I've been working on already).