scalyr / scalyr-fluentd

The fluentd plugin for inserting log messages and metrics into Scalyr.
Apache License 2.0

Investigate if multiple sources can use the same tag #2

Open imron opened 9 years ago

imron commented 9 years ago

Check to see if multiple sources can use the same tag.

imron commented 9 years ago

It seems yes, but it also seems that not all sources can have a custom tag. This might affect whether or not we need to have a single session for all blocks.

czerwingithub commented 9 years ago

I'm not that versed in Ruby, but is it possible to create some sort of shared static reference to a single session object and have all matching blocks use that to send their messages to Scalyr? Obviously, we would have to worry about threading issues, and it would surely complicate this design.

For now, we are definitely going with a simple implementation to see how far that takes us. We'll see if this causes problems for our users in practice, and we'll address it if there's a great enough need.

imron commented 9 years ago

You can have class variables (i.e. the equivalent of static variables in Java). The code to take care of threading issues is basically already there, because fluentd's BufferedOutput can use multiple threads; it's just a matter of using @@ instead of @ in front of the variables that would be shared by all threads (currently @last_timestamp, @thread_ids and @next_id, plus a mutex) and adding the session id into that mix as well.
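
A minimal sketch of what that could look like — the class name and the next_timestamp method here are illustrative stand-ins, not the actual plugin code:

```ruby
require 'securerandom'

class ScalyrOutput # hypothetical stand-in for the real plugin class
  # Class variables (@@) are shared by every instance and thread of the plugin,
  # unlike the per-instance @variables used now.
  @@session_id     = SecureRandom.uuid
  @@last_timestamp = 0
  @@thread_ids     = {}
  @@next_id        = 0
  @@sync           = Mutex.new

  # Returns a nanosecond timestamp strictly greater than any previously returned.
  def next_timestamp
    @@sync.synchronize do
      now = (Time.now.to_f * 1_000_000_000).to_i
      now = @@last_timestamp + 1 if now <= @@last_timestamp
      @@last_timestamp = now
    end
  end
end
```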

imron commented 9 years ago

After a bit more thought, I'd probably move all the session state into a separate Session class, which would take care of providing thread-safe access to the last timestamp, the map of Scalyr thread_ids, and any other info.

The Session class would be a Ruby singleton (which is thread safe in the sense that there are no data races when it is created/initialised), and the scalyr fluentd plugin would just call out to it to get timestamps and thread_ids.
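
For illustration, a rough sketch of that Session singleton — the method names and details are assumptions rather than the final code:

```ruby
require 'singleton'
require 'securerandom'

class Session
  include Singleton # Singleton#instance is created/initialised in a thread-safe way

  def initialize
    @session_id     = SecureRandom.uuid
    @last_timestamp = 0
    @thread_ids     = {}
    @next_id        = 0
    @sync           = Mutex.new
  end

  attr_reader :session_id

  # Thread-safe, strictly increasing nanosecond timestamps.
  def next_timestamp
    @sync.synchronize do
      now = (Time.now.to_f * 1_000_000_000).to_i
      now = @last_timestamp + 1 if now <= @last_timestamp
      @last_timestamp = now
    end
  end

  # Maps a fluentd tag to a stable Scalyr thread id, assigning ids lazily.
  def thread_id(tag)
    @sync.synchronize do
      @thread_ids[tag] ||= (@next_id += 1)
    end
  end
end

# The plugin would then just call out to it, e.g.:
#   Session.instance.next_timestamp
#   Session.instance.thread_id(tag)
```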

czerwingithub commented 9 years ago

Ok, let's go ahead and do that whenever you have time. It seems worthwhile to me to take this approach early, and it doesn't seem like much work on your end.

imron commented 9 years ago

Ok, this is now done; however, there is one potential issue I wanted to check on.

The Scalyr API docs state that:

so the ts field must be strictly increasing — each event must have a larger timestamp than the preceding event

Which is all well and good, but what happens when we have multiple blocks and multiple threads and we encounter a situation like this:

Thread 1 -> Start processing 100 events
Thread 1 -> Timeslice is over, only 50 events processed, last timestamp is 1000 (pretend this is 1970)
Thread 2 -> Start processing 100 events
Thread 2 -> Timeslice is over, only 50 events processed, last timestamp is 2000
Thread 1 -> Process remaining events and send, events range from 0-1000 and 2001-3000
Thread 2 -> Process remaining events and send, events range from 1001-2000 and 3001-4000

From here we can see that for the batch of events sent by Thread 2, even though the timestamps were strictly increasing when they were assigned, by the time we send the entire batch the server has already received a different batch for the same session that includes timestamps higher than the lowest value of the current batch. E.g. Thread 2's batch contains a timestamp of 1001, but the server has already received a batch of events from Thread 1 that contains 2001.

So the timestamps for each event are unique across all threads and strictly increasing at the moment each event is created. By the time the events reach the server, however, it may already have received events with a later timestamp.

If that is going to be an issue, the only way I can see around this is to hold the mutex for the entire event processing and sending, which is possible, but has performance implications that negate the point of having multiple threads in the first place.

czerwingithub commented 9 years ago

There are two general approaches to avoid the problems you mention. If either of these is too much to implement, or if you feel like they would cost too much time, then let's not do it for now and leave it as a TODO. The current version will work well enough and this is just fine-tuning.

  1. Have only one thread (call it thread A) be responsible for sending the events to the server. All other threads just enqueue events to be sent. Thread A should also be responsible for assigning timestamps to all events before it sends them to the server. This way, you only have to have a lock that protects the queue of outgoing events (see the sketch after this list).
  2. Each thread can still send the events to the server, but have them assign the timestamps to the events only right before sending all of them to the server. You still have to hold the lock while you serialize the JSON ... and there's still a problem if the thread loses its time slice between when it has serialized the JSON and when it sends it to the server. You really should hold a lock (in this model) while you issue the request ... if you can do that in a non-blocking fashion, that would be great (i.e. if you can have a send that only blocks to enqueue the bytes on the socket, but doesn't block for them to actually be sent or for a response to be received).
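
For reference, a rough sketch of what approach 1 could look like — the queue and the next_timestamp / send_to_scalyr helpers below are hypothetical stand-ins, not part of the plugin:

```ruby
require 'thread' # Queue; not strictly needed on newer Rubies

# Hypothetical helpers, standing in for the plugin's real timestamp/request code.
$last_ts = 0
def next_timestamp
  now = (Time.now.to_f * 1_000_000_000).to_i
  now = $last_ts + 1 if now <= $last_ts
  $last_ts = now # only the single sender thread calls this, so no lock needed here
end

def send_to_scalyr(batch)
  puts "sending #{batch.size} events" # placeholder for the real HTTP request
end

EVENT_QUEUE = Queue.new # Queue is thread-safe; the only shared state workers touch

# Thread A: drains the queue and assigns timestamps just before sending,
# so events always reach the server in strictly increasing timestamp order.
Thread.new do
  loop do
    batch = [EVENT_QUEUE.pop]                          # block until an event arrives
    batch << EVENT_QUEUE.pop until EVENT_QUEUE.empty?  # drain whatever else is queued
    batch.each { |event| event[:ts] = next_timestamp }
    send_to_scalyr(batch)
  end
end

# Any worker thread can enqueue without further locking:
EVENT_QUEUE << { attrs: { message: 'hello' } }
sleep 0.1 # give the sender a moment, just for this toy example
```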

Anyway, unless you say otherwise, it feels to me that making this perfect is going to cost too much work for now and should just be done after the syslog work and when more customers are using the fluentd plugin.

imron commented 9 years ago

Option 1 would work, but then we'd lose all the backoff/retry functionality from fluentd and would need to implement that ourselves. We'd also then be holding all the events in memory (unless we wanted to do something more complex), whereas fluentd backs the buffered output with a file by default, so we could potentially lose events if there was an error that shut down the process (e.g. something caused by an out-of-memory error).

With option 2, using the standard Ruby HTTP code we'd have to hold the lock until we get the response back (there's no way to split the request from the response). Blocking for the entire request/response could cause issues if there's a network problem and the request stalls for several tens of seconds before timing out (the fluentd buffers might start to fill up).

There are other HTTP libs that support asynchronous request/responses, however I'm not sure how well maintained they are (especially with respect to older versions) or whether you want to introduce a dependency on a third-party lib.

How big of a problem is it if the scenario I mentioned in my previous comment happens? I've already pushed the code that has a single session across all threads, so if it's an issue I'd need to revert that back to the previous commit which has a session per thread.

czerwingithub commented 9 years ago

Ugh, ok, sorry for the trouble on this. Yes, please go ahead and revert your change. If we do receive events with lower timestamps than what was previously accepted, those new events will be dropped and recorded in the database.

Let's leave this issue open and come back when we need to fix it. It could also be that when we design the new API, we could make this problem go away.

czerwingithub commented 9 years ago

(I meant "... will be dropped and NOT recorded in the database".)

imron commented 9 years ago

Ok. The change has been reverted on master, and I've added a new branch, single-session, which has all the single-session code. I'll leave this issue open.

imron commented 9 years ago

Hah, need to make sure I don't click 'close and comment'.