
Logger race condition - loses lines if removeHandler called from another thread while logging #79366

Open · ben-spiller opened 5 years ago

ben-spiller commented 5 years ago
BPO 35185
Nosy @vsajip, @ben-spiller
Files
  • logger-race.py
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['3.7', 'type-bug', 'library']
    title = 'Logger race condition - loses lines if removeHandler called from another thread while logging'
    updated_at =
    user = 'https://github.com/ben-spiller'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'vinay.sajip'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'benspiller'
    dependencies = []
    files = ['47914']
    hgrepos = []
    issue_num = 35185
    keywords = []
    message_count = 6.0
    messages = ['329429', '346058', '346119', '346135', '346139', '346143']
    nosy_count = 2.0
    nosy_names = ['vinay.sajip', 'benspiller']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35185'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7']
    ```

    ben-spiller commented 5 years ago

    I just came across a fairly serious thread-safety / race condition bug in the logging.Logger class: random log lines can be lost, i.e. not passed to some of the registered handlers, if other (unrelated) handlers are being added or removed via add/removeHandler from another thread during logging. This potentially affects all log handler classes, though for timing reasons I've found it easiest to reproduce with logging.FileHandler.

    See attached test program that reproduces this.
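
    The attached file isn't shown here, but a minimal sketch of the kind of program that triggers the race might look like this (file names and counts are illustrative, not taken from the attachment):

    ```python
    import logging
    import threading

    logger = logging.getLogger("race-demo")
    logger.setLevel(logging.DEBUG)
    keep = logging.FileHandler("keep.log")  # the handler whose output we check
    logger.addHandler(keep)

    stop = False

    def churn():
        # Repeatedly add/remove an unrelated handler while the main thread logs.
        while not stop:
            extra = logging.FileHandler("extra.log")
            logger.addHandler(extra)
            logger.removeHandler(extra)
            extra.close()

    churner = threading.Thread(target=churn)
    churner.start()
    for i in range(10000):
        logger.debug("line %d", i)  # some lines may never reach keep.log
    stop = True
    churner.join()
    ```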

    I did some debugging, and it looks like although add/removeHandler are protected by _acquireLock(), they modify the self.handlers list in place, while the callHandlers method iterates over self.handlers with no locking - so if you're unlucky, some of your handlers don't get called.

    A trivial way to fix the bug is by editing callHandlers and copying the list before iterating over it:
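
    (The snippet wasn't preserved in the migration; presumably it amounts to this one-line change in the loop inside Logger.callHandlers, where c is the logger currently being visited in the hierarchy walk:)

    ```python
    # Iterate over a copy, so a concurrent add/removeHandler
    # can't change the list mid-loop.
    for hdlr in list(c.handlers):  # was: for hdlr in c.handlers:
        found = found + 1
        if record.levelno >= hdlr.level:
            hdlr.handle(record)
    ```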

    However, since that could affect the performance of routine log statements, a better fix is probably to change the implementation of add/removeHandler to avoid in-place modification of self.handlers, so that (as a result of the GIL) it'll be safe to iterate over the list in callHandlers, e.g. change removeHandler like this:

    - self.handlers.remove(hdlr)
    + newhandlers = list(self.handlers)
    + newhandlers.remove(hdlr)
    + self.handlers = hdlr

    (and the equivalent in addHandler)

    vsajip commented 5 years ago

    The other alternative would be to lock around callHandlers(). With the change you propose to addHandler/removeHandler, there are no errors, but the output of your test program is (with the lock acquisition/release in place):

    Thread finished after 468 iterations
    Thread finished after 488 iterations
    Thread finished after 476 iterations
    Thread finished after 462 iterations

    If the lock acquisition/release is removed from addHandler/removeHandler, then there are still no errors, and the output is:

    Thread finished after 479 iterations
    Thread finished after 453 iterations
    Thread finished after 468 iterations
    Thread finished after 469 iterations

    If I leave addHandler/removeHandler as they were and add locking around callHandlers(), there are no errors and the output is:

    Thread finished after 566 iterations
    Thread finished after 608 iterations
    Thread finished after 612 iterations
    Thread finished after 605 iterations

    This seems to suggest that locking around callHandlers() is better (more performant) than the copying of handler lists involved in your suggestion. Any comments on that? Also it will work in non-GIL environments like Jython/IronPython.

    The callHandlers() locking will of course add a cost even in situations where handlers aren't added and removed across threads, but I haven't benchmarked it yet. It's generally not good practice to add/remove handlers willy-nilly in threads, though of course it's not forbidden. For example, the configuration functionality allows a thread to listen for configuration changes at runtime and then reconfigure logging, which adds/removes handlers in that thread; however, loggers are disabled while the reconfiguration is in progress, and some loss of messages would be expected to be tolerable in such a case.
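
    A sketch of the locked variant described above, assuming the module-level _acquireLock()/_releaseLock() helpers from Lib/logging/__init__.py (the same lock that addHandler/removeHandler take):

    ```python
    class Logger(Filterer):
        def handle(self, record):
            # Serialise callHandlers against concurrent add/removeHandler.
            if (not self.disabled) and self.filter(record):
                _acquireLock()
                try:
                    self.callHandlers(record)
                finally:
                    _releaseLock()
    ```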

    f0cc6a08-f8fc-4d78-b1be-3c701a5c623a commented 5 years ago

    I'd definitely suggest we go for a solution that doesn't hit the performance of normal logging when you're not adding/removing things, since that's the more common case. I guess that's the reason why callHandlers was originally implemented without grabbing the mutex, and we should probably keep it that way. Logging can be a performance-critical part of some applications, and I feel more comfortable about the fix (and more confident it won't get vetoed :)) if we can avoid changing callHandlers().

    You make a good point about ensuring the solution works for non-GIL python versions. I thought about it some more... correct me if I'm wrong but as far as I can see the second idea I suggested should do that, i.e.
    - self.handlers.remove(hdlr)
    + newhandlers = list(self.handlers)
    + newhandlers.remove(hdlr)
    + self.handlers = hdlr

    ... which effectively changes the model so that the _value_ of the self.handlers list is immutable (only which list the self.handlers reference points to changes). So, without relying on any GIL locking, callHandlers will see either the old list or the new list, but never an inconsistent value, since no such list ever exists. That solves the read-write race condition; we'd still want to keep the existing locking in add/removeHandler, which prevents write-write race conditions.

    What do you think?

    vsajip commented 5 years ago

    I'd definitely suggest we go for a solution that doesn't hit performance of normal logging

    I agree, but correctness is important. I'm tending to the following:

    1. Introduce a lockCallHandling module-level variable, defaulting to False to maximise logging performance (and follow current behaviour) and settable to True for situations such as your example, where adding/removing handlers from threads is wanted.

    2. In Logger.handle, add an extra check for lockCallHandling to decide whether to lock/release around callHandlers() (sketched below).
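
    A sketch of that proposal (lockCallHandling is the proposed name, not an existing attribute of the logging module; the rest is as in Lib/logging/__init__.py):

    ```python
    lockCallHandling = False  # module level; set True to serialise handler calls

    class Logger(Filterer):
        def handle(self, record):
            if (not self.disabled) and self.filter(record):
                if lockCallHandling:
                    _acquireLock()
                    try:
                        self.callHandlers(record)
                    finally:
                        _releaseLock()
                else:
                    self.callHandlers(record)
    ```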

    BTW a simple benchmark of the performance of locking around callHandlers vs. not locking, using a FileHandler and default StreamHandler, showed an average difference of ~480 nsec per iteration (mean time of 47.310 secs unlocked vs. 47.784 locked, for a million iterations of logger.debug() - a 1% increase).
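
    (For reference, a rough sketch of that kind of measurement - handler setup and iteration count as described above, exact timing approach assumed:)

    ```python
    import logging
    import time

    logger = logging.getLogger("bench")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.FileHandler("bench.log"))
    logger.addHandler(logging.StreamHandler())

    start = time.monotonic()
    for i in range(1000000):
        logger.debug("benchmark message %d", i)
    elapsed = time.monotonic() - start
    print("%.3f secs total, %.0f nsec/iteration" % (elapsed, elapsed / 1e6 * 1e9))
    ```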

    the second idea I suggested should do that

    Yes, but the copying seems to make things slower, as suggested by the output of your script (in terms of the iteration counts in a fixed amount of time).

    ben-spiller commented 5 years ago

    Interesting conversation. :)

    Yes, I agree correctness is definitely the top priority. :) I'd go further and say I'd prefer correctness to always be there automatically, rather than something the user must remember to enable by setting a flag such as lockCallHandling. (As an aside, adding a separate extra code path and option like that would require rather more documentation and testing changes than just fixing the bug by making the self.handlers list immutable, which is a small and simple change needing no extra documentation.)

    I'm also not convinced it's worth optimizing the performance of add/removeHandler (which sounds like the goal of the callHandlers-locking approach you suggest, if I'm understanding correctly?), since in a realistic application that's always going to be vastly less frequent than invoking callHandlers. Especially if it reduces the performance of the main logging, which is invoked much more often. Though admittedly the 1% regression you quoted isn't so bad (assuming that holds across CPython/IronPython/Jython/others). The test program I provided is a contrived way of quickly reproducing the race condition, but I certainly wouldn't use it for measuring or optimizing performance, as it wasn't designed for that - the ratio of add/removeHandler calls to callHandlers calls is likely to be unrepresentative of a real application, and there's vastly more contention on add/removeHandler than you'd see in the wild.

    Do you see any downsides to the immutable self.handlers approach, other than the performance of add/removeHandler being a little lower?

    Personally I think we're on safer ground if we permit add/removeHandler to be slightly slower (but at least correct! correctness trumps performance), provided we avoid regressing the more important performance of logging itself.

    Does that seem reasonable?

    vsajip commented 5 years ago

    I'd prefer correctness to be always there automatically, rather than something the user must remember to enable by setting a flag such as lockCallHandling

    Yes, I agree, I was just looking at different approaches while mulling all this over. Correctness-by-default can of course always be implemented by setting lockCallHandling to True by default, which is what I have currently in my benchmark testing.

    Re. your change:

    - self.handlers.remove(hdlr)
    + newhandlers = list(self.handlers)
    + newhandlers.remove(hdlr)
    + self.handlers = hdlr

    Did you mean self.handlers = newhandlers in that last line? That's what I assumed you meant, but you've posted that twice now, so I'm not sure what you're getting at.
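
    Assuming you did mean self.handlers = newhandlers, the copy-on-write removeHandler would read roughly like this (the locking is the existing module-level lock; only the final assignment differs from the posted diff):

    ```python
    def removeHandler(self, hdlr):
        """
        Sketch: remove the specified handler without mutating the list
        that callHandlers may be iterating over in another thread.
        """
        _acquireLock()
        try:
            if hdlr in self.handlers:
                newhandlers = list(self.handlers)
                newhandlers.remove(hdlr)
                self.handlers = newhandlers  # swap in the new list in one step
        finally:
            _releaseLock()
    ```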

    I agree that having a slower add/removeHandler is acceptable - certainly better than slowing down Logger.handle(). But I'm not sure about

    so without relying on any GIL locking callHandlers will still see the old list or the new list

    on platforms other than CPython, since we can't be sure where thread pre-emption might happen (i.e. it could happen part-way through a single source line), and locking at least has the benefit of defined semantics.

    The other issue is where someone might have subclassed and overridden add/removeHandler methods, for whatever reason. I know it's unlikely, but one never knows.

    I think we're broadly on the same page. I'm just not yet sure what the best approach is :-)

    lisasgoh commented 1 month ago

    Hey, I'm trying to implement thread-specific logging where each thread logs to a different file. One of the solutions I tried was to set up a log handler at the start of thread execution, with a filter that matches the thread ID. Before the thread terminates, the handler is removed, but this creates a concurrency issue: it ends up removing other handlers erroneously, so certain log lines don't get emitted. Is a fix being tracked for this, or are there better ways of going about this?
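
    Roughly what the setup looks like (a sketch with illustrative names, not my actual code):

    ```python
    import logging
    import threading

    root = logging.getLogger()
    root.setLevel(logging.INFO)

    def worker():
        tid = threading.get_ident()
        handler = logging.FileHandler("thread-%d.log" % tid)
        # Only pass records created by this thread to this handler.
        handler.addFilter(lambda record: record.thread == tid)
        root.addHandler(handler)
        try:
            root.info("work happening in thread %d", tid)
        finally:
            root.removeHandler(handler)  # races against other threads logging
            handler.close()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ```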

    vsajip commented 1 month ago

    Not sure why removing one handler would remove other handlers - that might be a problem with how you've set things up. Creating a log file per thread sounds like an antipattern to me (not saying it definitely is, as I don't know your exact scenario) - it's easy enough to add the thread ID to your log statements and separate them out using post-processing, or even just via grep. Here is a link to a 2010 blog post which features filtering in a multi-threaded environment. The example filters by a web application ID rather than thread ID, but it should be adaptable to your use case if that's what you need.
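
    For instance, something along these lines, using the standard LogRecord attributes:

    ```python
    import logging

    logging.basicConfig(
        filename="app.log",
        level=logging.DEBUG,
        format="%(asctime)s %(threadName)s [%(thread)d] %(levelname)s %(message)s",
    )

    # Afterwards, one thread's lines can be pulled out of the shared file, e.g.:
    #   grep 'Thread-3' app.log > thread-3.log
    ```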

    lisasgoh commented 1 month ago

    Thanks for the link! The solution there makes sense, but my use case is slightly different - there's somewhat limited time to process the logs for each thread/request, so I thought it'd be safer to have each thread log to a different file, so the files can be processed in parallel, rather than a single file, which would be slower.

    Regarding how removing one handler would affect other handlers: it seems to be the same issue as the original one, of random log lines being lost, i.e. not passed to some of the registered handlers, when other (unrelated) handlers are added/removed via add/removeHandler from another thread during logging - because when I changed the iteration to use list(c.handlers), the tests consistently pass, whereas they flaked previously. I think it's because the handlers are attached to the root logger, so the list of handlers is shared across all threads, and when it gets modified (when a thread finishes execution, the corresponding handler gets removed) while the other threads are iterating over the list, the iterator index gets messed up and handlers get skipped.
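
    A tiny illustration of that skipping effect, with a plain list standing in for logger.handlers:

    ```python
    handlers = ["A", "B", "C"]
    for h in handlers:             # callHandlers iterating
        print(h)
        if h == "A":
            handlers.remove("A")   # another thread removes a handler
    # prints A then C - B is skipped because the iterator's index shifted
    ```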

    Another way would be to attach a single handler to the root logger that keeps a dictionary mapping each thread to its corresponding file handler, and have it delegate log records to the correct handler. What do you think about this?
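
    A sketch of that idea (class name and details hypothetical):

    ```python
    import logging
    import threading

    class ThreadDispatchHandler(logging.Handler):
        """Route each record to a per-thread FileHandler, created on demand."""

        def __init__(self):
            super().__init__()
            self._handlers = {}  # thread id -> FileHandler
            self._map_lock = threading.Lock()

        def emit(self, record):
            with self._map_lock:
                h = self._handlers.get(record.thread)
                if h is None:
                    h = logging.FileHandler("thread-%d.log" % record.thread)
                    self._handlers[record.thread] = h
            h.handle(record)  # applies the file handler's own lock and filters

        def close(self):
            with self._map_lock:
                for h in self._handlers.values():
                    h.close()
            super().close()

    # Attach once; no add/removeHandler churn as threads come and go.
    logging.getLogger().addHandler(ThreadDispatchHandler())
    ```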

    vsajip commented 1 month ago

    so that they can be processed in parallel

    There's no actual parallelism due to the GIL, so I'm not sure what you mean. I can't look at your case in detail, as I have little time at the moment, but a simple solution is better than a more complicated one. Have you actually benchmarked the "one handler for all threads" vs. "one handler per thread" case?