ionelmc / python-redis-lock

Lock context manager implemented via redis SET NX EX and BLPOP.
https://pypi.python.org/pypi/python-redis-lock
BSD 2-Clause "Simplified" License
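
For context, the SET NX EX + BLPOP pattern named above amounts to roughly the following (a plain redis-py sketch; the key names, the token scheme and the non-atomic release are illustrative assumptions, not the library's actual internals):

    import os
    import redis

    conn = redis.StrictRedis()

    NAME = "example"
    LOCK_KEY = "lock:" + NAME            # holds the owner token while locked
    SIGNAL_KEY = "lock-signal:" + NAME   # waiters BLPOP on this to get woken up

    def acquire(expire=60):
        token = os.urandom(16).hex()
        while True:
            # SET NX EX: create the key only if it does not exist, with a TTL.
            if conn.set(LOCK_KEY, token, nx=True, ex=expire):
                return token
            # Lock is taken: block until a release pushes to the signal list
            # (or the timeout passes and we retry, in case the holder crashed).
            conn.blpop(SIGNAL_KEY, timeout=expire)

    def release(token):
        # Only delete if we still own the lock. This check-and-delete is not
        # atomic here; the real library uses a Lua script for that.
        if conn.get(LOCK_KEY) == token.encode():
            conn.delete(LOCK_KEY)
            conn.lpush(SIGNAL_KEY, 1)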

Handle case where background thread dies due to out of memory #92

Open baragona opened 2 years ago

baragona commented 2 years ago

I suspect I hit a case where an out-of-memory condition on Linux triggered the OOM killer, and that it picked off a background refresher thread from this library.

After the OOM condition was resolved, many processes appeared to be holding the same locks, which ought to be impossible (unless the background refresher died, that is).

Note, I'm not completely sure this is how it happened.

Does anyone know whether the Linux OOM killer will behave this way, and whether Python will behave this way (letting the rest of the process continue if a thread dies)?

I'm not sure how one would even solve this in this library though - how can you react to a thread crashing without using another thread to check on it? That wouldn't solve anything.

ionelmc commented 2 years ago

The OOM killer will simply send a SIGKILL to processes that use too much memory. That means the process is aborted abruptly.

Threads share the same address space as the main thread so everything gets killed.

Perhaps your actual problem is the background refresher not working properly. Are you using anything that can affect how python threads work (like eventlet or gevent)?
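
For reference, the background refresher in question is the renewal thread that gets started when a lock is created with auto_renewal=True - roughly this usage (the lock name and the 60s expire are just placeholders):

    import time

    import redis
    import redis_lock

    conn = redis.StrictRedis()

    # auto_renewal=True starts a daemon thread that keeps extending the lock's
    # TTL, so the lock should only expire if the whole process goes away (or
    # the renewal thread cannot run for some reason).
    with redis_lock.Lock(conn, "reported-lock-name", expire=60, auto_renewal=True):
        time.sleep(120)  # stand-in for work that outlives the initial expire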

baragona commented 2 years ago

Thanks for the info! Not using anything like eventlet or gevent here.

I looked at the error logging and I do see a bunch of these when the problem occurred:

        raise NotAcquired("Lock %s is not acquired or it already expired." % self._name)

So maybe an alternate theory is that if a refresher gets blocked for too long (due to high load or swapping), some other process may grab the lock.

It might be nice to detect this and optionally raise an error in the thread that thinks it holds a lock but doesn't.
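
Something along those lines can already be approximated from application code (a rough sketch; it assumes the key naming convention lock:<name> and passes an explicit id= so the owner token is known to the caller - both internal/illustrative details, not a supported API):

    import os

    import redis
    import redis_lock

    conn = redis.StrictRedis()

    class LockLost(Exception):
        """We thought we held the lock, but the key no longer carries our id."""

    def still_held(name, my_id):
        # Assumes the owner id is stored under "lock:<name>" (internal detail).
        current = conn.get("lock:" + name)
        return current is not None and current == my_id.encode()

    my_id = os.urandom(16).hex()
    lock = redis_lock.Lock(conn, "reported-lock-name", expire=60,
                           auto_renewal=True, id=my_id)
    with lock:
        # ... a chunk of the critical section ...
        if not still_held("reported-lock-name", my_id):
            raise LockLost("lock was lost mid-critical-section")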

ionelmc commented 2 years ago

Well, in theory there could be some platform-specific handling (on Linux, for example, another thread could check that the lock is still valid and send SIGALRM to the main thread to interrupt it and raise the exception you want), but there are so many corner cases and problems that it's not worth doing at all.
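
To make those corner cases concrete, that approach would look roughly like the sketch below (lock_still_valid() is a hypothetical callback; note the handler only fires while the main thread is executing Python bytecode, and an unlucky SIGALRM can interrupt almost any statement, which is part of why this is unattractive):

    import os
    import signal
    import threading
    import time

    class LockLost(Exception):
        pass

    def _on_alarm(signum, frame):
        # Signal handlers run in the main thread, so this interrupts whatever
        # statement the main thread happens to be executing.
        raise LockLost("lock no longer held by this process")

    signal.signal(signal.SIGALRM, _on_alarm)  # must be installed from the main thread

    def watchdog(lock_still_valid, interval=1.0):
        # lock_still_valid() is a hypothetical check, e.g. comparing the Redis
        # key's value against our own id as sketched earlier.
        while True:
            time.sleep(interval)
            if not lock_still_valid():
                os.kill(os.getpid(), signal.SIGALRM)
                return

    # After acquiring the lock, the watchdog would be started in a thread:
    # threading.Thread(target=watchdog, args=(check_fn,), daemon=True).start()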

We should discuss what could eventually lead to a reproducer instead. Describe your code in more detail. Does it really swap a lot? Do you run code that holds the GIL (and doesn't allow the renewal thread to actually run)? Do you have any connection failures? What do you have in the logs? Perhaps configure logging to enable redis-lock's DEBUG-level logging.
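
For the logging part, something like this should surface what the renewal thread is doing (assuming the library's loggers live under the redis_lock name, i.e. the usual logging.getLogger(__name__) convention):

    import logging

    # Root at WARNING keeps other libraries quiet; redis_lock at DEBUG shows
    # acquire/refresh/release activity, and %(threadName)s makes it easy to
    # see whether the renewal thread is actually running.
    logging.basicConfig(
        level=logging.WARNING,
        format="%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s",
    )
    logging.getLogger("redis_lock").setLevel(logging.DEBUG)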