vacanza / holidays

Generate and work with holidays in Python
https://pypi.org/project/holidays
MIT License
1.45k stars 460 forks source link

Synchronize entity dynamic imports #1794

Closed Izzette closed 4 months ago

Izzette commented 4 months ago

Proposed change

Perform dynamic imports in registry in dedicated worker thread to avoid deadlocks from concurrent calls to importlib.import_module for the same module from different threads. This may not be sufficient if other code outside of python-holdays is importing these moduels directly. I believe this cpython issue is related https://bugs.python.org/issue38884. This should resolve (through a different implementation) #1514.

Here is a flamegraph of the deadlock in action on one thread: image

My main concern with this PR is that it may introduce a slight performance degradation. I'm not sure adding a test will be feasible, as the deadlock is inherently racy. Making a reliable production in the python-holidays code would require adding a lot of synchronization points within python-holidays to instrument such a test and hammering the importlib.import_module with many threads and than calling importlib.invalidate_caches in a loop until the issue is reproduced.

Type of change

Checklist

Izzette commented 4 months ago

Ok, I've managed to get what I believe is a minimal reproduction, but on my M2 with Python 3.12.3 the deadlock detection in importlib seems to be working correctly, so I don't reproduce a "true" deadlock, but I suspect that this is the same deadlock that is affecting us in production. My code also does some benchmarking, but because the code that loads the module caches it, it doesn't really seem like it's likely to affect anyone ... and on top of it, it's only a degradation of about 8 µs (Average duration 4276.57 ns to Average duration 12868.65 ns).

Here's the deadlock detection in action:

Traceback (most recent call last):
  File "/Users/izzi/src/python-holidays/perf.py", line 125, in main
    t.join(30.0)
  File "/Users/izzi/src/python-holidays/perf.py", line 86, in join
    raise self.error
  File "/Users/izzi/src/python-holidays/perf.py", line 66, in run
    self._run_inner()
  File "/Users/izzi/src/python-holidays/perf.py", line 81, in _run_inner
    country_holidays(vals[1])
  File "/Users/izzi/src/python-holidays/holidays/utils.py", line 188, in country_holidays
    return getattr(holidays, country)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 207, in __call__
    cls = self.get_entity()
          ^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 225, in get_entity
    self.entity = getattr(importlib.import_module(self.module_name), self.entity_name)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/Users/izzi/src/python-holidays/holidays/countries/isle_of_man.py", line 16, in <module>
    from holidays.countries.united_kingdom import UnitedKingdom, UnitedKingdomStaticHolidays
  File "<frozen importlib._bootstrap>", line 1357, in _find_and_load
  File "<frozen importlib._bootstrap>", line 418, in __enter__
  File "<frozen importlib._bootstrap>", line 346, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('holidays.countries.united_kingdom') at 4397028400

Alarmingly I see two other bugs that seem to be race conditions due to the dynamic import:

Traceback (most recent call last):
  File "/Users/izzi/src/python-holidays/perf.py", line 125, in main
    t.join(30.0)
  File "/Users/izzi/src/python-holidays/perf.py", line 86, in join
    raise self.error
  File "/Users/izzi/src/python-holidays/perf.py", line 66, in run
    self._run_inner()
  File "/Users/izzi/src/python-holidays/perf.py", line 81, in _run_inner
    country_holidays(vals[1])
  File "/Users/izzi/src/python-holidays/holidays/utils.py", line 188, in country_holidays
    return getattr(holidays, country)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 207, in __call__
    cls = self.get_entity()
          ^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 225, in get_entity
    self.entity = getattr(importlib.import_module(self.module_name), self.entity_name)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/Users/izzi/src/python-holidays/holidays/countries/__init__.py", line 80, in <module>
    from .isle_of_man import IsleOfMan, IM, IMN
ImportError: cannot import name 'IsleOfMan' from partially initialized module 'holidays.countries.isle_of_man' (most likely due to a circular import) (/Users/izzi/src/python-holidays/holidays/countries/isle_of_man.py)

And:

Traceback (most recent call last):
  File "/Users/izzi/src/python-holidays/perf.py", line 125, in main
    t.join(30.0)
  File "/Users/izzi/src/python-holidays/perf.py", line 86, in join
    raise self.error
  File "/Users/izzi/src/python-holidays/perf.py", line 66, in run
    self._run_inner()
  File "/Users/izzi/src/python-holidays/perf.py", line 81, in _run_inner
    country_holidays(vals[1])
  File "/Users/izzi/src/python-holidays/holidays/utils.py", line 188, in country_holidays
    return getattr(holidays, country)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 207, in __call__
    cls = self.get_entity()
          ^^^^^^^^^^^^^^^^^
  File "/Users/izzi/src/python-holidays/holidays/registry.py", line 225, in get_entity
    self.entity = getattr(importlib.import_module(self.module_name), self.entity_name)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1337, in _find_and_load_unlocked
KeyError: 'holidays.countries'

These errors seem to happen more frequently on the countries that import other countries (Isle of Man, American Samoa, etc.). I'm not surprised to see these errors, as while one thread is importing one module, another may be trying to access the partially initialized module and end up with either this KeyError or an ImportError due to using directly a partially initialized module.

We've also been able to confirm that this PR is working for us in our app when running locally, and doesn't degrade performance to any measurable extent.

Here's the code for the minimal-ish reproduction. I've done some extra tuning with conditions to try to get the reproduction to be more reliable, but of course it's inherently probabilistic. It's kind of messy, but it was super helpful for me to confirm the theory of our bug, and test this fix.

#!/usr/bin/env python

from typing import Optional
from types import TracebackType, FrameType
import time
import sys
import traceback
from threading import Thread, Condition, Lock
import importlib
import random

def bench():
    from holidays.registry import EntityLoader

    iters = 10_000
    start = time.monotonic_ns()
    for _ in range(iters):
        # Raw entity loader performance, but because the `HolidayBase` get cached in `holidays`
        # module, this ultimately should get called only once per country, pre-process, unless this
        # API is used directly.
        EntityLoader("holidays.countries.australia").get_entity()

    end = time.monotonic_ns()
    avg_duration = (end - start) / iters

    print(f"Average duration {round(avg_duration, 2):.2f} ns")

class Counter:
    cond: Condition
    count: int

    def __init__(self):
        self.cond = Condition(Lock())
        self.count = 0

    def inc(self):
        with self.cond:
            self.count += 1
            self.cond.notify_all()

    def wait(self, expected_count: int):
        while True:
            with self.cond:
                if self.count >= expected_count:
                    return

                self.cond.wait()

class Loader(Thread):
    counter: Counter
    cond: Condition
    error: Optional[Exception]

    def __init__(self, counter: Counter, cond: Condition, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.counter = counter
        self.cond = cond
        self.error = None

    def run(self):
        try:
            self._run_inner()
        except Exception as error:
            self.error = error

    def _run_inner(self):
        # Reload the things we need to test for a deadlock
        from holidays.utils import country_holidays
        from holidays.registry import COUNTRIES

        # Sync all threads to get them to execute the entity loader at the same time
        with self.cond:
            self.counter.inc()
            self.cond.wait()

        for vals in random.choices(list(COUNTRIES.values()), k=100):
            country_holidays(vals[1])

    def join(self, *args, **kwargs):
        super().join(*args, **kwargs)
        if not self.is_alive() and self.error is not None:
            raise self.error

# Fake exception class used to format the traceback for a "true" deadlock condition.
class FakeException(BaseException):
    pass

def main():
    bench()

    print("Reproducing deadlock / race conditions (this will take awhile) ...")
    reprod_attempts = 1_000
    reprod_threads = 100

    error_types_seen = set()

    for _ in range(reprod_attempts):
        # Clear the cached modules between each reproduction attempt
        for k in list(sys.modules.keys()):
            if k.split(".")[0] == "holidays":
                del sys.modules[k]

        importlib.invalidate_caches()

        counter = Counter()
        cond = Condition()
        ts: list[Thread] = []
        for _ in range(reprod_threads):
            t = Loader(counter, cond)
            t.start()
            ts.append(t)

        counter.wait(reprod_threads)
        with cond:
            cond.notify_all()

        for t in ts:
            try:
                t.join(30.0)
            except Exception as error:
                error_type = type(error)
                if error_type not in error_types_seen:
                    traceback.print_exception(error)
                    error_types_seen |= {error_type}

            if t.is_alive():
                frame: FrameType = sys._current_frames()[t.ident]
                tb = None
                while frame is not None:
                    tb = TracebackType(
                        tb, frame, frame.f_lasti, frame.f_lineno
                    )
                    frame = frame.f_back

                error = FakeException().with_traceback(tb)
                raise TimeoutError("Deadlock occurred") from error

if __name__ == "__main__":
    main()

Based on the above reproduction code, it seems that this PR fixes all 3 bugs caused by racy-ness in dynamic imports when using multithreading.

coveralls commented 4 months ago

Pull Request Test Coverage Report for Build 9187711103

Details


Totals Coverage Status
Change from base Build 9164625446: 0.0%
Covered Lines: 11274
Relevant Lines: 11274

💛 - Coveralls
Izzette commented 4 months ago

I've updated the code to use a re-entrant lock instead of a thread pool. It has one main advantage over the thread pool strategy: it's re-entry safe. With a thread pool, if the code were ever to be updated so that importing a module using the lazy loader could lazy load another module (I'm not sure why we'd do this), it could deadlock depending on import order when waiting on the future for the work to be executed on the same thread. I don't think this will add any other deadlock potential, and it still passes the minimal reproduction check I shared. For bonus points it reduces the already negligible performance impact of the thread pool strategy to undetectably negilgable (I got Average duration 4621.12 ns with RLock) and it doesn't spawn a new thread, consuming RLIMIT_NPROC limit or the kernel task limit. I think the gopher in me just go carried away ...

arkid15r commented 4 months ago

Hi @Izzette,

Thanks for going above and beyond in order to identify and fix the issue. I have to admit this isn't a kind of PRs we normally receive. I'm going to look into it this weekend and do my best to squeeze the changes into our May 20th release.

Again, I really appreciate the detailed description.

sonarcloud[bot] commented 4 months ago

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

KJhellico commented 4 months ago

LGTM.

PPsyrius commented 4 months ago

LGTM

arkid15r commented 4 months ago

Alright, merging it then.

@Izzette thanks for your contribution!