python / cpython

The Python programming language
https://www.python.org/
Other
61.17k stars 29.52k forks source link

importlib does unnecessary os.stat calls checking for namespace __init__.py files #91519

Open novoselov-ab opened 2 years ago

novoselov-ab commented 2 years ago

Overall, importlib is very good at caching and not searching over and over again. For instance, if you add new sys.path it won't traverse previous paths again. Except for one thing: it will check for the existence of __init__.py files on all namespace paths again.

It may sound minor, but on a project with heavy namespace use and a lot of sys.path it means a lot of extra FS access on each sys.path change. On windows it is especially slow.

Repro script:


import tempfile
import sys
import importlib
import os
from pathlib import Path

MODULES = [
    "namespace_1/foo.py",
    "namespace_1/bar.py",
    "namespace_1/xyz.py",
    "namespace_2/foo.py",
    "namespace_2/bar.py",
    "namespace_2/xyz.py",
]

with tempfile.TemporaryDirectory() as root:
    def _create_module(path):
        path = Path(root).joinpath(path)
        path.parent.mkdir(exist_ok=True)
        open(path, "w").close()

    for m in MODULES:
        _create_module(m)

    def my_stat(path):
        print(path)
        return os.stat(path)

    importlib._bootstrap_external._path_stat = my_stat
    sys.path.append(root)

    for i, m in enumerate(MODULES):
        # Critical line: invalidate sys.path cache
        sys.path.append(root + "/{i}")

        module_name = m.replace("/", ".").replace(".py", "")
        print(f"> importing: {module_name}")
        importlib.import_module(module_name)

After each import you will see again and again:

C:\Users\[user]\AppData\Local\Temp\tmpz9rexobt\namespace_2\__init__.cp37-win_amd64.pyd
C:\Users\[user]\AppData\Local\Temp\tmpz9rexobt\namespace_2\__init__.pyd
C:\Users\[user]\AppData\Local\Temp\tmpz9rexobt\namespace_2\__init__.py
C:\Users\[user]\AppData\Local\Temp\tmpz9rexobt\namespace_2\__init__.pyw
C:\Users\[user]\AppData\Local\Temp\tmpz9rexobt\namespace_2\__init__.pyc

Tested on python 3.7, but previously I checked it is the same on the latest one.

zormit commented 12 months ago

tl;dr: Should it be triaged/closed as wontfix?

It seems to me[^1] that this is by design. The relevant code where the repeated os.stat is happening is when loading the spec (after a cache invalidate, as you indicate in your code):

https://github.com/python/cpython/blob/9eeb4b485f4f9e13e5cb638e43a0baecb3ff4e86/Lib/importlib/_bootstrap_external.py#L1624

Your base assumption

Overall, importlib is very good at caching and not searching over and over again.

is only true for already imported modules. These are sitting in sys.module, so we don't even need to touch the file system at that point. It's a whole different level, so not relevant to this issue. We can't infer from this, that the filesystem and everything where we load a module from won't have changed at the point in time we want to load an unloaded module.

[^1]: I'm a beginner in this codebase, so take with a grain of salt

novoselov-ab commented 11 months ago

is only true for already imported modules

Not only. It is ofc great for imported modules because it doesn't even do the search. But when importing it does have other caches, it caches file system access. Look for importlib.invalidate_caches() and sys.path_importer_cache. I don't know code well either, but I think it creates finder once and does all stat calls once. If you call that importlib.invalidate_caches() you would get a lot of the same calls again.

We can't infer from this, that the filesystem and everything where we load a module from won't have changed at the point in time we want to load an unloaded module.

Here I think importlib does a reasonable tradeoff of speed over consistency. If you change your files you have to explicitly invalidate cache, otherwise, it might not notice that. But for 99% of cases where everything is just imported on startup from a static file system, it works faster much much faster. It is very unlikely that in that first second anything is supposed to change on filesystem during the import.

For our big project I added another cache manually, by overwriting stat during the import and it made it much faster. Before the change, each new sys.path was increasing search time, because it kept touching the same files on all sys.path within the same namespace. After the change number of sys.path entries stopped influencing import time that much. Here is an old plot I made back then, it shows import time (Y) vs number of sys.path. entries (X):

image

FFY00 commented 11 months ago

This code was added in https://github.com/python/cpython/commit/c541f8ef40634c379aab0e4537777203a910d355, which just caches listdir.

I think not caching the isfile check was on purpose, as the mtime of the base path doesn't change on updates to subdirectories, hence it cannot be used as a cache key for the __init__.py checks, and I don't know anything else that could be used instead.

If we want to optimize this, we'd probably need to have a shared cache. We should already have all the required information cached in the sys.path_importer_cache finder instances.