Open novoselov-ab opened 2 years ago
tl;dr: Should it be triaged/closed as wontfix?
It seems to me[^1] that this is by design. The relevant code where the repeated os.stat
is happening is when loading the spec (after a cache invalidate, as you indicate in your code):
Your base assumption
Overall, importlib is very good at caching and not searching over and over again.
is only true for already imported modules. These are sitting in sys.module
, so we don't even need to touch the file system at that point. It's a whole different level, so not relevant to this issue. We can't infer from this, that the filesystem and everything where we load a module from won't have changed at the point in time we want to load an unloaded module.
[^1]: I'm a beginner in this codebase, so take with a grain of salt
is only true for already imported modules
Not only. It is ofc great for imported modules because it doesn't even do the search. But when importing it does have other caches, it caches file system access. Look for importlib.invalidate_caches()
and sys.path_importer_cache
. I don't know code well either, but I think it creates finder once and does all stat
calls once. If you call that importlib.invalidate_caches()
you would get a lot of the same calls again.
We can't infer from this, that the filesystem and everything where we load a module from won't have changed at the point in time we want to load an unloaded module.
Here I think importlib
does a reasonable tradeoff of speed over consistency. If you change your files you have to explicitly invalidate cache, otherwise, it might not notice that. But for 99% of cases where everything is just imported on startup from a static file system, it works faster much much faster. It is very unlikely that in that first second anything is supposed to change on filesystem during the import.
For our big project I added another cache manually, by overwriting stat during the import and it made it much faster. Before the change, each new sys.path
was increasing search time, because it kept touching the same files on all sys.path
within the same namespace. After the change number of sys.path
entries stopped influencing import time that much. Here is an old plot I made back then, it shows import time (Y) vs number of sys.path
. entries (X):
This code was added in https://github.com/python/cpython/commit/c541f8ef40634c379aab0e4537777203a910d355, which just caches listdir
.
I think not caching the isfile
check was on purpose, as the mtime
of the base path doesn't change on updates to subdirectories, hence it cannot be used as a cache key for the __init__.py
checks, and I don't know anything else that could be used instead.
If we want to optimize this, we'd probably need to have a shared cache. We should already have all the required information cached in the sys.path_importer_cache
finder instances.
Overall, importlib is very good at caching and not searching over and over again. For instance, if you add new
sys.path
it won't traverse previous paths again. Except for one thing: it will check for the existence of__init__.py
files on all namespace paths again.It may sound minor, but on a project with heavy namespace use and a lot of
sys.path
it means a lot of extra FS access on eachsys.path
change. On windows it is especially slow.Repro script:
After each import you will see again and again:
Tested on python 3.7, but previously I checked it is the same on the latest one.