`zipimport.invalidate_caches()` implementation causes performance regression for `importlib.invalidate_caches()`

desmondcheongzx commented 1 year ago

In Python 3.10+, an implementation of zipimport.invalidate_caches() was introduced.

An Apache Spark developer recently identified this implementation of zipimport.invalidate_caches() as the source of performance regressions for importlib.invalidate_caches(). They observed that importing only two zipped packages (py4j, and pyspark) slows down the speed of importlib.invalidate_caches() up to 3500%. See the new discussion thread on the original PR where zipimport.invalidate_caches() was introduced for more context.

The reason for this regression is an incorrect design for the API.

Currently in zipimport.invalidate_caches(), the cache of zip files is repopulated at the point of invalidation. This violates the semantics of cache invalidation which should simply clear the cache. Cache repopulation should occur on the next access of files.

There are three relevant events to consider:

The cache is accessed while valid
invalidate_caches() is called
The cache is accessed after being invalidated

Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original PR, we made (1) and (3) fast, but (2) slow. To fix this we can do the following:

Add a boolean flag cache_is_valid that is set to false when invalidate_caches() is called.
In _get_files(), if cache_is_valid is true, use the cache. If cache_is_valid is false, call _read_directory().

This approach avoids any behaviour change introduced in Python 3.10+ and keeps the common path of reading the cache performant, while also shifting the cost of reading the directory out of cache invalidation.

We can go further and consider the fact that we rarely expect zip archives to change. Given this, we can consider adding a new flag to give users the option of disabling implicit invalidation of zipimported libaries when importlib.invalidate_caches() is called.

cc @brettcannon @HyukjinKwon

Linked PRs

gh-103208

arhadthedev commented 1 year ago

@Yhg1s (as a zipimport module expert)

brettcannon commented 1 year ago

Since zipimport.invalidate_caches() was introduced in Python 3.10, I don't think this is a regression but a potential place to improve performance.

desmondcheongzx commented 1 year ago

@brettcannon In my opinion, it might still be considered a regression for importlib (not zipimport), since prior to 3.10, importlib.invalidate_caches would be a no-op on zipimported libraries. Post 3.10, importlib.invalidate_caches seems to slow down significantly more than with a proper implementation of zipimport.invalidate_caches.

The practical difference is whether we backport this change to 3.11. It feels to me that there's value in doing so.

eliyahweinberg commented 2 weeks ago

https://github.com/python/cpython/issues/103200 : Fix performance issues with zipimport.invalidate_caches() ( … 1fb9bd2 https://github.com/python/cpython/pull/103208) Is not merged to python 3.11

brettcannon commented 2 weeks ago

@eliyahweinberg correct, it wasn't backported at all.

python / cpython

`zipimport.invalidate_caches()` implementation causes performance regression for `importlib.invalidate_caches()` #103200

Linked PRs