Closed lesteve closed 5 months ago
It is my understanding that the __path__
and __file__
attributes of module objects are implementation details of the import mechanism and should not be relied upon to access package resources. In particular, the __path__
attribute may be an empty list. The only supported way to access package resources is via the importlib.resources
module.
Editable installs as implemented by meson-python use a customized module loader to load the module code from the source directory and the build directory, as needed. A consequence of this, is that we need to make the __path__
attribute of package modules empty, to have the custom loader be called for all imports.
Agreed. For the record, I haven't encountered any issues when switching to importlib.resources
in NumPy/SciPy as far as I remember (aside from the API removals in importlib
for Python 3.12 that require if/else conditional code).
Thanks for your answer, it seems like a side-effect of this is that pkgutil.walk_packages
does not find subpackages.
This is a mild annoyance since we are using pkgutil.walk_packages
to collect some of our tests. We call this kind of tests "common tests", because they check some invariant for all the scikit-learn models (e.g. that if you fit a model on 1d array, you will get an error of a certain kind). See https://github.com/scikit-learn/scikit-learn/pull/28040#issuecomment-1889562642 for more details.
Please refer to the documentation of pkgutil.walk_packages()
. In particular:
Note: Only works for a finder which defines an iter_modules() method. This interface is non-standard, so the module also provides implementations for importlib.machinery.FileFinder and zipimport.zipimporter.
The fact that the custom loader exposes an empty __path__
has nothing to do with this other issue. We could implement this method for the import machinery used by editable installs, but as the interface is not standardized there is no guarantee that this will keep working for future Python releases. Thus, if anything, it is very low priority.
Looking at this a tiny bit more in depth, the interface of iter_modules()
is not even documented. If we want to implement this we would need to base it on what the other module finders implement, rather than on a specification.
OK thanks for your pointers, I'll try to look into it a bit more. I have to admit I am not familiar at all with this kind of subtle import machinery.
I'll note that this usage of pkg_util.walk_packages
works: https://github.com/scipy/scipy/blob/8e9d43838b7db8d40fc4567f3aafeed68466b464/scipy/_lib/tests/test_public_api.py#L241.
That test (pytest scipy/_lib -k test_all_modules_are_expected
) passes for me with an editable install.
It'd be better to avoid __path__
and __file__
completely though, because it's indeed not clear when this will or won't work.
OK weird, just to make 100% sure, can you confirm that scipy.__path__
is not []
then when using meson editable install?
If that's the case, there may be something that is sklearn
specific, I need to have a closer look.
Eh oh, good catch. The test is silently passing, but it is an empty list and hence the test does not work as intended in editable mode:
scipy.__path__: []
scipy.__file__: /Users/rgommers/code/bldscipy/scipy/__init__.py
scipy.__name__: scipy
Eh oh, good catch. The test is silently passing, but it is an empty list and hence the test does not work as intended in editable mode
OK so at least this is not scikit-learn specific :relieved:
Thanks for the feedback @dnicolodi. After a bit of trial and error, here is the only way I could get to write an alternative to pkgutil.walk_packages
that never relies on the __path__
attribute of inspected packages.
from pathlib import Path
from importlib.util import find_spec
from pkgutil import iter_modules
# Alternative to sklearn.__path__ that does not rely on sklearn.__file__:
def get_package_path(package):
spec = find_spec(package)
if spec is None:
raise ImportError(f"No module named {package!r}")
return [Path(spec.origin).parent]
print(f"{get_package_path('sklearn') = }")
# Alternative to pkgutil.walk_packages that does not rely on pkg.__path__:
def walk_packages(path=None, prefix='', onerror=None):
"""Yields ModuleInfo for all modules recursively
on path, or, if path is None, all accessible modules.
'path' should be either None or a list of paths to look for
modules in.
'prefix' is a string to output on the front of every module name
on output.
Note that this function must import all *packages* (NOT all
modules!) on the given path, in order to access the __path__
attribute to find submodules.
'onerror' is a function which gets called with one argument (the
name of the package which was being imported) if any exception
occurs while trying to import a package. If no onerror function is
supplied, ImportErrors are caught and ignored, while all other
exceptions are propagated, terminating the search.
Examples:
# list all modules python can access
walk_packages()
# list all submodules of ctypes
walk_packages(ctypes.__path__, ctypes.__name__+'.')
"""
def seen(p, m={}):
if p in m:
return True
m[p] = True
for info in iter_modules(path, prefix):
yield info
if info.ispkg:
try:
__import__(info.name)
except ImportError:
if onerror is not None:
onerror(info.name)
except Exception:
if onerror is not None:
onerror(info.name)
else:
raise
else:
path = get_package_path(info.name)
# don't traverse path items we've seen before
path = [p for p in path if not seen(p)]
yield from walk_packages(path, info.name+'.', onerror)
for info in walk_packages(get_package_path('sklearn'), 'sklearn.'):
print(info.name)
The walk_packages
function is just a copy of the original pkgutil.walk_packages
in the standard library (https://github.com/python/cpython/blob/d457345bbc6414db0443819290b04a9a4333313d/Lib/pkgutil.py#L39). The only line that I changed is the definition of the local path
variable that relies on my get_package_path
helper to use the loader retrieved by find_spec
instead of using the value found in __path__
.
It seems to fix the problem for scikit-learn when installed in meson-python editable mode, but it's a shame not to be able to use the standard pkg.walk_packages
when using the editable mode.
Shall this be considered a bug in pkgutil.walk_packages
? I am not familiar enough with the package import machinery to tell whether or not its use of the __path__
attribute is unsafe.
By reading more about import machinery in CPython and the source code of mesonpy, I think this is just a missing feature in mesonpy's custom module finder for editable install and the fix is quite straightforward, see #562.
The implementation of get_package_path()
above is a complex way to get to os.path.dirname(__file__)
. The walk_packages()
implementation is still based on pkgutil.iter_modules()
that does not work for module finders that implement loading of modules from anything that is not a filesystem structure that mirrors the logical organization of the modules. The editable install module finder does just that. I haven't had time yet to investigate closely if there is a way to make pkgutil.iter_modules()
work for custom module finders. If there is, it is not documented and it is not part of the interfaces that a standard module finder is supposed to be implemented. This is hinted from the documentation I quoted in a previous comment.
I noticed this when working on moving scikit-learn to Meson. I tried with numpy and get the same result so I am guessing this is not something due to our particular setup?
With editable install (
pip install -e . --no-build-isolation
)When I use normal install (
pip install . --no-build-isolation
), I get something like this:This is not a huge issue since we were able to use
__file__
for our needs, but I thought I would open an issue anyway.