python / importlib_metadata

Library to access metadata for Python packages
https://importlib-metadata.readthedocs.io
Apache License 2.0
122 stars 79 forks source link

`__pycache__` returned as an import name from `packages_distributions()` #456

Closed jherland closed 7 months ago

jherland commented 1 year ago

Related to #442 (metadata directories appearing in packages_distributions).

Some packages include pre-compiled .pyc files inside __pycache__ directories. When no top_level.txt is present, the RECORD file will be used to infer top-level import names, and for some packages it will contain entries like this (example taken from lib/python3.10/site-packages/typing_extensions-4.4.0.dist-info/RECORD in my Poetry virtualenv):

__pycache__/typing_extensions.cpython-310.pyc,,

This entry is included in files() and is then further interpreted by _top_level_inferred() as a __pycache__ import name being provided by this distribution/package.

The code probably needs an exception for __pycache__ directories. Maybe this is simple enough?

diff --git a/importlib_metadata/__init__.py b/importlib_metadata/__init__.py
index 76ccacb..e59d686 100644
--- a/importlib_metadata/__init__.py
+++ b/importlib_metadata/__init__.py
@@ -991,6 +991,6 @@ def _top_level_inferred(dist):

     @pass_none
     def importable_name(name):
-        return '.' not in name
+        return '.' not in name and name != "__pycache__"

     return filter(importable_name, opt_names)
jaraco commented 1 year ago

Hmm. I agree - that's a little annoying that the .pyc file is showing up in the RECORD file, but that sounds like a bug in the builder. If a project legitimately wishes to install a __pycache__ into site-packages, it will be importable, so it probably should show up as a top-level module installed by that distribution.

 draft $ mkdir __pycache__
 draft $ py -q
>>> import __pycache__
>>> __pycache__.__path__
_NamespacePath(['/Users/jaraco/draft/__pycache__', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/__pycache__', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/__pycache__'])

You may be right that it makes sense to explicitly exclude __pycache__, as it's commonly going to exist in sys.path even when it wasn't installed by a package. But as long as it's explicitly mentioned in the RECORD, I'm tempted to say that importlib metadata should honor that record.

Is there a legitimate reason for the __pycache__ entry to be present in RECORD?

jaraco commented 7 months ago

Since the directory does appear to be importable and since the package is explicitly reporting it as a top-level directory installed by the package, I'm going to say importlib_metadata is doing the right thing here and should be reflecting the presence of that top level name. I'd like to avoid special cases where possible.

Feel free to advocate for making a case for excluding that name. In particular, I'd like to know how the presence of this name causes harm (beyond a little surprise).

jherland commented 7 months ago

Your argument makes sense to me from the POV of importlib_metadata. I agree that it rather sounds like a bug in the builder.

The context I'm coming from is doing dependency checking where we need to be able to map from an import name we find in the user's code to a package name to be found in the user's declared dependencies (e.g. requirements.txt, pyproject.toml or similar). In this case, a package reporting __pycache__ as an "extra" import name is probably not going to create much of a problem for us, as we don't expect the user to import __pycache__ in their code. If this does end up becoming problematic, we can easily enough special-case it in our use of importlib_metadata.

Thanks for your consideration!

layday commented 7 months ago

I doubt this has anything to do with the builder - pip generates pyc files on install as per the wheel spec, which it appends to the RECORD so that it will know to uninstall them. Because typing_extensions is not a package but a single-file module, its __pycache__ is at the top level, which is why it is being returned by packages_distributions.