reynoldsnlp / pipster

pipster: The pythonic way to `pip install`.
MIT License
8 stars 2 forks source link

redo already_loaded using __spec__.origin #3

Closed reynoldsnlp closed 1 year ago

reynoldsnlp commented 5 years ago

from @ncoghlan here

after the installation operation completes, read the RECORD entries for any just installed modules, and compare them to __spec__.origin and importlib.util.source_fromcache(__spec_\.origin) for all of the modules in sys.modules.values() and emit a warning to stderr suggesting a Python session restart if any of the just installed files is already loaded in the module cache.

ncoghlan commented 5 years ago

Right, as the current implementation runs into the problem that having the PyPI distribution package name be the same as the top level Python import package is only a common convention, rather than a hard requirement, so while there a lot of cases that it will correctly warn about, there are also quite a few that it will miss (e.g. while you install Pillow, you actually import from PIL, since Pillow was originally designed to serve as a drop-in replacement for an older project using the latter name).

When reworking this, there's also a smaller tweak that would be appropriate, which is to print the related warning messages to stderr rather than stdout (there's relatively few cases where that will make a practical difference, but stderr is the more suitable target stream)

reynoldsnlp commented 5 years ago

@ncoghlan What do you mean "read the RECORD entries for any just installed modules"?

Also, source_from_cache isn't as straightforward to use as I was hoping. I ran into the following error, which I assume has to do with PEP 3147 or 488: https://docs.python.org/3/library/importlib.html#importlib.util.source_from_cache. Help?

>>> sys.modules['pip']
<module 'pip' from '/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/__init__.py'>
>>> sys.modules['pip'].__spec__.origin
'/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/__init__.py'
>>> source_from_cache(sys.modules['pip'].__spec__.origin)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap_external>", line 324, in source_from_cache
ValueError: __pycache__ not bottom-level directory in '/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/__init__.py'
ncoghlan commented 5 years ago

The ValueError is reporting that the path you submitted isn't a valid __pycache__ entry, so the function can't infer an original source location from it. To see what the function is expecting as input, run the path you have there through importlib.util.cache_from_source:

>>> from importlib.util import cache_from_source, source_from_cache
>>> pip_path = sys.modules["pip"].__spec__.origin
>>> pip_path
'/usr/lib/python3.7/site-packages/pip/__init__.py'
>>> cache_from_source(pip_path)
'/usr/lib/python3.7/site-packages/pip/__pycache__/__init__.cpython-37.pyc'
>>> source_from_cache(cache_from_source(pip_path))
'/usr/lib/python3.7/site-packages/pip/__init__.py'

I think the actual error is in my initial notes though: I forgot that the source file loader sets origin as the source location (even if the module itself gets initialised from the cached bytecode file), so you should be able to just use __spec__.origin directly and not need to worry about converting it at all.

(There's a separate optional __spec__.cached attribute to let loaders report the case where they've loaded from the cache directory instead of directly from the origin file)

ncoghlan commented 5 years ago

For "reading the RECORD file", I'm referring to the installation database in PEP 376, where pip (and other compliant installers) make a full record of the files installed for each package: https://www.python.org/dev/peps/pep-0376/#record

And unlike sys.modules, that installation database uses distribution package names (the ones that get passed to pip install on the command line) rather than the import package names, so as a first attempt, the CLI wrapper can use the positional arguments to get the list of names that it needs to query. (Ideally it would also check all the dependencies that get upgraded, but that will be a bit tricky when implementing this as a CLI wrapper, so I think it would make sense to defer checking dependency reloading until after pip is providing this functionality natively)

You don't need to read the RECORD file directly, though: pip already includes a vendored copy of distlib (as pip._vendor.distlib), so for pip_inside, you can just depend on distlib and let it do the heavy lifting via https://distlib.readthedocs.io/en/latest/reference.html#InstalledDistribution.list_installed_files:


>>> requested_dist_packages = ["pip", "wheel"] # just an example
>>> import distlib, os.path
>>> dist_path = distlib.database.DistributionPath()
>>> dist_info = [dist_path.get_distribution(dist_name) for dist_name in requested_dist_packages]
>>> file_paths = set(os.path.join(os.path.dirname(project.path), path) for project in dist_info for path, __, __ in project.list_installed_files())
>>> import sys, pip
>>> sys.modules["pip"].__spec__.origin in file_paths
True
reynoldsnlp commented 5 years ago

https://github.com/reynoldsnlp/pip_inside/tree/%233-origin

@ncoghlan dist_path is behaving strangely for me. Works for some and doesn't work for others (It does work for both pip and wheel)

>>> from pip._vendor.distlib.database import DistributionPath
>>> dist_path = DistributionPath()
>>> for d in Path('/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages').glob('*'):
...   print(dist_path.get_distribution(d.stem), d.stem, sep='\t')
...
None    appdirs-1.4.3
None    olefile
None    decorator-4.3.0-py3.6
boto 2.48.0 boto
None    mesa
mistune 0.8.3   mistune
None    Pillow-4.3.0
Flask 0.12.2    flask
selenium 3.7.0  selenium
None    python_dateutil-2.6.1
None    pysptk-0.1.11
None    simplegeneric
tinysegmenter 0.3   tinysegmenter
appnope 0.1.0   appnope
packaging 16.8  packaging
None    simplegeneric-0.8.1-py3.6.egg-info
None    testpath-0.3.1
None    ply
None    tornado-5.1.1
None    thinc-6.10.2
thinc 6.10.2    thinc
...
ncoghlan commented 5 years ago

One big item I noticed after some local experimentation is that distlib.DistributionPath needs to have include_egg=True passed in for it to correctly pick up projects that are still emitting the old pre-PEP-376 egg-info format - by default it will only pick up the newer dist-info format defined in that PEP.

However, there are also a couple of challenges that apply specifically to your current attempted behaviour check:

Applying those three changes gives:

>>> import pathlib
>>> import distlib.database
>>> dist_path = distlib.database.DistributionPath(include_egg=True)
>>> for distpkg in sorted(pathlib.Path("/usr/lib/python3.7/site-packages/").glob("*.*-info")):
...     distname = distpkg.name.partition('-')[0]
...     print(f"{distname:20}{dist_path.get_distribution(distname)}")
... 
IPy                 IPy 0.81
PyGObject           PyGObject 3.30.4
PySocks             PySocks 1.6.8
SSSDConfig          SSSDConfig 2.0.0
asn1crypto          asn1crypto 0.24.0
blivet              blivet 3.1.2
blivet_gui          None
chardet             chardet 3.0.4
cupshelpers         cupshelpers 1.0
decorator           decorator 4.3.0
distro              distro 1.3.0
docutils            docutils 0.14
fros                fros 1.1
humanize            humanize 0.5.1
idna                idna 2.7
iniparse            iniparse 0.4
initial_setup       None
isc                 isc 2.0
langtable           langtable 0.0.40
ntplib              ntplib 0.3.3
olefile             olefile 0.46
ordered_set         None
pid                 pid 2.1.1
pip                 pip 18.1
pluggy              pluggy 0.6.0
ply                 ply 3.9
productmd           productmd 1.18
py                  py 1.5.4
pyOpenSSL           pyOpenSSL 18.0.0
pycparser           pycparser 2.14
pydbus              pydbus 0.6.0
pyinotify           pyinotify 0.9.6
pykickstart         pykickstart 3.16
pyparsing           pyparsing 2.2.0
pystray             pystray 0.14.3
python_augeas       None
python_meh          None
python_xlib         None
pytz                pytz 2018.5
pyudev              pyudev 0.21.0
requests            requests 2.20.0
requests_file       None
requests_ftp        None
sepolicy            sepolicy 1.1
setuptools          setuptools 40.6.3
sh                  sh 1.12.14
simpleline          simpleline 1.4
six                 six 1.11.0
slip                slip 0.6.4
slip.dbus           slip.dbus 0.6.4
sos                 sos 3.6
toml                toml 0.10.0
tox                 tox 3.4.0
urllib3             urllib3 1.24.1
virtualenv          virtualenv 16.0.0
ncoghlan commented 5 years ago

That initial list still had a few None responses in it, and it turned out those where due to the name on disk having _ where the original name has - (to abide by the naming restriction imposed by the fact that metadata directory naming scheme uses - as a field separator).

Reversing that transformation allows distlib to find all the entries in that directory on my system:

>>> import pathlib
>>> import distlib.database
>>> dist_path = distlib.database.DistributionPath(include_egg=True)
>>> for distpkg in sorted(pathlib.Path("/usr/lib/python3.7/site-packages/").glob("*.*-info")):
...     distname = distpkg.name.partition('-')[0].replace('_', '-')
...     print(f"{distname:20}{dist_path.get_distribution(distname)}")
... 
IPy                 IPy 0.81
PyGObject           PyGObject 3.30.4
PySocks             PySocks 1.6.8
SSSDConfig          SSSDConfig 2.0.0
asn1crypto          asn1crypto 0.24.0
blivet              blivet 3.1.2
blivet-gui          blivet-gui 2.1.10
chardet             chardet 3.0.4
cupshelpers         cupshelpers 1.0
decorator           decorator 4.3.0
distro              distro 1.3.0
docutils            docutils 0.14
fros                fros 1.1
humanize            humanize 0.5.1
idna                idna 2.7
iniparse            iniparse 0.4
initial-setup       initial-setup 0.3.62
isc                 isc 2.0
langtable           langtable 0.0.40
ntplib              ntplib 0.3.3
olefile             olefile 0.46
ordered-set         ordered-set 2.0.2
pid                 pid 2.1.1
pip                 pip 18.1
pluggy              pluggy 0.6.0
ply                 ply 3.9
productmd           productmd 1.18
py                  py 1.5.4
pyOpenSSL           pyOpenSSL 18.0.0
pycparser           pycparser 2.14
pydbus              pydbus 0.6.0
pyinotify           pyinotify 0.9.6
pykickstart         pykickstart 3.16
pyparsing           pyparsing 2.2.0
pystray             pystray 0.14.3
python-augeas       python-augeas 0.5.0
python-meh          python-meh 0.47
python-xlib         python-xlib 0.23
pytz                pytz 2018.5
pyudev              pyudev 0.21.0
requests            requests 2.20.0
requests-file       requests-file 1.4.3
requests-ftp        requests-ftp 0.3.1
sepolicy            sepolicy 1.1
setuptools          setuptools 40.6.3
sh                  sh 1.12.14
simpleline          simpleline 1.4
six                 six 1.11.0
slip                slip 0.6.4
slip.dbus           slip.dbus 0.6.4
sos                 sos 3.6
toml                toml 0.10.0
tox                 tox 3.4.0
urllib3             urllib3 1.24.1
virtualenv          virtualenv 16.0.0
reynoldsnlp commented 5 years ago

@ncoghlan, using include_egg=True breaks DistributionPath for me. Could it be connected to the fact that I'm using pip's bundled version of distlib?

>>> from pip._vendor.distlib.database import DistributionPath
>>> dp = DistributionPath()
>>> dp.get_distribution('pip')
<InstalledDistribution 'pip' 9.0.3 at '/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip-9.0.3.dist-info'>

... however ...

>>> dp = DistributionPath(include_egg=True)
>>> dp.get_distribution('pip')
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 727, in __init__
    self._data = json.loads(data)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/database.py", line 237, in get_distribution
    self._generate_cache()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/database.py", line 164, in _generate_cache
    for dist in self._yield_distributions():
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/database.py", line 154, in _yield_distributions
    yield old_dist_class(r.path, self)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/database.py", line 875, in __init__
    metadata = self._get_metadata(path)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/database.py", line 955, in _get_metadata
    metadata = Metadata(path=path, scheme='legacy')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 737, in __init__
    scheme=scheme)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 281, in __init__
    self.read_file(fileobj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 381, in read_file
    self.set_metadata_version()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 287, in set_metadata_version
    self._fields['Metadata-Version'] = _best_version(self._fields)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py", line 165, in _best_version
    raise MetadataConflictError('Unknown metadata set')
pip._vendor.distlib.metadata.MetadataConflictError: Unknown metadata set
reynoldsnlp commented 5 years ago

I just tried installing distlib separately, and it is giving me a different error, regardless of whether I pass it the include_egg=True argument.

>>> from distlib.database import DistributionPath
>>> dp = DistributionPath()
>>> dp.get_distribution('pip')
Traceback (most recent call last):
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/metadata.py", line 730, in __init__
    self._data = json.loads(data)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/database.py", line 240, in get_distribution
    self._generate_cache()
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/database.py", line 167, in _generate_cache
    for dist in self._yield_distributions():
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/database.py", line 148, in _yield_distributions
    metadata = Metadata(fileobj=stream, scheme='legacy')
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/metadata.py", line 740, in __init__
    scheme=scheme)
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/metadata.py", line 283, in __init__
    self.read_file(fileobj)
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/metadata.py", line 381, in read_file
    self.set(field, value)
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/metadata.py", line 476, in set
    if not scheme.is_valid_constraint_list(value):
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/version.py", line 713, in is_valid_constraint_list
    return self.is_valid_matcher('dummy_name (%s)' % s)
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/version.py", line 703, in is_valid_matcher
    self.matcher(s)
  File "/Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/version.py", line 116, in __init__
    '%r constraints' % op)
ValueError: '.*' not allowed for '>=' constraints
ncoghlan commented 5 years ago

I don't what might cause that.

@vsajip, @reynoldsnlp is attempting to use distlib to work out what modules are installed and read their RECORD files so we can use that to print a "this module is already loaded, you may need to restart Python" warning, but he's having trouble getting it to give consistent answers on Mac OS X.

Are the above issues a sign that there's a package with bad metadata on the system, or something else?

vsajip commented 5 years ago

a sign that there's a package with bad metadata on the system

Possibly - it seems to indicate a file that's invalid at the JSON level rather than the level of any schema. It may be that it's trying to interpret a non-JSON file as JSON. Perhaps if @reynoldsnlp could find what the file actually being read contains, we might be able to make progress.

It might also be worth trying the repo version of distlib rather than the last released version (0.2.8). I assume that's the standalone version used above.

ncoghlan commented 5 years ago

@reynoldsnlp It may be worth trying to catch the exception with a debugger to see exactly which file distlib is looking at when it fails. For example, with pdb.pm:

try:
    dp.get_distribution('pip')
except:
    import pdb; pdb.pm()
reynoldsnlp commented 5 years ago

I don't have much time to work on this this morning, but I thought I would at least post output of the debugger. It doesn't give any more information than the Tracebacks above.

using the bundled distlib with include_egg:

>>> from pip._vendor.distlib.database import DistributionPath
>>> dp = DistributionPath(include_egg=True)
>>> try:
...     dp.get_distribution('pip')
... except:
...     import pdb; pdb.pm()
...
> /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip/_vendor/distlib/metadata.py(165)_best_version()
-> raise MetadataConflictError('Unknown metadata set')
>>>

...and using the standalone distlib:

>>> from distlib.database import DistributionPath
>>> dp = DistributionPath()
>>> try:
...     dp.get_distribution('pip')
... except:
...     import pdb; pdb.pm()
...
> /Users/myuser/Library/Python/3.6/lib/python/site-packages/distlib/version.py(116)__init__()
-> '%r constraints' % op)
ncoghlan commented 5 years ago

What you're really looking to find out is the state of the local variables in the _yield_distributions frame at the point the exception gets thrown: https://bitbucket.org/pypa/distlib/src/811516c666ebb80dc5dda95455790d1b9877cfc7/distlib/database.py#lines-148

In particular, metadata_path should point you directly at the metadata entry that's confusing distlib.

As an alternative to finding that in a debugger, another option would be to just edit that file to guard against metadata parsing failures by logging an exception and continuing on to the next item rather than letting the exception escape unhandled:

            try:
                with contextlib.closing(pydist.as_stream()) as stream:
                    metadata = Metadata(fileobj=stream, scheme='legacy')
            except Exception:
                logger.exception("Failed to parse metadata at %r, ignoring distribution", metadata_path)
                continue