tweag / FawltyDeps

Python dependency checker
Other
188 stars 14 forks source link

Bug: Dependency to Import mapping on the `qiskit` package does not include the `qiskit` import name #176

Open Nour-Mws opened 1 year ago

Nour-Mws commented 1 year ago

FawltyDeps v0.3.1

When running FawltyDeps (in the customary FawltyDeps env) on the Algorithms project, the lists of unused deps and undeclared imports do not contain the qiskit package as it is both used and declared: Excerpt:

{
          "name": "qiskit",
          "source": {
            "path": "quantum/q_full_adder.py",
            "lineno": 13
          }
}
{
      "name": "qiskit",
      "source": {
        "path": "requirements.txt"
      },
      "import_names": [
        "qiskit"
      ],
      "mapping": "IDENTITY"
    },

As FawltyDeps defaults to identity mapping in this instance, the import and dependency name get matched.

When I tried installing all declared dependencies of Algorithms (+FawltyDeps) in a new environment, I got qiskit in the list of undeclared and unused:

These imports appear to be undeclared dependencies:
- 'django'
- 'mpmath'
- 'pytest'
- 'qiskit'
- 'scipy'
- 'seaborn'
These dependencies appear to be unused (i.e. not imported):
- 'keras'
- 'projectq'
- 'qiskit'
- 'texttable'
- 'yulewalker'

Looking at the json output, I got:

   {
      "name": "qiskit",
      "references": [
        {
          "name": "qiskit",
          "source": {
            "path": "requirements.txt"
          },
          "import_names": [
            "test",
            "setup"
          ],
          "mapping": "DEPENDENCY_TO_IMPORT"
        }
      ]
    },

No qiskit to be seen in the import_names!

When I tried to import qiskit in Python in that venv, however, it worked without problem!

I thought for a second that qiskit might be exposed by other packages in the environment, but in that case it would have been matched and it wouldn't have appeared in the undeclared imports.

I tried importing qiskit in the FawltyDeps-only environment and Python couldn't find it (as expected).

I haven't dug deeper than that. My next step would be to see how Python's import finder managed to find qiskit in the 2nd environment in the first place.

jherland commented 1 year ago

I'm guessing there's something weird about qiskit's metadata causing packages_distributions() to return test and setup, but not qiskit. Maybe this issue is related to #173 in that it's yet another instance of the metadata of Python packages not being 100% precise/accurate, but still "close enough" that it works in practice?

jherland commented 1 year ago

I started digging more into this...

The qiskit package itself:

Why does packages_distributions() return import names setup and test for this package?

Looking at https://github.com/python/importlib_metadata/blob/main/importlib_metadata/__init__.py#L878 packages_distributions() follows a three-level strategies to find the import names that belong to a package:

  1. _top_level_declared(): If the package contains a top_level.txt file, that determines the import names.
  2. _top_level_inferred(): Otherwise, try to infer from the files property of the package, more specifically, look for .py files in the files property, and use those to deduce package/module names that must be importable. So how do we find files? 2.1. _read_files_distinfo(): Files are found by looking at the RECORD file inside the package metadata directory, or: 2.2. _read_files_egginfo(): Files are found by looking at the SOURCES.txt file inside the package metadata directory.

I've searched around various virtualenvs on my machine, looking for top_level.txt, RECORD, and SOURCES.txt files, and AFAICS:

For the record here is the full contents of qiskit's SOURCES.txt:

AUTHORS
LICENSE.txt
README.md
setup.py
qiskit.egg-info/PKG-INFO
qiskit.egg-info/SOURCES.txt
qiskit.egg-info/dependency_links.txt
qiskit.egg-info/requires.txt
qiskit.egg-info/top_level.txt
test/test_metapackage.py
test/test_simulation.py

From here, the code in step 2 above deduces that there are two import names available: setup (from setup.py), and test (from the two .py files under test/).

This list of files is accurate with respect to what's in the qiskit tarball, but not accurate when compared to what ends up installed in site-packages (see the first half of this comment).

At this point in time I'm not sure if the problem is that qiskit's SOURCES.txt is incorrect in terms of what should have been in there, or if the problem is that importlib_metadata has the wrong expectations about what SOURCES.txt files are supposed to contain.

jherland commented 1 year ago

For the record, qiskit's own README.md file supports my impression of it being a meta-package that only exists to automatically install other packages:

Qiskit is made up of elements that work together to enable quantum computing. This is a simple meta-package to install the elements of Qiskit altogether.

jherland commented 1 year ago

continuing my deep dive into the obscure packaging practices of Python...

What is SOURCES.txt and is it a relevant source of information?

I found some relevant documentation in setuptools about the SOURCES.txt file:

Although this metadata is included with distributed eggs, it is not actually used at runtime for any purpose. Its function is to ensure that setuptools-built source distributions can correctly discover what files are part of the project’s source, even if the list had been generated using revision control metadata on the original author’s system.

In other words, SOURCES.txt has little or no runtime value for being included in distributed eggs, and it is possible that future versions of the bdist_egg and install_egg_info commands will strip it before installation or distribution. Therefore, do not rely on its being available outside of an original source directory or source distribution.

In addition, this documentation is buried within a section called Guides on backward compatibility & deprecated practice in the setuptools docs.

From that documentation, it would seem like importlib_metadata is wrong to read SOURCES.txt and use it as a source for runtime-available modules/packages. I'm considering opening an issue against importlib_metadata about its use of SOURCES.txt, but I'm not sure that I have a good suggestion for what importlib_metadata should do instead in this corner case...

Furthermore, the corresponding stdlib module importlib.metadata which has been added in recent Python version has exactly the same problem (in fact, it looks like the stdlib module just copied in the importlib_metadata in commit https://github.com/python/cpython/commit/1bbf7b661f0ac8aac12d5531928d9a85c98ec1a9).

However, as part of this commit that created importlib.metadata there's this test vector: https://github.com/python/cpython/blob/main/Lib/test/test_importlib/fixtures.py#L175 This test example helps give an impression of what importlib_metadata expects a SOURCES.txt to look like:

mod.py
egginfo_pkg.egg-info/top_level.txt

This does indeed follow a similar layout from the SOURCES.txt that ships with qiskit, except that qiskit's only has .py files that are not actually installed, whereas this one mentions only files that are installed.

How easy is it to install a more modern "wheel-style" version of qiskit?

Not very hard it turns out!

I did some experiments with Python v3.10 and the venvs it create by default (containing pip-22.3.1 and setuptools-65.5.0)

qiskit is distritbuted as a source tarball. When we pip install qiskit, the default (and old-style) behavior in a clean venv is to unpack that source tarball in a temporary dir and run python3 setup.by install which delegates via python3 setup.by bdist to python3 setup.by bdist_egg. This creates the "egg" distribution with the problematic SOURCES.txt

However, if the wheel module is available when we run pip install qiskit, we will end up running a slightly different python3 setup.by bdist_wheel action, which builds a more modern qiskit-0.41.1-py3-none-any.whl from the tarball, and then installs that. This wheel has a RECORD file which does not have the same problem as SOURCES.txt:

qiskit-0.41.1.dist-info/AUTHORS,sha256=JOJ8ywsZHCVyC17lmbjgyE7E6qkiXxSz4-4mSLApmK0,8122
qiskit-0.41.1.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
qiskit-0.41.1.dist-info/LICENSE.txt,sha256=YhEPPmD1lMhhZ0900AcoOo7Y3-5EmFkTLhHSjlwrDZY,11416
qiskit-0.41.1.dist-info/METADATA,sha256=XpOLcR95UVUc-9MMb28ryRy9fFqzjR29MrfJH5-y7xA,10211
qiskit-0.41.1.dist-info/RECORD,,
qiskit-0.41.1.dist-info/REQUESTED,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
qiskit-0.41.1.dist-info/WHEEL,sha256=2wepM1nk4DS4eFpYrW1TTqPcoGNfHhhO_i5m4cOimbo,92
qiskit-0.41.1.dist-info/direct_url.json,sha256=U0lPU0NQXuxG0047XS2Tyjf9cEsq2ljM3Vpd4-FIZds,60
qiskit-0.41.1.dist-info/top_level.txt,sha256=AbpHGcgLb-kRsJGnwFEktk7uzpZOCcBY74-YBdrKVGs,1

From this metadata, packages_distributions() will (correctly) deduce that the qiskit package does not contribute any importable modules.

This new-style (and preferable) behavior can also be forced by passing --use-pep517 to pip install. Pip will then auto-install wheel before it proceeds with installing via python3 setup.by bdist_wheel.

Of course, these are all just workarounds: We cannot force our users to install qiskit with --use-pep517, and we cannot force qiskit to build and ship a wheel instead of their current source tarball.

(However, in the future, if FawltyDeps is ever in a position to e.g. install dependencies into its own venv for the purposes of dep-to-import resolution, then we should keep --use-pep517 in mind, as it will likely make our lives considerably easier...)

TL;DR: So, does any of this actually matter in the end?

I will admit to doing this deep-dive partially because I wanted to learn how deep the rabbit hole goes when it comes to packaging in Python. But we also have a real problem to solve here: From FawltyDeps' perspective we have a project that:

FD's analysis currently does not pick up the qiskit-terra and qiskit-ibmq-provider packages at all, as we only look for the declared qiskit package entry. And since this does not actually provide qiskit as an import name, we end up reporting it as an unused dependency.

Furthermore, since no other declared dependency provides qiskit, we also end up with qiskit reported as an undeclared dependency. Fun all around!

In some eyes, the identity mapping actually does a better job here, although that is actually somewhat misleading once you start digging into the details.

However, handling this Correctly™️ involves:

This will take some more thinking. In the meantime we might be better off introducing the identity mapping as not only a fallback mapping, but also a supplementary mapping that is used even when the dependency claims to provide other import names...

jherland commented 1 year ago

FWIW, I've added this comment to a similar issue in the importlib_metadata repo. Thinking about writing a PR for importlib_metadata as well.

jherland commented 1 year ago

Created https://github.com/python/importlib_metadata/pull/437 to have importlib_metadata return better information for qiskit and similar packages.

jherland commented 6 months ago

FWIW, there seems to be a recent (2024-02-15) change in qiskit upstream: Starting from v1.0.0 qiskit now does provide the qiskit import name itself (instead of relying on dependencies to populate the qiskit.* namespace). They mention this in their own release notes, as well as a separate page to document the change and how to upgrade across the v1.0 boundary.

It would seem that the entire qiskit issue simply goes away when qiskit >= v1.0 is used...