lmmx / impscan

Command line tool to identify minimal imports list and repository sources by parsing package dependency trees
MIT License

Discuss: conda package listings resolution #7

Open lmmx opened 3 years ago

lmmx commented 3 years ago

Need:

For conda I suspect this will suffice (no need to inspect wheel itself)

May need to 'sniff' each PyPI package (or at least one?) as described here

Note that what you call a package here is not a package but a distribution. A distribution can contain zero or more modules or packages. That means there is no one-to-one mapping of distributions to packages.

wheel packages have since been invented! Since a wheel is simply a zip file that gets extracted into the lib/site-packages directory, an examination of the contents of the wheel archive can give you the top level imports.

>>> import zipfile
>>> zf = zipfile.ZipFile('setuptools-35.0.2-py2.py3-none-any.whl')
>>> top_level = set([x.split('/')[0] for x in zf.namelist()])
>>> # filter out the .dist-info directory
>>> top_level = [x for x in top_level if not x.endswith('.dist-info')]
>>> top_level 
['setuptools', 'pkg_resources', 'easy_install.py']
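
A minimal sketch of the same wheel inspection wrapped in a function, run against a small in-memory archive so that nothing has to be downloaded. The function name `top_level_names` and the `demo` distribution are made up for illustration. Skipping `.data` directories and stripping the `.py` suffix from single-file modules are assumptions about what should count as an import name, not something the quoted answer does.

```python
import io
import zipfile


def top_level_names(wheel_file):
    """Return the sorted top-level import names found in a wheel archive."""
    with zipfile.ZipFile(wheel_file) as zf:
        first_parts = {name.split("/")[0] for name in zf.namelist()}
    names = set()
    for part in first_parts:
        # Metadata and data directories are not importable
        if part.endswith((".dist-info", ".data")):
            continue
        # A single-file module such as easy_install.py is imported without its suffix
        names.add(part[:-3] if part.endswith(".py") else part)
    return sorted(names)


# Build a toy wheel in memory (hypothetical "demo" distribution)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo/__init__.py", "")
    zf.writestr("demo/core.py", "")
    zf.writestr("demo_cli.py", "")
    zf.writestr("demo-1.0.dist-info/METADATA", "Name: demo\n")
buf.seek(0)

print(top_level_names(buf))  # ['demo', 'demo_cli']
```

One distribution yields two import names here, which is the "no one-to-one mapping" point from the quote.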
lmmx commented 3 years ago

When parsing conda search listings [equivalent to index.json]

lmmx commented 3 years ago

Not all packages use hardlinked paths, some must resort to copying (source)

Where conda fails to create a hard link, it may fall back to either a symlink or a copy. Hardlinks may fail due to permissions error, or because the destination is on a different volume than the package cache. Hard links only work within a volume. Pay special attention to how your folders are mounted, as the fallback to copying is a big speed hit.

...but if this is only a failure case then it should(?) be possible to identify package import names from this info alone
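
To make the quoted fallback behaviour concrete, here is a rough sketch of "hard link if possible, otherwise copy", together with a check of which one happened. This is not conda's code: `link_or_copy` is a hypothetical helper, and the symlink fallback mentioned in the quote is left out for brevity.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Try to hard-link src to dst; fall back to copying. Returns the method used."""
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        # e.g. a permissions error, or dst is on a different volume than src
        shutil.copy2(src, dst)
        return "copy"


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "cached.py")
    dst = os.path.join(tmp, "installed.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    method = link_or_copy(src, dst)
    # A hard link refers to the same file as its source; a copy does not
    same_file = os.path.samefile(src, dst)
    print(method, same_file)
```

Either way the installed path is the same, so the path listing should give the same import names whether the file was linked or copied.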

lmmx commented 3 years ago

As an alternative to tar.bz2, some packages are shipped as an uncompressed outer zip (renamed to .conda) containing 2 internal .zst tarballs: one named info... (containing metadata) and one named pkg... (source)

On second thoughts:

On closer inspection, the pyzstd.decompress function does not delineate files(!) and although it’s not hard to figure out where the paths.json starts, it’d be cleaner to use it in the structured way zipfile and tarfile allow.

from pyzstd import ZstdFile
import requests
import zipfile
import tarfile
import io
import json

url = "https://repo.anaconda.com/pkgs/main/linux-64/requests-2.22.0-py37_1.conda"

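# Download the whole archive into memory
b = requests.get(url).content
z = zipfile.ZipFile(io.BytesIO(b))
# Pick the metadata tarball by name rather than by its position in the zip
info_zst = next(n for n in z.namelist() if n.startswith("info"))
zz = z.read(info_zst)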

class ZstdTarFile(tarfile.TarFile):
    def __init__(self, name, mode='r', *, level_or_option=None, zstd_dict=None, **kwargs):
        self.zstd_file = ZstdFile(name, mode,
                                  level_or_option=level_or_option,
                                  zstd_dict=zstd_dict)
        try:
            super().__init__(fileobj=self.zstd_file, mode=mode, **kwargs)
        except:
            self.zstd_file.close()
            raise

    def close(self):
        super().close()
        self.zstd_file.close()

zstd_tar = ZstdTarFile(io.BytesIO(zz))
zstd_files = zstd_tar.getnames()
pj = "info/paths.json"
r = zstd_tar.extractfile(pj)
j = json.load(r)
site_pkgs = set()

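for d in j["paths"]:
    dp = d["_path"]
    # Only keep files installed under site-packages (otherwise "" would be added)
    _, sep, suffix = dp.partition("/site-packages/")
    if sep:
        site_pkgs.add(suffix.split("/")[0])

# Sort so the output order is stable
for sp in sorted(site_pkgs):
    print(sp)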

prints:

requests
requests-2.22.0.dist-info
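
The printed names still include the `.dist-info` directory, which is not importable. A possible follow-up step, sketched here on a toy `paths.json` so that it runs offline, reduces the entries to import names. The helper name `import_names`, the sample entries, and the choice of what to skip (`.dist-info`, `.egg-info`, `__pycache__`) are assumptions for illustration.

```python
def import_names(paths_json):
    """Reduce paths.json entries to candidate top-level import names."""
    names = set()
    for entry in paths_json["paths"]:
        _, sep, suffix = entry["_path"].partition("/site-packages/")
        if not sep:
            continue  # file is installed outside site-packages
        top = suffix.split("/")[0]
        # Metadata directories and bytecode caches are not import names
        if top.endswith((".dist-info", ".egg-info")) or top == "__pycache__":
            continue
        # A single-file module is imported without its .py suffix
        names.add(top[:-3] if top.endswith(".py") else top)
    return names


# Toy data in the same shape as info/paths.json (entries are made up)
toy = {
    "paths": [
        {"_path": "lib/python3.7/site-packages/requests/__init__.py"},
        {"_path": "lib/python3.7/site-packages/requests-2.22.0.dist-info/METADATA"},
        {"_path": "lib/python3.7/site-packages/six.py"},
        {"_path": "bin/some-script"},
    ]
}

print(sorted(import_names(toy)))  # ['requests', 'six']
```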

Outside the Python ecosystem, the same kind of archive can be unpacked from the shell: sudo apt install zstd; tar -I zstd -xvf archive.zst

Not always available so follow these steps:

Result:

SELECT COUNT(*) FROM conda_packages;

2262

The total package count in the listings JSON is 20,094, so 17,832 packages are not covered: only about 11% (2,262 of 20,094) of the packages on conda are in the database

TODO: identify missing values within the packages if any?

SELECT packagename FROM conda_packages WHERE packagename LIKE 'z%' LIMIT 3;

zarr
zc.lockfile
zeromq

whereas [x for x in j if x.startswith("z")] for j loaded from the listings JSON:

z5py
zaber-motion
zaber-serial
zappy

so z5py is the first example of a package on conda which didn't make it into the package database.
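
The comparison can be automated with a set difference instead of eyeballing the two lists. A small sketch using an in-memory SQLite table: the table and column names follow the queries above, but the rows and the `listing_names` excerpt are toy stand-ins, and it is an assumption that the three database names also appear in the listings.

```python
import sqlite3

# Toy database holding only the three names returned by the query above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conda_packages (packagename TEXT)")
conn.executemany(
    "INSERT INTO conda_packages VALUES (?)",
    [("zarr",), ("zc.lockfile",), ("zeromq",)],
)

# Hypothetical excerpt of the package names from the listings JSON
listing_names = [
    "z5py", "zaber-motion", "zaber-serial", "zappy",
    "zarr", "zc.lockfile", "zeromq",
]

in_db = {row[0] for row in conn.execute("SELECT packagename FROM conda_packages")}
# Names present in the listings but absent from the database
missing = sorted(set(listing_names) - in_db)

print(missing)  # ['z5py', 'zaber-motion', 'zaber-serial', 'zappy']
```

Run against the full listings and the real table, the length of `missing` should match the 17,832 figure if the counts above are right.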

lmmx commented 3 years ago

Recall: the only purpose here is to determine whether any of the package dependencies contain packages providing the registered imports, and therefore whether any of the registered imports can be dropped because they are already covered by another package's dependencies
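
A rough sketch of that final check, assuming the import names provided by each package's dependency tree have already been collected into a dict. The function name `droppable_imports` and the sample data are hypothetical, and this is not impscan's actual implementation.

```python
def droppable_imports(registered, provided_by_deps):
    """Return registered imports already provided by another registered package.

    registered: set of registered import names
    provided_by_deps: dict mapping a registered name to the set of import
        names made available by its dependencies
    """
    droppable = set()
    for name in registered:
        for other in registered:
            # A package does not make itself redundant
            if other != name and name in provided_by_deps.get(other, set()):
                droppable.add(name)
                break
    return droppable


# Toy example: urllib3 is pulled in by requests, so it need not be listed
registered = {"requests", "urllib3", "numpy"}
provided_by_deps = {
    "requests": {"urllib3", "idna", "certifi", "chardet"},
    "urllib3": set(),
    "numpy": set(),
}

print(sorted(droppable_imports(registered, provided_by_deps)))  # ['urllib3']
```

Note that with a circular dependency both names would be marked droppable, so a real implementation would need to keep at least one of them.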