repology / repology-updater

Repology backend service to update repository and package data
https://repology.org
GNU General Public License v3.0

Add support for conda channels conda-forge and bioconda #518

Closed by epruesse 5 years ago

epruesse commented 6 years ago

conda is a popular package manager in science. It installs packages into the user's home directory, supports "virtual environments" and custom "channels" (akin to PPAs). There are a few major channels that can be considered distributions in their own right, mainly bioconda and conda-forge.

Links: https://conda.io/docs/ https://conda-forge.org/ https://bioconda.github.io/

Channel packages are hosted at anaconda.org, which also offers an API for querying available packages. A repo dump can be obtained here:

https://conda.anaconda.org/{channel}/{arch}-{bits}/repodata.json

where {arch} is one of linux, osx and win, and {bits} is one of 32 and 64. (Not sure how much 32 is used; Bioconda, for example, only builds linux-64 and osx-64.)
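
For illustration, a minimal sketch of fetching one of these dumps with nothing but the Python standard library; the channel and platform values here are just examples:

    import json
    import urllib.request

    def fetch_repodata(channel, arch, bits):
        # e.g. https://conda.anaconda.org/bioconda/linux-64/repodata.json
        url = f"https://conda.anaconda.org/{channel}/{arch}-{bits}/repodata.json"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    repodata = fetch_repodata("bioconda", "linux", 64)
    print(len(repodata.get("packages", {})), "package files")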

AMDmi3 commented 6 years ago

Thanks for the suggestion, looks doable.

AMDmi3 commented 6 years ago

I don't like that it contains a lot of Python modules which are not distinguishable from other software. We already have this problem with some other repos.

epruesse commented 6 years ago

Well, they would almost all be recognizable by having pypi in the source URL. To get that, you'd have to parse the recipes, though:

At least conda-forge and bioconda are built from recipes maintained in GitHub repositories, similar to brew-based distributions. You could parse the recipes additionally/alternatively. (Conda-forge has a repo-per-package structure, but conda-forge/feedstocks has submodules for each of those, so a recursive checkout would get you all of the current ones.)

Each recipe has a meta.yaml which contains, under source: url:, the URL to the source archive. Conda-forge also keeps the maintainers under extra: maintainers: (GitHub user names); Bioconda does not enforce this.

Code for parsing the meta.yaml exists in Python in various places (Bioconda collects its tooling in bioconda/bioconda-utils, conda-forge uses conda-forge/conda-smithy); both rely on conda/conda-build to understand the actual package.
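
As a rough illustration (a sketch, not anything from bioconda-utils or conda-smithy), a naive check of a recipe's source URL. Real recipes are Jinja-templated, so plain YAML loading will fail on many of them; conda-build is needed for a faithful rendering.

    import yaml  # PyYAML

    def looks_like_pypi_package(meta_yaml_path):
        # Naive: assumes the recipe is plain YAML, which many recipes are not.
        with open(meta_yaml_path) as f:
            meta = yaml.safe_load(f)
        source = meta.get("source") or {}
        # source: may be a single mapping or a list of sources
        sources = source if isinstance(source, list) else [source]
        return any("pypi" in str(s.get("url", "")) for s in sources)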

epruesse commented 6 years ago

(I'm guessing that parsing the source URL is the most reliable way to identify packages that are named differently in various distros. Sets of versions, or keywords, would give additional information.)

epruesse commented 6 years ago

FYI: conda also has a somewhat peculiar concept called "features". For each package-version-architecture combination, there may still be more than one binary.

These are tracking "features". E.g. py36 or py27 variants of a package might be present if the package contains binary code that needs to match the Python version. Other examples are numpy111py36 to indicate something compiled for a specific numpy/Python version combination, or r3.3.2 to indicate that it was built for a specific version of R. Conda-forge uses vc9, vc10 and vc14 to distinguish binaries using the Visual Studio runtimes 2008, 2010 and 2015, respectively.

You can see the features in both the package name and the "build" value in the repodata.json. The "build" value is a string consisting of an underscore-separated list of features and the build number.
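
A heuristic sketch of splitting such a build string into its feature tags and the trailing build number; the layout is a convention, not a spec, so this is not an official parser:

    def split_build_string(build):
        # e.g. "np111py36_0" -> (["np111py36"], 0)
        parts = build.split("_")
        if parts and parts[-1].isdigit():
            return parts[:-1], int(parts[-1])
        return parts, None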

AMDmi3 commented 6 years ago

Yes, I've thought of using build values, but I don't think they are a clear indication of Python modules; e.g. they are likely set for Python apps as well. Also I've seen e.g. python and octave features at the same time, which is even more confusing.

We could also try parsing recipes. I didn't go this way because I've seen separate repos for each recipe, but if there's a single repo which can be used to check them all out at once, it may be worth trying.

epruesse commented 6 years ago

> Yes, I've thought of using build values, but I don't think they are a clear indication of Python modules; e.g. they are likely set for Python apps as well. Also I've seen e.g. python and octave features at the same time, which is even more confusing.

They also should not be present for pure python modules marked as noarch: python.

> We could also try parsing recipes. I didn't go this way because I've seen separate repos for each recipe, but if there's a single repo which can be used to check them all out at once, it may be worth trying.

You can get all conda-forge recipes here: https://github.com/conda-forge/feedstocks - it just needs a lengthy recursive checkout.

To get all information, you may have to parse both. The repodata.json will list all available packages, while the recipe repositories contain only the recipes for the current version(s), but do have extra information such as maintainer, source URL, etc.

luzpaz commented 5 years ago

I would love for this to be bumped to a higher priority. FreeCAD relies heavily on conda for its development builds.

epruesse commented 5 years ago

> I don't like that it contains a lot of Python modules which are not distinguishable from other software. We already have this problem with some other repos.

By now I believe most R and Perl modules are recognizable by beginning with r- and perl-, respectively. Due to conda's history as an extension of sorts to pip, there is no separate python namespace, however. And as with all package repositories, nothing really enforces the naming schemes that exist.

Here's a little more info on the repodata.json:

URL template: https://conda.anaconda.org/{channel}{/label}/{subdir}/repodata.json

So we've got the channel, which is effectively a distribution unto itself, e.g. conda-forge or bioconda. The label, which can be e.g. /label/main, /label/broken or /label/gcc7, is effectively a sub-distribution. And the subdir is effectively the architecture/platform. It should always match (noarch|(linux|osx|win)-(32|64)), although not all variants are offered in each channel. noarch contains non-binary packages, such as pure Python, R, Perl or Lua packages, or meta packages that will work on any architecture.

In the repodata, the key repodata_version may be present. If it is 1, there will be a removed key containing removed packages. The info key will always contain the subdir; the other information has changed between versions.

Of most interest is the packages key, which contains a key for every downloadable filename (each downloadable from a URL next to the repodata). For each file, we have name, version and build_number identifying the fully qualified package. The build key is a string containing the build number and a hash of the pinned dependencies. This will change, e.g., if the version of boost a package depends on changes, and it allows having a collection of binaries for each name/version/build_number compiled to match different libraries.

I'd ignore the keys 'arch' and 'platform', as they aren't consistently available and aren't used by conda itself.

The key depends contains a list of strings indicating the packages and versions required at runtime.

To get more information you'd have to parse the meta.yaml describing the package. Since each channel keeps these in different places, and usually in a git repository, the easiest way to consistently get this file is to download the package. It's at info/recipe/meta.yaml and contains all data from the packages section of repodata.json plus further information such as the project home page, license, summary, maintainers, etc.
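
To sketch what consuming that structure could look like (reusing the repodata layout described above): collect versions per package name and apply the r-/perl- prefix heuristic mentioned earlier.

    from collections import defaultdict

    def collect_versions(repodata):
        # repodata as loaded from repodata.json; "packages" maps filename -> metadata
        versions = defaultdict(set)
        for filename, info in repodata.get("packages", {}).items():
            versions[info["name"]].add(info["version"])
        return versions

    def guess_namespace(name):
        # Prefix heuristics only; nothing enforces these naming schemes.
        if name.startswith("r-"):
            return "R module"
        if name.startswith("perl-"):
            return "Perl module"
        return "unknown"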

AMDmi3 commented 5 years ago

The code is committed, but I'm not enabling it. For instance, there are 65.0% unique (i.e. not matched with other repos) packages in bioconda, even with perl-/r- normalization, and these are mostly Python modules. There's also a lot of incorrect merges due to the same reason. This would be mostly useless for conda and harmful for other repos, so it's not acceptable.

Also, I'm now hesitant to add repositories which do not provide information on project homepages, as this information is crucial for distinguishing similarly named projects, which are becoming more numerous with each added repo.

A compilation of preprocessed meta.yaml files could probably be more usable as a data source. It would also eliminate the package duplication for different platforms and arches.

epruesse commented 5 years ago

> The code is committed, but I'm not enabling it.

Thanks!

> For instance, there are 65.0% unique (i.e. not matched with other repos) packages in bioconda, even with perl-/r- normalization, and these are mostly Python modules.

That's pretty much expected, at least for Bioconda. It exists because a lot of software relevant to bioinformatics is not packaged by the major distros, and even when it is, the packages are too old. The language-specific repos (CRAN, PyPI, CPAN, ...), on the other hand, don't work that nicely for things that require compilation, plus there is the issue of software depending on things written in Perl/Python/R, which isn't handled by either of those (and was the motivation for conda).

> There's also a lot of incorrect merges due to the same reason. This would be mostly useless for conda and harmful for other repos, so it's not acceptable.

You mean resolution of package names, right? Yes, that would be a problem. I honestly don't know how to resolve this.

> Also, I'm now hesitant to add repositories which do not provide information on project homepages, as this information is crucial for distinguishing similarly named projects, which are becoming more numerous with each added repo.

Perhaps we can get someone at anaconda.org to help with this. What information specifically would you need to have in the repodata?

The only other thing I can offer is Bioconda-specific. We parse all our meta.yamls to build our home page (bioconda.github.io) anyway, so I can put a digest with more information than provided by the repodata on that website. One catch here, though, is that it mostly only considers the most current recipe, so without digging through the git history or downloading all of our data from anaconda to look at the contained meta.yaml, I can't get at historic data.

> A compilation of preprocessed meta.yaml files could probably be more usable as a data source.

Yes. Since conda-forge hosts its build recipes each in its own git repo, while Bioconda hosts everything in one big repo, the process would have to be different, though. If you are interested, I can provide you with the necessary code or data from the Bioconda side. E.g. we could have something in bioconda-utils (a Python module with a CLI) that extracts what you need from a checkout of bioconda-recipes.

> It would also eliminate the package duplication for different platforms and arches.

Yes and no. The meta.yaml format has, although young, already acquired quite a bit of legacy structure. It's not actually YAML any more. When writing the Bioconda "auto update" tool, I had to mess with this...

Basically, the conda-build tool applies a lot of processing to that meta.yaml to be able to execute matrix builds. From one meta.yaml, you can build for various architectures, but also for various versions of Perl/Python/Lua/R and versions of core libraries (boost, zlib, openssl, numpy, ...). This was originally achieved by having line-based selectors (# [linux], # [not win], # [python >3.2 and not win]). To add power, Jinja2 processing was added, so recipes became more of a meta.yaml_t. With conda-build 3, yet more complexity was added, e.g. using {{ compiler('cxx') }} to inject whatever compiler settings the build setup of the distribution has for compiling C++ packages (including library dependencies). The result is a legacy-laden, quite difficult to understand parsing process that yields any number of meta.yamls for a single recipe. conda-build has an API for this, but for reasons I have yet to understand it regularly decides to actually download the source package to prepare the final set of meta.yaml structures. Running this for all packages might take quite a while.
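
As a rough illustration of how lossy a shortcut has to be (this is nowhere near what conda-build actually does), a naive preprocessor could strip the line selectors and blank out Jinja expressions before YAML-loading, recovering simple fields like about: home: or source: url: in many, but not all, recipes:

    import re
    import yaml  # PyYAML

    def load_raw_recipe(text):
        # Lossy by design: drops "# [linux]"-style selectors and Jinja bits,
        # then hopes the rest parses as plain YAML.
        cleaned = []
        for line in text.splitlines():
            line = re.sub(r"#\s*\[[^\]]*\]\s*$", "", line)  # line selectors
            line = re.sub(r"\{\{[^}]*\}\}", "", line)        # Jinja expressions
            line = re.sub(r"\{%[^%]*%\}", "", line)          # Jinja statements
            cleaned.append(line)
        return yaml.safe_load("\n".join(cleaned)) or {}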

That said - we can parse the raw pseudo yaml recipe to extract the information you need here. Would you prefer doing that with your own scripts, with scripts provided by the channel, or by accessing data we keep online for you?

AMDmi3 commented 5 years ago

> or by accessing data we keep online for you?

This is actually the only option. With 200+ repos to maintain, I don't have the ability to maintain additional repository-specific code, and using external utilities would hinder Repology's portability. With more detailed data on hand, maybe I'll be able to find a way to separate Python modules.

AMDmi3 commented 5 years ago

Here's detailed documentation of what Repology expects: https://repology.org/addrepo. Closing this issue, as there's nothing to do on the Repology side until engaged parties publish data in a suitable format.

awvwgk commented 2 years ago

I wonder whether conda-forge's own format of the repository at https://github.com/regro/libcfgraph would provide better access to the conda-forge repository than the repodata.json from anaconda.org does (which is known to miss a lot of the required metadata readily available in the actual conda-forge recipes).

There is also a mapping for all Python packages from PyPI to the conda-forge names available at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.yaml.

jayvdb commented 2 years ago

@awvwgk, that looks good. It seems to meet the current Repology criteria.

yarikoptic commented 2 years ago

There is also https://conda.anaconda.org/conda-forge/channeldata.json, which has records like:

    "datalad": {
      "activate.d": false,
      "binary_prefix": false,
      "deactivate.d": false,
      "description": "DataLad aims to make data management and data distribution more accessible. To do that it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange. This includes automated ingestion of data from online portals, and exposing it in readily usable form as Git(-annex) repositories, so-called datasets. The actual data storage and permission management, however, remains with the original data providers.",
      "dev_url": "https://github.com/datalad/datalad",
      "doc_url": "http://datalad.readthedocs.io/",
      "home": "http://datalad.org",
      "license": "MIT",
      "post_link": false,
      "pre_link": false,
      "pre_unlink": false,
      "run_exports": {},
      "source_url": "https://pypi.io/packages/source/d/datalad/datalad-0.17.6.tar.gz",
      "subdirs": [
        "linux-64",
        "noarch",
        "osx-64",
        "win-64"
      ],
      "summary": "data distribution geared toward scientific datasets",
      "text_prefix": true,
      "timestamp": 1664986525,
      "version": "0.17.6"
    },

where the name is the key and source_url should uniquely identify all sources. Most sources come from PyPI:

❯ grep source_url channeldata.json | sed -n -e '/source_url"/s,.*source_url": "\(.*://[^/]*\)/.*"\,*,\1,gp'| sort | uniq -c | sort -n | tail
     40 https://git.ligo.org
     53 https://gitlab.com
     54 https://software.igwn.org
     71 https://www.x.org
     95 https://cpan.metacpan.org
    114 https://pypi.org
    144 https://rubygems.org
    223 http://msys2-sources.continuum.io
   3632 https://github.com
   9753 https://pypi.io

but PyPI on its own is disabled ATM :-/ Somewhat related good news is that for each {name} there AFAIK should be a GitHub repo with https://github.com/conda-forge/{name}-feedstock/blob/main/recipe/meta.yaml, which would provide more of the possibly desired metadata (e.g. checksums for sources, if those could be used; dunno).

PS: actually, packages do seem to be provided with PyPI information, e.g. https://repology.org/project/python:lazy-loader/versions and even our https://repology.org/project/python:datalad/versions -- although the correct thing would be to unite python:datalad and datalad, I guess, since they are "the same thing" really.
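
For completeness, the same kind of tally as the shell one-liner above, done in Python against channeldata.json; this sketch assumes the entries sit under a top-level "packages" key, matching the record shown above:

    import json
    import urllib.request
    from collections import Counter
    from urllib.parse import urlparse

    URL = "https://conda.anaconda.org/conda-forge/channeldata.json"
    with urllib.request.urlopen(URL) as resp:
        channeldata = json.load(resp)

    # Tally the hosts of source_url across all packages.
    hosts = Counter(
        urlparse(info["source_url"]).netloc
        for info in channeldata.get("packages", {}).values()
        if isinstance(info.get("source_url"), str)
    )
    print(hosts.most_common(10))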

AMDmi3 commented 2 years ago

> There is also https://conda.anaconda.org/conda-forge/channeldata.json, which has records like:

This is definitely better than repodata, but the main problem remains.

> where the name is the key and source_url should uniquely identify all sources

Not reliable. Example:

    "base58": {
      "activate.d": false,
      "binary_prefix": false,
      "deactivate.d": false,
      "home": "https://github.com/keis/base58",
      "license": "MIT",
      "post_link": false,
      "pre_link": false,
      "pre_unlink": false,
      "run_exports": {},
      "source_url": "https://github.com/keis/base58/archive/v2.1.1.tar.gz",
      "subdirs": [ 
        "noarch"
      ],
      "summary": "Base58 and Base58Check implementation",
      "text_prefix": false,
      "timestamp": 1635724257,
      "version": "2.1.1"
    },

AMDmi3 commented 2 years ago

Let me repeat myself: for Repology to support conda, there needs to be a single JSON file which makes Python modules reliably distinguishable. Multi-gigabyte (compressed!) repositories, third-party package name mappings, or fetching an additional file for each package (which, on top of that, is non-self-contained templated YAML which cannot even be expanded) are absolutely not acceptable.

dholth commented 1 year ago

This is what meta.yaml looks like after it has been rendered into JSON; this will be unique per package file (so there are per-name/version/platform copies of the same recipe). Obviously not accessible to Repology, but easier to parse?

{
    "package": {
        "name": "numpy-base",
        "version": "1.11.3"
    },
    "source": {
        "patches": [
            "disable_einsum_int16_test.patch",
            "fortran_regex.patch",
            "gfortran_alias.patch",
            "mklfft.patch"
        ],
        "sha256": "956afdeb9b5600e873326e410e9379684dac8f8f47ea569151a417984e7799cf",
        "url": "https://github.com/numpy/numpy/archive/v1.11.3.tar.gz"
    },
    "build": {
        "force_use_keys": [
            "python"
        ],
        "noarch": false,
        "number": "11",
        "script": "install_base.sh",
        "string": "py36h2f8d375_11"
    },
    "requirements": {
        "build": [
            "binutils_impl_linux-32 2.31.1 he3168a9_1",
            "binutils_linux-32 2.31.1 he3168a9_3",
            "gcc_impl_linux-32 7.3.0 he2ea625_1",
            "gcc_linux-32 7.3.0 hd2c3c17_3",
            "gfortran_impl_linux-32 7.3.0 h9268252_1",
            "gfortran_linux-32 7.3.0 hd2c3c17_3",
            "libgcc-ng 8.2.0 h9268252_1",
            "libgfortran-ng 7.3.0 h9268252_0",
            "libstdcxx-ng 8.2.0 h9268252_1"
        ],
        "host": [
            "blas 1.0 openblas",
            "ca-certificates 2018.03.07 0",
            "certifi 2018.11.29 py36_0",
            "cython 0.29 py36he6710b0_0",
            "libedit 3.1.20170329 h6b74fdf_2",
            "libffi 3.2.1 h97ff0df_4",
            "libgcc-ng 8.2.0 h9268252_1",
            "libgfortran-ng 7.3.0 h9268252_0",
            "libopenblas 0.3.3 h5a2b251_3",
            "libstdcxx-ng 8.2.0 h9268252_1",
            "ncurses 6.1 he6710b0_1",
            "nomkl 3.0 0",
            "openblas-devel 0.3.3 3",
            "openssl 1.1.1a h7b6447c_0",
            "python 3.6.7 h0371630_0",
            "readline 7.0 h7b6447c_5",
            "setuptools 40.6.2 py36_0",
            "sqlite 3.25.3 h7b6447c_0",
            "tk 8.6.8 hbc83047_0",
            "xz 5.2.4 h14c3975_4",
            "zlib 1.2.11 h7b6447c_3"
        ],
        "run": [
            "blas * openblas",
            "libgcc-ng >=7.3.0",
            "libgfortran-ng >=7,<8.0a0",
            "libopenblas >=0.3.3,<1.0a0",
            "python >=3.6,<3.7.0a0"
        ]
    },
    "test": {
        "commands": [
            "test -e $SP_DIR/numpy/distutils/site.cfg"
        ]
    },
    "extra": {
        "copy_test_source_files": true,
        "final": true,
        "recipe-maintainers": [
            "jakirkham",
            "msarahan",
            "ocefpaf",
            "pelson",
            "rgommers"
        ]
    }
}

AMDmi3 commented 1 year ago

The format itself is parsable, but the distribution of these would not necessarily be. The size and the number of entries could pose a problem. And I still see no markers of a python module.

dholth commented 1 year ago

It looks like that one is a bit old; the newer ones have better about sections, like:

    "about": {
        "description": "NumPy is the fundamental package needed for scientific computing with Python.\n",
        "dev_url": "https://github.com/numpy/numpy",
        "doc_source_url": "https://github.com/numpy/numpy/tree/main/doc",
        "doc_url": "https://numpy.org/doc/stable/reference/",
        "home": "https://numpy.org/",
        "license": "BSD-3-Clause",
        "license_family": "BSD",
        "license_file": "LICENSE.txt",
        "license_url": "https://github.com/numpy/numpy/blob/main/LICENSE.txt",
        "summary": "Array processing for numbers, strings, records, and objects."
    },

We'll have to see about the PyPI link; it has a "run" dependency on python at least. More ordinary packages like sqlalchemy just link to PyPI for their source code.
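
Just to make those two signals concrete, a sketch of a heuristic over the rendered metadata shown above (a sketch only; as noted below, neither signal is reliable on its own):

    def python_module_signals(rendered_meta):
        # rendered_meta: the per-package JSON structure shown above.
        run_deps = rendered_meta.get("requirements", {}).get("run", [])
        depends_on_python = any(
            d.split()[0] == "python" for d in run_deps if d.strip()
        )
        source = rendered_meta.get("source", {}) or {}
        sources = source if isinstance(source, list) else [source]
        pypi_source = any("pypi" in str(s.get("url", "")) for s in sources)
        return depends_on_python, pypi_source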

AMDmi3 commented 1 year ago

> We'll have to see about the PyPI link; it has a "run" dependency on python at least. More ordinary packages like sqlalchemy just link to PyPI for their source code.

Still, neither of these is reliable.

jaimergp commented 1 year ago

conda-forge bots maintain a PyPI mapping at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.json. The logic is defined in this module.

If a PyPI package is in conda-forge, it's contained in this mapping. Right now the logic is a bit too strict (it requires a PyPI source), but I am willing to extend this to do further checks if required. Would that be enough to identify Python modules (i.e. the package is already on PyPI) with sufficient accuracy? Thanks!
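
A hypothetical sketch of how that mapping could be consumed on the Repology side; the top-level layout and the "conda_name"/"pypi_name" field names are assumptions about the mapping's schema and may need adjusting:

    import json
    import urllib.request

    MAPPING_URL = ("https://raw.githubusercontent.com/regro/cf-graph-countyfair/"
                   "master/mappings/pypi/grayskull_pypi_mapping.json")

    with urllib.request.urlopen(MAPPING_URL) as resp:
        mapping = json.load(resp)

    # Build a conda-name -> PyPI-name lookup; field names are assumed, not verified.
    conda_to_pypi = {
        entry["conda_name"]: entry["pypi_name"]
        for entry in mapping.values()
        if isinstance(entry, dict) and "conda_name" in entry and "pypi_name" in entry
    }

    def is_python_module(conda_name):
        # Treat presence in the mapping as "this conda package is a PyPI project".
        return conda_name in conda_to_pypi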

AMDmi3 commented 1 year ago

https://github.com/repology/repology-updater/issues/518#issuecomment-1276690149