Thanks for the suggestion, looks doable.
I don't like that it contains a lot of python modules not distinguishable from other software. We already have this problem with some other repos.
Well, they would almost all be recognizable by having pypi in the source url. To get that, you'd have to parse the recipes though:
At least conda-forge and bioconda are built from recipes maintained in github repositories, similar to brew based distributions. You could parse the recipes additionally/alternatively. (Conda-forge has a repo-per-package structure, but conda-forge/feedstocks has submodules for each of those, so a recursive checkout would get you all of the current ones).
Each recipe has a meta.yaml which contains, in source: url:, the URL to the source archive. Conda-forge also keeps the maintainers in extra: recipe-maintainers: (github user names); Bioconda does not enforce this.
Code for parsing the meta.yaml exists in python in various places (bioconda uses bioconda/bioconda-utils, conda-forge uses conda-forge/conda-smithy to collect tools); both rely on conda/conda-build to understand the actual package.
(I'm guessing that parsing the source URL is the most reliable way to distinguish packages named differently in various distros. Sets of versions would give additional info, or keywords).
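For illustration, a minimal sketch of such recipe parsing (using PyYAML and Jinja2; this is an assumption of mine, not the bioconda/conda-forge tooling). It only copes with recipes whose templating is simple; anything relying on conda-build's Jinja helpers is skipped:

import yaml      # PyYAML, assumed available
import jinja2

def parse_recipe(path):
    """Pull name, source URL and maintainers out of a simple meta.yaml."""
    text = open(path).read()
    try:
        # Render the Jinja2 template; helpers such as {{ compiler('cxx') }}
        # are not defined here, so recipes using them are skipped.
        rendered = jinja2.Template(text).render()
        meta = yaml.safe_load(rendered)
    except (jinja2.TemplateError, yaml.YAMLError):
        return None
    if not meta:
        return None
    source = meta.get("source") or {}
    if isinstance(source, list):      # some recipes list several sources
        source = source[0] if source else {}
    return {
        "name": (meta.get("package") or {}).get("name"),
        "source_url": source.get("url"),
        "maintainers": (meta.get("extra") or {}).get("recipe-maintainers", []),
    }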
FYI: conda also has a somewhat queer feature called "features". For each package-version-architecture combination, there may still be more than one binary. These are tracking "features". E.g. py36 or py27 variants of a package might be present if the package contains binary code that needs to match the python version. Other examples are numpy111py36 to indicate something compiled for a specific numpy/python version combination, or r3.3.2 to indicate that it was built for a specific version of R. Conda-forge uses vc9, vc10 and vc14 to distinguish binaries using the Visual Studio runtimes 2008, 2010 and 2015, respectively.
You can see the features in both the package name and the "build" value in the repodata.json. The "build" value is a string containing an underscore-separated list of features and the build number.
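A tiny sketch of that convention (assuming the build string always ends in "_<build number>", which holds for the examples above but is not formally guaranteed):

def split_build_string(build):
    """Split a conda build string into its feature/variant part and the build number."""
    variant, _, number = build.rpartition("_")
    return variant, int(number)

print(split_build_string("py36_0"))             # ('py36', 0)
print(split_build_string("np111py27_2"))        # ('np111py27', 2)
print(split_build_string("py36h2f8d375_11"))    # ('py36h2f8d375', 11)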
Yes, I've thought of using build values, but I don't think they are a clear indication of python modules, e.g. they are likely set for python apps as well. Also I've seen e.g. python and octave features at the same time, which is even more confusing.
Could also try parsing recipes; I didn't go this way because I've seen separate repos for each recipe, but if there's a single repo which may be used to check them all out at once, it may be worth trying.
Yes, I've thought of using build values, but I don't think they are a clear indication of python modules, e.g. they are likely set for python apps as well. Also I've seen e.g. python and octave features at the same time, which is even more confusing.
They also should not be present for pure python modules marked as noarch: python.
Could also try parsing recipes; I didn't go this way because I've seen separate repos for each recipe, but if there's a single repo which may be used to check them all out at once, it may be worth trying.
You can get all conda-forge recipes here: https://github.com/conda-forge/feedstocks - it just needs a lengthy recursive checkout.
To get all information, you may have to parse both. The repodata.json will list all available packages, while the recipe repositories contain only the recipes for the current version(s), but do have extra information such as maintainer, source URL, etc.
I would love for this to be bumped to a higher priority. FreeCAD relies heavily on conda for its development builds.
I don't like that it contains a lot of python modules not distinguishable from other software. We already have this problem with some other repos.
By now I believe most R and Perl modules are recognizable by beginning with r- and perl-, respectively. Due to conda's history as an extension of sorts to pip, there is no separate python namespace, however. And as with all package repositories, nothing really enforces the naming schemes that exist.
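As a rough illustration of how far that naming convention gets you, and where it stops (the helper below is just a sketch of mine):

PREFIXES = {"r-": "r", "perl-": "perl"}

def guess_language(package_name):
    """Best-effort guess based purely on the name prefix convention."""
    for prefix, language in PREFIXES.items():
        if package_name.startswith(prefix):
            return language
    return None          # python modules end up here, indistinguishable

print(guess_language("r-ggplot2"))   # 'r'
print(guess_language("perl-dbi"))    # 'perl'
print(guess_language("numpy"))       # None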
Here's a little more info on the repodata.json:
URL template: https://conda.anaconda.org/{channel}{/label}/{subdir}/repodata.json
So we've got the channel, which is effectively a distribution unto itself, e.g. conda-forge or bioconda. The label, which can be e.g. /label/main or /label/broken or /label/gcc7, is effectively a sub-distribution. And then the subdir, which is effectively the architecture/platform. It should always match (noarch|(linux|osx|win)-(32|64)), although not all variants are offered in each channel. noarch contains non-binary packages such as pure python, R, perl or lua, or meta packages that will work on any architecture.
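Putting the scheme together, a small sketch that builds the candidate repodata URLs for a channel (some combinations will simply return 404):

SUBDIRS = ["noarch",
           "linux-32", "linux-64",
           "osx-32", "osx-64",
           "win-32", "win-64"]

def repodata_urls(channel, label=None):
    base = f"https://conda.anaconda.org/{channel}"
    if label:
        base += f"/label/{label}"        # e.g. "main", "broken", "gcc7"
    return [f"{base}/{subdir}/repodata.json" for subdir in SUBDIRS]

for url in repodata_urls("bioconda"):
    print(url)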
In the repodata, the key repodata_version may be present. If it is one, there will be a removed key containing removed packages. The info key will always contain the subdir; the other information has changed between versions.
Of most interest is the packages key, which contains keys for every downloadable filename (URL next to repodata). For each file, we have name, version and build_number indicating the fully qualified package. The build key is a string containing the build number and a hash of the pinned dependencies. This will e.g. change if the version of boost a package depends on changes, and allows having a collection of binaries for each name/version/build_number compiled to match different libraries.
The keys 'arch' and 'platform' I'd ignore, as they aren't consistently available and are not used by conda itself.
The key depends contains a list of strings indicating the packages and versions required at runtime.
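A short sketch of reading those fields (note that a single repodata.json can be tens of megabytes, and newer repodata may also carry a "packages.conda" key which this ignores):

import json
import urllib.request

url = "https://conda.anaconda.org/bioconda/linux-64/repodata.json"
with urllib.request.urlopen(url) as resp:
    repodata = json.load(resp)

for filename, info in list(repodata.get("packages", {}).items())[:5]:
    print(filename)
    print("  name / version / build_number:",
          info["name"], info["version"], info["build_number"])
    print("  build string:", info["build"])
    # depends entries look like "python >=3.6,<3.7.0a0"
    print("  needs python:", any(dep.split()[0] == "python"
                                 for dep in info.get("depends", [])))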
To get more information, you'd have to parse the meta.yaml describing the package. Since each channel keeps these in different places, and usually in a git repository, the easiest way to consistently get this file would mean downloading the package. It's in info/recipe/meta.yaml, and contains all data from repodata.json/packages as well as further information such as the project home page, license, summary, maintainers, etc.
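A sketch of that route, assuming the classic .tar.bz2 package format (the newer .conda format needs a different extraction step); the package chosen here is simply whichever entry comes first in the repodata:

import io
import json
import tarfile
import urllib.request

channel, subdir = "bioconda", "noarch"
base = f"https://conda.anaconda.org/{channel}/{subdir}"

with urllib.request.urlopen(f"{base}/repodata.json") as resp:
    repodata = json.load(resp)

# any .tar.bz2 entry will do; the rendered recipe ships inside the archive
filename = next(iter(repodata["packages"]))

with urllib.request.urlopen(f"{base}/{filename}") as resp:
    archive = io.BytesIO(resp.read())

with tarfile.open(fileobj=archive, mode="r:bz2") as tar:
    meta_yaml = tar.extractfile("info/recipe/meta.yaml").read().decode()

print(meta_yaml[:500])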
The code is committed, but I'm not enabling it. For instance, there are 65.0% unique (i.e. not matched with other repos) packages in bioconda, even with perl-/r- normalization, and these are mostly python modules. There's also a lot of incorrect merges due to the same reason. This would be mostly useless for conda and harmful for other repos, so not acceptable.
Also, I'm now hesitant to add repositories which do not provide information on project homepages, as this information is crucial for distinguishing similarly named projects, which are becoming more numerous with each added repo.
A compilation of preprocessed meta.yaml files could probably be more usable as a data source. It would also eliminate the package duplication for different platforms and arches.
The code is committed, but I'm not enabling it.
Thanks!
For instance, there are 65.0% unique (i.e. not matched with other repos) packages in bioconda, even with perl-/r- normalization, and these are mostly python modules.
That's pretty much expected, at least for Bioconda. It exists because a lot of software relevant to bioinformatics is not packaged with the major distros, and even if it is, the packages are too old. The language-specific repos (CRAN, PyPI, CPAN, ...), OTOH, don't work that nicely for things that require compilation, plus there is the issue of software depending on things written in Perl/Python/R that isn't handled by either of those (which was the motivation for conda).
There's also a lot of incorrect merges due to the same reason. This would be mostly useless for conda and harmful for other repos, so not acceptable.
You mean resolution of package names, right? Yes, that would be a problem. I honestly don't know how to resolve this.
Also, I'm now hesitant to add repositories which do not provide information on project homepages, as this information is crucial for distinguishing similarly named projects, which are becoming more numerous with each added repo.
Perhaps we can get someone at anaconda.org to help with this. What information specifically would you need to have in the repodata?
The only other thing I can offer is Bioconda specific. We parse all our meta.yamls to build our home page (bioconda.github.io) anyway, so I can put a digest with more information than provided by the repodata on that website. One catch, though, is that it only considers the most current recipe in most cases, so without digging through the git history or downloading all of our data from anaconda to look at the contained meta.yaml, I can't get at historic data.
A compilation of preprocessed meta.yaml files could probably be more usable as a data source.
Yes. Since conda-forge hosts its build recipes each in a single git repo, while Bioconda hosts everything in one big repo, the process would have to be different though. If you are interested, I can provide you with the necessary code or data from the Bioconda side. E.g. we could have something in bioconda-utils (python module with CLI) that extracts what you need from a checkout of bioconda-recipes.
It would also eliminate the package duplication for different platforms and arches.
Yes and no. The meta.yaml format has, although young, already acquired quite a bit of legacy structure. It's no longer actually yaml. When writing the Bioconda "auto update" tool, I've had to mess with this...
Basically, the conda-build tool applies a lot of processing to that meta.yaml to be able to execute matrix builds. From one meta.yaml, you can build for various architectures, but also for various versions of Perl/Python/Lua/R and versions of core libraries (boost, zlib, openssl, numpy, ...). This was originally achieved with line-based selectors (# [linux], # [not win], # [python >3.2 and not win]). To add power, Jinja2 processing was added, so recipes became more of a meta.yaml_t. With conda-build 3, yet more complexity was added, using e.g. {{ compiler('cxx') }} to inject whatever compiler settings the build setup of the distribution has for compiling C++ packages (including library dependencies). The result is a legacy-laden, quite difficult to understand parsing process that yields any number of meta.yamls for a single recipe. conda-build has an API for this, but for reasons I have yet to understand it regularly decides to actually download the source package to prepare the final set of meta.yaml structures. Running this for all packages might take quite a while.
That said - we can parse the raw pseudo yaml recipe to extract the information you need here. Would you prefer doing that with your own scripts, with scripts provided by the channel, or by accessing data we keep online for you?
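To make that concrete, a best-effort sketch of such raw parsing: strip the line selectors, give Jinja2 a few stand-ins for conda-build's helpers (compiler, pin_compatible, ... -- names assumed from commonly seen recipes), and yaml-load whatever survives. This approximates, and does not reproduce, conda-build's rendering:

import re
import yaml
import jinja2

SELECTOR = re.compile(r"#\s*\[[^\]]*\]\s*$")   # matches "  # [not win]" etc.

def render_raw_recipe(text):
    # drop line selectors like "# [linux]" (YAML treats them as comments anyway)
    text = "\n".join(SELECTOR.sub("", line) for line in text.splitlines())
    env = jinja2.Environment(undefined=jinja2.ChainableUndefined)
    stand_ins = {
        "compiler": lambda lang: f"{lang}_compiler_stub",
        "pin_compatible": lambda *a, **kw: "",
        "pin_subpackage": lambda *a, **kw: "",
        "cdt": lambda *a, **kw: "",
        "environ": {},
    }
    try:
        return yaml.safe_load(env.from_string(text).render(**stand_ins))
    except (jinja2.TemplateError, yaml.YAMLError):
        return None        # recipe really needs conda-build to render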
or by accessing data we keep online for you?
This is actually the only option. With 200+ repos to maintain, I don't have the ability to maintain additional repository-specific code, and using external utilities would hinder repology portability. With more detailed data on hand, maybe I'll be able to find a way to separate python modules.
Here's detailed documentation of what Repology expects: https://repology.org/addrepo. Closing this issue, as there's nothing to do on the Repology side until engaged parties publish data in a suitable format.
I wonder whether conda-forge's own format of the repository at https://github.com/regro/libcfgraph would provide better access to the conda-forge repository than the repodata.json from anaconda.org does (which is known to miss a lot of the required metadata readily available in the actual conda-forge recipes).
There is also a mapping for all Python packages from PyPI to the conda-forge names available at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.yaml.
@awvwgk, that looks good. It seems to meet the current repology criteria.
there is also https://conda.anaconda.org/conda-forge/channeldata.json which has records e.g.
"datalad": {
"activate.d": false,
"binary_prefix": false,
"deactivate.d": false,
"description": "DataLad aims to make data management and data distribution more accessible. To do that it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange. This includes automated ingestion of data from online portals, and exposing it in readily usable form as Git(-annex) repositories, so-called datasets. The actual data storage and permission management, however, remains with the original data providers.",
"dev_url": "https://github.com/datalad/datalad",
"doc_url": "http://datalad.readthedocs.io/",
"home": "http://datalad.org",
"license": "MIT",
"post_link": false,
"pre_link": false,
"pre_unlink": false,
"run_exports": {},
"source_url": "https://pypi.io/packages/source/d/datalad/datalad-0.17.6.tar.gz",
"subdirs": [
"linux-64",
"noarch",
"osx-64",
"win-64"
],
"summary": "data distribution geared toward scientific datasets",
"text_prefix": true,
"timestamp": 1664986525,
"version": "0.17.6"
},
where name is the key and source_url should uniquely identify all sources. Most of the sources come from PyPI:
❯ grep source_url channeldata.json | sed -n -e '/source_url"/s,.*source_url": "\(.*://[^/]*\)/.*"\,*,\1,gp'| sort | uniq -c | sort -n | tail
40 https://git.ligo.org
53 https://gitlab.com
54 https://software.igwn.org
71 https://www.x.org
95 https://cpan.metacpan.org
114 https://pypi.org
144 https://rubygems.org
223 http://msys2-sources.continuum.io
3632 https://github.com
9753 https://pypi.io
but PyPI on its own is disabled ATM :-/ Somewhat related good news is that for each {name} there AFAIK should be a github repo with https://github.com/conda-forge/{name}-feedstock/blob/main/recipe/meta.yaml which would provide more of the possibly desired metadata (e.g. checksums for sources, if those could be used, dunno).
PS: actually packages seem to be provided with information about PyPI, e.g. https://repology.org/project/python:lazy-loader/versions and even our https://repology.org/project/python:datalad/versions -- although the correct thing would be to unite python:datalad and datalad I guess, since they are "the same thing" really.
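Going back to the source_url tally above: a rough Python variant of that heuristic (assuming channeldata.json keeps its per-package entries under a top-level "packages" key, as the current conda-forge file does). A pypi.io/pypi.org source_url is a strong hint for a Python module, while a github.com one proves nothing either way:

import json
import urllib.request
from urllib.parse import urlparse

URL = "https://conda.anaconda.org/conda-forge/channeldata.json"
PYPI_HOSTS = {"pypi.io", "pypi.org", "files.pythonhosted.org"}

with urllib.request.urlopen(URL) as resp:
    channeldata = json.load(resp)

likely_python = {
    name for name, entry in channeldata.get("packages", {}).items()
    if urlparse(entry.get("source_url") or "").netloc in PYPI_HOSTS
}
print(len(likely_python), "packages with a PyPI source_url")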
there is also https://conda.anaconda.org/conda-forge/channeldata.json which has records e.g.
This is definitely better than repodata, still the main problem remains.
where name is the key and source_url should uniquely identify all sources
Not reliable. Example:
"base58": {
"activate.d": false,
"binary_prefix": false,
"deactivate.d": false,
"home": "https://github.com/keis/base58",
"license": "MIT",
"post_link": false,
"pre_link": false,
"pre_unlink": false,
"run_exports": {},
"source_url": "https://github.com/keis/base58/archive/v2.1.1.tar.gz",
"subdirs": [
"noarch"
],
"summary": "Base58 and Base58Check implementation",
"text_prefix": false,
"timestamp": 1635724257,
"version": "2.1.1"
},
Let me repeat myself: for Repology to support conda, there needs to be a single JSON file which makes python modules reliably distinguishable. Multi-gigabyte (compressed!) repositories, third-party package name mappings, or fetching an additional file per package (which on top of that is non-self-contained templated yaml that cannot even be expanded) are absolutely not acceptable.
This is what meta.yaml looks like after it has been rendered into json; it will be unique per package file (so there are name/version/platform copies of the same thing). Obviously not accessible to repology, but easier to parse?
{
    "package": {
        "name": "numpy-base",
        "version": "1.11.3"
    },
    "source": {
        "patches": [
            "disable_einsum_int16_test.patch",
            "fortran_regex.patch",
            "gfortran_alias.patch",
            "mklfft.patch"
        ],
        "sha256": "956afdeb9b5600e873326e410e9379684dac8f8f47ea569151a417984e7799cf",
        "url": "https://github.com/numpy/numpy/archive/v1.11.3.tar.gz"
    },
    "build": {
        "force_use_keys": [
            "python"
        ],
        "noarch": false,
        "number": "11",
        "script": "install_base.sh",
        "string": "py36h2f8d375_11"
    },
    "requirements": {
        "build": [
            "binutils_impl_linux-32 2.31.1 he3168a9_1",
            "binutils_linux-32 2.31.1 he3168a9_3",
            "gcc_impl_linux-32 7.3.0 he2ea625_1",
            "gcc_linux-32 7.3.0 hd2c3c17_3",
            "gfortran_impl_linux-32 7.3.0 h9268252_1",
            "gfortran_linux-32 7.3.0 hd2c3c17_3",
            "libgcc-ng 8.2.0 h9268252_1",
            "libgfortran-ng 7.3.0 h9268252_0",
            "libstdcxx-ng 8.2.0 h9268252_1"
        ],
        "host": [
            "blas 1.0 openblas",
            "ca-certificates 2018.03.07 0",
            "certifi 2018.11.29 py36_0",
            "cython 0.29 py36he6710b0_0",
            "libedit 3.1.20170329 h6b74fdf_2",
            "libffi 3.2.1 h97ff0df_4",
            "libgcc-ng 8.2.0 h9268252_1",
            "libgfortran-ng 7.3.0 h9268252_0",
            "libopenblas 0.3.3 h5a2b251_3",
            "libstdcxx-ng 8.2.0 h9268252_1",
            "ncurses 6.1 he6710b0_1",
            "nomkl 3.0 0",
            "openblas-devel 0.3.3 3",
            "openssl 1.1.1a h7b6447c_0",
            "python 3.6.7 h0371630_0",
            "readline 7.0 h7b6447c_5",
            "setuptools 40.6.2 py36_0",
            "sqlite 3.25.3 h7b6447c_0",
            "tk 8.6.8 hbc83047_0",
            "xz 5.2.4 h14c3975_4",
            "zlib 1.2.11 h7b6447c_3"
        ],
        "run": [
            "blas * openblas",
            "libgcc-ng >=7.3.0",
            "libgfortran-ng >=7,<8.0a0",
            "libopenblas >=0.3.3,<1.0a0",
            "python >=3.6,<3.7.0a0"
        ]
    },
    "test": {
        "commands": [
            "test -e $SP_DIR/numpy/distutils/site.cfg"
        ]
    },
    "extra": {
        "copy_test_source_files": true,
        "final": true,
        "recipe-maintainers": [
            "jakirkham",
            "msarahan",
            "ocefpaf",
            "pelson",
            "rgommers"
        ]
    }
}
The format itself is parsable, but the distribution of these would not necessarily be. The size and the number of entries could pose a problem. And I still see no markers of a python module.
It looks like that one is a bit old; the newer ones have better about sections, like:
"about": {
"description": "NumPy is the fundamental package needed for scientific computing with Python.\n",
"dev_url": "https://github.com/numpy/numpy",
"doc_source_url": "https://github.com/numpy/numpy/tree/main/doc",
"doc_url": "https://numpy.org/doc/stable/reference/",
"home": "https://numpy.org/",
"license": "BSD-3-Clause",
"license_family": "BSD",
"license_file": "LICENSE.txt",
"license_url": "https://github.com/numpy/numpy/blob/main/LICENSE.txt",
"summary": "Array processing for numbers, strings, records, and objects."
},
We'll have to see about the pypi link, it has a "run" dependency on python at least. More ordinary packages like sqlalchemy just link to pypi for their source code.
We'll have to see about the pypi link, it has a "run" dependency on python at least. More ordinary packages like sqlalchemy just link to pypi for their source code.
Still, neither of these is reliable.
conda-forge bots maintain a PyPI mapping at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.json. The logic is defined in this module.
If a PyPI package is in conda-forge, it's contained in this mapping. Right now the logic is a bit too strict (it requires a PyPI source), but I am willing to extend this to do further checks if required. Would that be enough to identify Python modules (i.e. the package is already on PyPI) with sufficient accuracy? Thanks!
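A hedged sketch of consuming that mapping (the raw URL below is derived from the link above; the assumption that the JSON is keyed by conda-forge name with a pypi_name field per entry is taken from the YAML flavour of the mapping and may differ):

import json
import urllib.request

MAPPING_URL = ("https://raw.githubusercontent.com/regro/cf-graph-countyfair/"
               "master/mappings/pypi/grayskull_pypi_mapping.json")

with urllib.request.urlopen(MAPPING_URL) as resp:
    mapping = json.load(resp)

def pypi_name_for(conda_name):
    """Return the PyPI name for a conda-forge package, or None if unmapped."""
    entry = mapping.get(conda_name)
    return entry.get("pypi_name") if isinstance(entry, dict) else None

print(pypi_name_for("numpy"))   # expected: 'numpy' (if present in the mapping)
print(pypi_name_for("gcc"))     # expected: None - not a Python module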
conda is a popular package manager in science. It installs packages into the user's home directory, supports "virtual environments" and custom "channels" (akin to PPAs). There are a few major channels that can be considered distributions in their own right, mainly bioconda and conda-forge. Links: https://conda.io/docs/ https://conda-forge.org/ https://bioconda.github.io/
Channel packages are hosted at anaconda.org, which also offers an API for querying available packages. A repo dump can be obtained here:
https://conda.anaconda.org/{channel}/{arch}-{bits}/repodata.json
where {arch} is one of linux, osx and win, and {bits} is one of 32 and 64. (Not sure how much 32 is used; bioconda e.g. builds only linux-64 and osx-64).