Explore and implement multi-level metadata extraction and aggregation workflow

jsheunis commented 2 years ago

The goal is to represent the resulting metadata in something like a catalog.

Catalog entrypoint

A relevant question is where the entrypoint for such a catalog would/could be. This has been discussed in #11. See specifically the comment: https://github.com/psychoinformatics-de/datalad-debian/issues/11#issuecomment-1179742489. This suggests that the archive dataset should be the catalog entrypoint, although the workflow that generates the metadata does not necessarily have to start there.

Workflow entrypoint

Let's say we assume a recursive super-sub-dataset hierarchy as follows:

archive
.
├── distribution
│   ├── builder
│   └── package
│       └── builder

(Note that "For technical/legal reasons, this [archive] dataset may have some components organized in subdatasets (e.g., non-free)", as explained in this comment, but this is ignored in the short term.)

So, for metadata extraction, where does metadata related to any particular type of dataset come from?

And can this information be extracted using a dataset- or file-level extractor with metalad? If we start from the bottom (kind of):

For package metadata, we have an issue (#30) for building a package metadata extractor. This is in the works in this fork+branch.
For builder metadata, there's a recent issue (#92) to create a builder metadata extractor (mainly from the singularity recipe). For the catalog specifically, this only has to be extracted once (per dataset version) and the catalog will take care of representing it as a linked subdataset of both the distribution and package datasets.
For distribution metadata, the source of metadata is an open question. Technically, this can also come from the extracted metadata of the builder, since there is a one-to-one relationship between a builder and a distribution. This means we do not necessarily need a distribution-dataset extractor, and could use some sort of adapted aggregation/extraction process to have this info on the distribution-dataset level.
For archive metadata, what needs to be represented here?

(PS: it is assumed that the metalad_core extractor will be run on the dataset and file level for all datasets in this hierarchy, in order to be able to represent dependent dataset linkage as well as file trees.)

Taking these levels of metadata into account, it could be straightforward to run a workflow that traverses the hierarchy in a top-down direction and extracts the relevant dataset- and file-level metadata at each level.
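To make the top-down idea concrete, here is a minimal sketch of the traversal order only. It does not call datalad or metalad; `DatasetNode`, `EXTRACTORS`, `plan_extraction` and the two `hypothetical_*` extractor names are assumptions made up for illustration (only `metalad_core` is named in this issue).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class DatasetNode:
    """Hypothetical stand-in for one dataset in the hierarchy."""
    path: str
    kind: str  # "archive", "distribution", "package" or "builder"
    subdatasets: list[DatasetNode] = field(default_factory=list)


# Assumed mapping from dataset kind to extractor names. The "hypothetical_*"
# names are placeholders for the extractors discussed in #30 and #92.
EXTRACTORS = {
    "archive": ["metalad_core"],
    "distribution": ["metalad_core"],
    "package": ["metalad_core", "hypothetical_debian_package"],
    "builder": ["metalad_core", "hypothetical_debian_builder"],
}


def plan_extraction(node: DatasetNode) -> Iterator[tuple[str, str]]:
    """Yield (dataset path, extractor name) pairs, parents before children."""
    for extractor in EXTRACTORS[node.kind]:
        yield node.path, extractor
    for sub in node.subdatasets:
        yield from plan_extraction(sub)


# The hierarchy from the tree above, written out by hand.
hierarchy = DatasetNode("archive", "archive", [
    DatasetNode("archive/distribution", "distribution", [
        DatasetNode("archive/distribution/builder", "builder"),
        DatasetNode("archive/distribution/package", "package", [
            DatasetNode("archive/distribution/package/builder", "builder"),
        ]),
    ]),
])

plan = list(plan_extraction(hierarchy))
for path, extractor in plan:
    print(f"{path}: {extractor}")
```

In a real workflow each planned pair would be handed to the actual extraction call, and builder datasets that appear more than once could be deduplicated by dataset ID and version so they are only extracted once.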

Examples of relevant WIP implementations or related issues:

FAIRly big catalog workflow (this also includes metadata translation to the catalog schema - additional translators would have to be implemented for debian-related metadata)
datalad-catalog issue to create a python workflow for top-down dataset hierarchy metadata extraction: https://github.com/datalad/datalad-catalog/issues/91

Open questions

Are there any other sources of metadata on the distribution and archive level that aren't mentioned here?
Is the important provenance information related to package builds contained within the package-level datalad dataset? Asking since this would be useful metadata to extract as well (using runprov extractor?) and to represent in a catalog.
Where should metadata be aggregated to?
Where should metadata be added and stored, either purely for storage purposes or for the purpose of generating a catalog?

mih commented 2 years ago

The /archive/www subdataset contains the regular Debian package dist/pool data structure. Among other things, it has the full list of included packages and their versions. See for example https://neuro.debian.net/debian/dists/bullseye/main/binary-amd64/Packages
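For reference, a "Packages" file is a sequence of stanzas separated by blank lines, each made of `Field: value` lines, with continuation lines starting with whitespace. A minimal sketch of reading package names and versions from such text could look as follows. `parse_packages` is a hypothetical helper, and the sample stanzas are made up, not taken from the linked file.

```python
def parse_packages(text: str) -> list[dict[str, str]]:
    """Parse Debian control-file text into one dict per package stanza."""
    records = []
    current: dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0] in " \t":
            # Continuation line: append to the previous field's value.
            if last_key is not None:
                current[last_key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


# Made-up sample in the same layout as a "Packages" file.
SAMPLE = """\
Package: hello-a
Version: 1.0-1
Architecture: amd64
Description: first made-up package
 Extended description line.

Package: hello-b
Version: 2.3-1~nd1
Architecture: amd64
Description: second made-up package
"""

packages = parse_packages(SAMPLE)
print([(p["Package"], p["Version"]) for p in packages])
```

A dedicated control-file parser would probably be preferable for a real extractor; this only shows what information is available per package.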

jsheunis commented 2 years ago

So, to summarise my understanding, we'd be able to obtain multi-level metadata from different sources, including the package datalad datasets themselves as well as a distribution-level "Packages" file.

This suggests that it might be useful to have an extractor for the "Packages" file, but raises the question of what the resulting metadata will look like and how/where it will exist, since it references multiple packages that might already be linked as subdatasets of the relevant distribution. I.e., should we generate separate package-specific metadata items from such a "Packages" file, and should these items reference the specific datalad_id of the package datalad dataset that they relate to?

Or should all of the info extracted from a "Packages" file just remain distribution dataset level metadata?
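To illustrate the two options, here is a rough sketch of both output shapes, starting from already-parsed "Packages" records. Everything here is an assumption for discussion: the function names, the field names (`type`, `dataset_id`, `packages`, ...) and the IDs are hypothetical and do not follow the catalog schema or any existing extractor output.

```python
from __future__ import annotations


def per_package_items(records: list[dict[str, str]],
                      dataset_ids: dict[str, str]) -> list[dict]:
    """Option 1: one metadata item per package, linked to its dataset if known."""
    return [
        {
            "type": "dataset",
            "name": rec["Package"],
            "version": rec["Version"],
            # None when no package dataset is registered for this name.
            "dataset_id": dataset_ids.get(rec["Package"]),
        }
        for rec in records
    ]


def distribution_item(records: list[dict[str, str]],
                      distribution_id: str) -> dict:
    """Option 2: a single distribution-level item listing all packages."""
    return {
        "type": "dataset",
        "dataset_id": distribution_id,
        "packages": [
            {"name": rec["Package"], "version": rec["Version"]}
            for rec in records
        ],
    }


# Made-up records and IDs, for illustration only.
records = [
    {"Package": "hello-a", "Version": "1.0-1"},
    {"Package": "hello-b", "Version": "2.3-1~nd1"},
]
known_ids = {"hello-a": "hypothetical-id-0001"}

items = per_package_items(records, known_ids)
dist = distribution_item(records, "hypothetical-id-dist")
print(items)
print(dist)
```

The first shape needs a reliable lookup from package name (and probably version) to the package dataset, and the second keeps everything on the distribution dataset and leaves linkage to the catalog.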
