In the `/archive/www` subdataset is the regular Debian package `dists`/`pool` data structure. Among other things, it has the full list of included packages and their versions; see for example https://neuro.debian.net/debian/dists/bullseye/main/binary-amd64/Packages
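For reference, a "Packages" file consists of one RFC822-style stanza per binary package; an abridged example (field values below are illustrative, not copied from the linked file):

```
Package: afni
Version: 21.1.07+dfsg.1-1~nd110+1
Architecture: amd64
Depends: libc6 (>= 2.29), libgsl25 (>= 2.5), ...
Filename: pool/main/a/afni/afni_21.1.07+dfsg.1-1~nd110+1_amd64.deb
SHA256: <checksum>
Description: toolkit for analyzing and visualizing functional MRI data
```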
So, to summarise my understanding: we'd be able to find multi-level metadata from different sources, including the `package` datalad datasets themselves as well as from a distribution-level "Packages" file.
This suggests that it might be useful to have an extractor for the "Packages" file, but it raises the question of what the resulting metadata will look like and how/where it will exist, since the file references multiple packages that might already be linked as subdatasets of the relevant `distribution`. I.e., should we generate separate package-specific metadata items from such a "Packages" file, and should these items reference the specific `datalad_id` of the `package` datalad dataset that they relate to? Or should all of the info extracted from a "Packages" file just remain `distribution` dataset-level metadata? Either way, the aim is to represent it in something like a catalog.
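If we go the package-specific route, each metadata item could look roughly like the sketch below: a hypothetical `debian_packages` extractor emitting one record per stanza, loosely modeled on metalad's metadata-record layout, with `dataset_id` pointing at the corresponding `package` dataset rather than at the `distribution` dataset that hosts the "Packages" file:

```json
{
  "type": "dataset",
  "dataset_id": "<datalad_id of the afni package dataset>",
  "dataset_version": "<commit of that dataset>",
  "extractor_name": "debian_packages",
  "extracted_metadata": {
    "Package": "afni",
    "Version": "21.1.07+dfsg.1-1~nd110+1",
    "Architecture": "amd64"
  }
}
```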
### Catalog entrypoint
A relevant question is where the entrypoint for such a catalog would/could be. This has been discussed in #11; see specifically this comment: https://github.com/psychoinformatics-de/datalad-debian/issues/11#issuecomment-1179742489. It suggests the `archive` dataset as the entrypoint for a catalog, although this does not necessarily have to be the case for the workflow that generates the metadata.

### Workflow entrypoint
Let's say we assume a recursive super-sub-dataset hierarchy as follows (note the caveat explained in this comment, which is ignored in the short term):
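For concreteness, one plausible arrangement (inferred from the datasets mentioned in this issue, so treat the exact nesting as an assumption):

```
archive                  # suggested catalog entrypoint (see #11)
├── www                  # subdataset with the dists/pool tree, incl. "Packages" files
└── distribution         # e.g. bullseye
    └── package          # one datalad dataset per Debian package
```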
So, for metadata extraction, where does the metadata related to any particular type of dataset come from? And can this information be extracted using a dataset- or file-level extractor with metalad? If we start from the bottom (kind of): the `distribution` and `package` datasets. (PS: it is assumed that the `metalad_core` extractor will be run on the dataset and file level for all datasets in this hierarchy, in order to be able to represent dependent-dataset linkage as well as file trees.)

Taking these levels of metadata into account, it could be straightforward to run a workflow that traverses the hierarchy in a top-down direction and extracts relevant dataset- and file-level metadata at each level; see the sketch below.
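A minimal sketch of such a top-down traversal, assuming `datalad-metalad` is installed and the superdataset lives at `./archive` (the paths and the plain shell loop are illustrative; metalad's own pipelining via `meta-conduct` may be the better fit):

```sh
#!/bin/sh
# Dataset-level extraction on the superdataset itself
datalad meta-extract -d archive metalad_core

# Walk all subdatasets (distribution, then package) and repeat;
# `datalad -f '{path}' subdatasets -r` prints one subdataset path per line
for ds in $(datalad -f '{path}' subdatasets -d archive --recursive); do
    datalad meta-extract -d "$ds" metalad_core
    # file-level extraction takes a file path as an extra argument, e.g.:
    # datalad meta-extract -d "$ds" metalad_core debian/control
done
```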
Examples of relevant WIP implementations or related issues:
### Open questions

- How to extract provenance-related metadata (with the `runprov` extractor?) and how to represent it in a catalog.
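(If `runprov` here refers to the `metalad_runprov` extractor shipped with datalad-metalad, a dataset-level invocation would presumably look like the following; the dataset path is illustrative:)

```sh
# extract run-provenance records from the dataset's `datalad run` history
datalad meta-extract -d path/to/package-dataset metalad_runprov
```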