Website: https://data.sfb1451.de
* Repository overview
The catalog is placed in the =docs= directory (to allow other repository content at the root level) and served to GitHub Pages from there.
Related repositories:
- SFB superdataset: [[https://github.com/sfb1451/all-projects][sfb1451/all-projects]]
- Utilities for handling tabby files: [[https://github.com/sfb1451/tabby-utils][sfb1451/tabby-utils]]
- Custom extractors & translators: [[https://github.com/mslw/datalad-wackyextra][mslw/datalad-wackyextra]]
* Required extensions
The following DataLad extensions are required to generate the catalog:
- [[https://github.com/datalad/datalad-next][DataLad-next]] is used to interact with WebDAV remotes,
- [[https://github.com/datalad/datalad-metalad][DataLad-metalad]] provides metadata extraction,
- [[https://github.com/mslw/datalad-wackyextra][DataLad-wackyextra]] provides custom metadata extractors and translators,
- [[https://github.com/datalad/datalad-catalog][DataLad-catalog]] is used to generate the catalog.
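A minimal installation sketch, assuming a pip-based environment (the first three packages are published on PyPI; datalad-wackyextra is assumed to be installed directly from its GitHub repository):

#+begin_src sh
# Sketch only: install the required extensions into the active environment.
# datalad-next, datalad-metalad, and datalad-catalog come from PyPI;
# datalad-wackyextra is assumed to be pip-installable from GitHub.
pip install datalad-next datalad-metalad datalad-catalog
pip install git+https://github.com/mslw/datalad-wackyextra.git
#+end_src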
* Dataset locations
Project datasets are tracked in the =Projects= superdataset. GitHub mirror: https://github.com/sfb1451/all-projects
** Obtaining a subdataset from Sciebo:
datalad clone -d . webdavs://
(use project names as folder names, as these will be displayed as subdataset names).
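A fuller invocation might look as follows; the WebDAV host, share path, and project name below are hypothetical placeholders, not the actual Sciebo locations:

#+begin_src sh
# Sketch only: clone a project subdataset into the superdataset,
# naming the folder after the project. All bracketed values are
# hypothetical placeholders.
datalad clone -d . "webdavs://<sciebo-host>/<share-path>/<project-name>" <project-name>
#+end_src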
Tip: until [[https://github.com/datalad/datalad-next/issues/108][datalad-next/issues/108]] makes it automatic, the storage remote can be reconfigured with:
git annex initremote my-sciebo-storage --private --sameas
or with clone URL substitution; check:
datalad configuration | grep 'datalad.clone.url-substitute'
for examples of how to alter all such URLs at once.
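As a hedged sketch of the substitution mechanism: a =datalad.clone.url-substitute.<label>= value starts with a delimiter character, followed by a Python-regex match pattern and a replacement. The label and host names below are hypothetical:

#+begin_src sh
# Sketch only: rewrite Sciebo WebDAV URLs at clone time.
# "sciebo" is a hypothetical label; both host names are placeholders.
# The leading comma is the delimiter separating pattern and replacement.
git config --global datalad.clone.url-substitute.sciebo \
    ',^webdavs://old-host/,webdavs://new-host/'
#+end_src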
* Catalog generation
Generation is done incrementally. Utility scripts are provided in the =code= directory; one script usually does one thing.
Scripts have a similar CLI: they need to be pointed to a dataset, a folder to store intermediate metadata, and (optionally) a catalog they will update. See the command line help or source code for usage instructions.
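To illustrate this CLI shape only, a hypothetical invocation might look as follows; the option names are assumptions, not the scripts' actual flags (consult each script's =--help=):

#+begin_src sh
# Sketch only: the common shape of the utility scripts' CLI.
# All option names here are hypothetical; check the script's --help.
python code/extract_selected.py \
    --dataset path/to/project \
    --metadata-dir path/to/intermediate-metadata \
    --catalog path/to/catalog
#+end_src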
Key scripts are:
- =extract_selected.py=: extracts metadata from a project's subdataset. It allows selecting which extractors to use, and can optionally generate a file list, too.
- =extract_project.py=: extracts metadata from a project's superdataset. Runs =project_extract_pipeline.json= and translates the result. The pipeline combines the metalad_core, CFF, and bibliography (ris/nbib/crossref) extractors. It also adds handcrafted metadata, such as funding or keywords.
- =extract_superdataset=: extracts superdataset metadata. Runs the =metalad_core= and =metalad_studyminimeta= extractors.
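For orientation, here is a minimal sketch of the kind of underlying commands these scripts wrap, assuming the metalad and datalad-catalog CLIs; this is not the scripts' exact behavior, and the paths are placeholders:

#+begin_src sh
# Sketch only: extract core metadata from a dataset, then add it to
# the catalog. The scripts above add extractor selection, translation,
# and handcrafted metadata on top of steps like these.
datalad meta-extract -d path/to/project metalad_core > meta/project_core.json
datalad catalog-add -c path/to/catalog -m meta/project_core.json
#+end_src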
Additional scripts are:
- =utils.py=: shared functions or classes.
- =list_files.py=: generates a file listing based on =datalad status=. Deprecated; use =extract_selected.py --files= instead.
- =scrape_projects.py=: archived script that was used to scrape project descriptions.
- =scrape_publications.py=: archived script that was used to scrape project publications.