My view of the big picture - for discussion

Context

I would like to describe my view of the big picture of our extended data modelling effort, for a few reasons:

clarify which components I see forming part of the whole
explain how I understand the practical use of these components
give my thoughts on implementation details of these components

I aim to provide a means for others to validate their own understanding against, such that we can reason about these components and their purposes / uses / implementations, and also (dis)agree on the big picture level. These all play a role in what we decide to (individually) focus on next.

Firstly, why are we doing all of this?

We aim to allow description of a complete dataset (DataLad or not) using metadata, to such a level of detail that a DataLad dataset can be generated from it on-demand. This has several useful implications:

a dataset can be represented (publicly) using metadata
data publishers can use metadata to describe (i.e. pseudo-"create") a DataLad dataset without knowing or using any DataLad, and without having to support DataLad/git on their server or storage infrastructure or runtime environments
data consumers can gain DataLad-based access (and all the associated benefits) to datasets without a DataLad dataset existing on the other end

Secondly, why are we doing all of this using linked data

From my POV, there are just too many benefits:

many people / groups / communities have already done much of what we want to, there's no point in reinventing the wheel; rather build on it
there are existing ontologies and vocabularies that already describe the concepts involved in our field of work, i.e. dataset properties (DCAT), provenance (PROV), and many more; let's incorporate that
there are so many existing linked data standards (json-ld, rdf, shacl, etc) and loads of widely used open source tools that become available to us
the semantic web is ubiquitous online; by building on these standards and work we bring increased interoperability with the wider "data world" into the DataLad realm
(DataLad) datasets will be able to form part of any linked data graph database and can hence be queried using widely used web tools
All in all, increasing the FAIRness of data, while keeping the security of (personal) data safely in the hands of the original data controller

So what do we need (and want)?

Concept and structure (a.k.a. ontology and schema)

In order to describe a complete dataset using linked metadata, we need (as @mih has pointed out) concepts and structure:

concepts involve terms with unique and publicly resolvable definitions that can ideally build on (or be linked to) existing terms in the semantic web
structure involves a hierarchical description of how such terms fit together (and how they are constrained) to form a complete dataset description.

As a simple example, we can say that we prefer the general term keyword as defined by the DCAT ontology (the concept) and that we want a dataset to have such a keyword property that should take the form of a list of strings with a maximum of 10 elements (the structure). These components have taken the form of the concepts ontology and dataset-related schemas for which @mih has done extensive work using LinkML, and the result of which can be seen in https://github.com/psychoinformatics-de/datalad-concepts. We currently have schemas for both DataLad dataset components as well as a general dataset-version.

Let's say the schemas are in working order, what is then left to do? And what else might we want to do?

Dataset generation

This is our main goal, as defined internally in RFD0041. We have to create code that transforms metadata from a state that is valid according to our schema of a (datalad) dataset into an actual DataLad dataset. We want to be able to, for example, run datalad clone ... locally and point it to a location hosting a metadata-based dataset description, and then DataLad (-next) should do the rest so that we end up having a local DataLad dataset that we can run any DataLad command on.

My thoughts / comments:

are our schemas currently in a state where we can actually start prototyping this functionality?
if not, what do we still need wrt the two existing schemas to allow this?
part of generating the dataset on-demand would be to first validate the metadata being pointed to; against which schema should this be validated? what if it's valid according to the dataset-version schema but not the datalad-dataset-components schema, should our updated datalad clone code support translation? (see seperate point on translation below)

Automatic form generation

This was not an explicit part of the original goal behind a linked metadata descriptor of a dataset (which is metadata-based dataset generation), but has always been viewed as another important benefit that is enabled when a dataset is modeled with linked data concepts and given structure via a schema. From such a schema (in whichever format), one can use code to generate a (web-based) metadata entry form. This can be a powerful tool.

Why? Well, firstly, the alternative sheet-based entry (e.g. Excel or Google sheets), while ubiquitous, is error prone as it contains no automatic validation of the entered objects and properties of a dataset. On the other hand, a form generated from a schema has the possibility of embedding all constraints that are already built into the schema. Think: required fields, regex patterns for particular fields, min or max items of an array, data type validations. It's great if these can be validated on entry by the user, rather than having to check metadata entries by hand after collectionm, or writing round-about code that parses sheets and then does validation. It's no good having to go back to the user afterwards to ask them to fix their entry, likely not even possible.

Secondly, this form entry system can fit well into an automated pipeline for metadata collection and storage. For example: user enters metadata into the web-based form --> metadata fields are validated on entry --> user selects Submit --> complete metadata object is posted to some endpoint, where the next step in the pipeline can handle it further (with the end result being e.g. a catalog entry or a metadata graph database entry, or DataLad dataset generation).

I have spent some time exploring the space of automatic form generation from schemas, roughly in the following order:

Starting with jsonschema, given my existing familiarity with it because of its use in datalad-catalog. This mainly involved exploration of JSON-editor (with our edits at https://github.com/psychoinformatics-de/json-editor) because of how well supported and widely used it is.
Followed by a write-up of my then-understanding of the problem-space of designing apps for schema-based automatic form generation: https://github.com/psychoinformatics-de/org/issues/281
Followed by an exploration of shacl, and the DASH extension for forms and UI components. For details, see: https://github.com/psychoinformatics-de/datalad-concepts/issues/113

My thoughts / comments:

whereas jsonschema is more familiar and perhaps more entry-level friendly, shacl is more intergated with the semantic web. Compare the fact that there is no inherent way for jsonschema to support json-ld (or no "json-ld schema"), while shacl is authored using linked data standards (prefixes, curies, triples, etc).
shacl is a more recent standard, and I could not find many widely used and supported open source tools for form generation from shacl shapes. The best fit was shaperone, which I have yet to explore in more detail. This points to the (IMO likely) possibility that we'll have to build such a tool ourselves, if we opt for shacl.
taking a quick step back, from which schema would a given form be generated? Take this example of a form created from jsonschema that was generated from our LinkML-based dataset-version schema. This form is overly complex and uses fields and descriptions that would confuse any user, while the whole point of a web-based form is for it to be as intuitive and easy as possible. Does this mean we should have a separate schema for form generation, that is somehow linked to (or inherits from) the general dataset-version schema. One that would allow entered data to be translated easily into the desired schema structure afterwards? (see separate point on translation below)
an important technical aspect that needs to be addressed if we continue using LinkML to generate e.g. shacl from the original schema format, is that any annotations (but specifically those providing structural info for form-generators or other user interfaces) should flow through to the output format during schema generation. Currently they mostly don't: https://github.com/linkml/linkml/issues/1618

Catalog rendering

The current state of datalad-catalog is steady. We have a couple of catalogs in production. Some new features are added to the package and the production instances fairly regularly. We have built up good experience with the catalog generation process, catalog schemas, rendering components with VueJS, URL routing, and the general catalog maintenance process. The new goal would be update all of this to account for our focus on linked data technologies.

We have a couple of open issues that started diving into this challenge, although not yet with any significant insights or achievements:

My thoughts / comments:

after exploring shacl and DASH in the context of form generation, I realised that the challenge with a catalog rendering is quite similar - it is basically viewing (catalog) versus entering/editing (form) the same metadata with the same structure. This means that, conceptually, a shacl export of (for example) the dataset-version schema, together with valid data, could just as well be used for catalog rendering as for form generation. There would be a difference, though, if we explicitly opt for different but related schemas for the purpose of making data entry easier (refer to the point above about the complexity of the example jsonschema form). We could even decide to have a different schema altogether for the catalog, compared to the data entry schema, compared to the dataset-version schema
this leads to the next point: what should the schema for a catalog look like? Should it even have a dedicated schema? Or should it rather be as general as possible and focus on component-based rendering of very general dataset (and related) concepts? An idea (that I like) would be to tag onto the DCAT3 vocabulary for describing datasets (just as @mih did with our existing ontology concepts) and build the general structure of a dataset in a catalog according to a DCAT dataset.
what about rendering? My feeling is that we should support a high-level layout of the idea of a dataset, and then focus on "component-based renderers" for detailed parts.
- With "high-level layout" I mean things that are usually taken as a given, e.g. that the title of a dataset is usually on top and shown with large font-size, or that the way in which authors are displayed in a list. These high-level components could be taken directly from DCAT, or from out dataset-version schema (inheriting from DCAT).
- Another useful source of specification for "high-level layout" can come from the shacl properties sh:group and sh:order, which one can initially provide as annotations in a catalog (or form) schema. This allows the person building the schema to specify which groups should be displayed in a layout, which properties should form part of which groups, and in which order the properties internal to a group (and groups relative to each other) should be displayed. This points again to the usefulness of using shacl for both form generation and catalog entry rendering.
- With "component-based renderers" I mean that we should build modular components for rendering specific objects or data types, and that allows for extensibility. E.g. a VueJS (or vanilla javascript) component for displaying a prov:Agent or a schema:Person or a datetime variable or a generic relation that a dataset to an Entity, defined by a specific predicate.
In this way, the catalog can support some type of matching of a given metadata field (or triple) with a specific rendering component from a pool of rendering components. If it doesn't find a match, there can be a generic renderer component, and there could be a config that specifies whether to select the default generic rendering or rather to display nothing at all.

The polyglot: translation from anything to anything

Some points above have referred to this translation section, and other unmentioned points also provide context:

If we want to datalad clone from metadata files (i.e. on-demand dataset generation) what schema should the metadata be validated against? What if it's valid according to the dataset-version schema but not the datalad-dataset-components schema, should we first translate? Or should we accept any linked metadata and try to see if we can find relevant/required fields for dataset generation somewhere in the whole metadata object?
Consider the process of metadata entry into a form, and curating this collected metadata into something that can be represented in a catalog, and from which a DataLad dataset can be generated: we can see that all steps in the workflow are related to the same entity (the data being described), but each component (i.e. form, dataset generation, catalog entry) might actualy have a separate schema. We need to support seamless translation between these schemas, otherwise the workflow wont flow.
What if we want to represent datasets already described with common standards in our own context (either as something that can be generated into a DataLad dataset, or represented in a catalog, or whatever), we need translations from those standards into our own. E.g. datacite, dataverse, any of the multitudes online.

All of these points suggest that we should have some form of translating between metadata instances / schemas.

My thoughts:

translation between everything suggests that we need to find a common language. My understanding is that RDF is that common language. The essence of all our work in this realm of linked data is to be able to describe data firstly with resolvable and standard terms, and secondly using the triple format of object--predicate--subject. And LinkML supports converting our data to RDF.
Once our data, different data objects being valid against different schemas, are in RDF format, the concept of "semantic reasoning" can come into play. Reasoning is known in the world of linked data as "the ability of a system to infer new facts from existing data based on inference rules or ontologies", see for example:
- https://en.wikipedia.org/wiki/Semantic_reasoner
- https://rubenverborgh.github.io/Semantic-Web-Reasoning/#cover
I am mostly clueless about how this will work in practice for our purposes, but the part that I feel is important to mention is the idea of translating between properties or relationships that are seen as "symmetric" (see https://www.w3.org/TR/owl-ref/#SymmetricProperty-def). If datacite names a property title and our schema names a property name, and we decide that these are equal for translation purposes, we should be able to describe this symmetry, run the datacite metadata through the quote-unquote "reasoner code", and the metadata that comes out should have the value of the original datacite:title field under the our-schema:name field. This would of course also need to work for much more complex symmetries that involve related properties that are multiple levels removed from the dataset object.
We have previously looked into json-ld and framing and my limited experience and memory seem to suggest that it could not handle complex translations, the way that I hope reasoning might. I could very well be wrong about this because I did not get a complete grip of framing. Is this worth revisiting?

Bringing things together

OK, this has been many thoughts, and not yet a very concrete layout of what I think is important to do next. I am still unsure, and input from others will invariably impact this. But a few things are starting to feel more concrete and likely for me, in terms of TODOs:

get a better grasp of translating between schemas using RDF and reasoning: how can we build this component, and can it do what we need it to do?
get a concrete grip on the differences and similarities between schemas for dataset generation, form generation, and catalog rendering. Are they the same or different? Does one just inherit from the other by adding a superficial layout on top? Or would they need to be substaintially different, with full translation supported?
explore the detail of shacl-dash-based 1) annotations, 2) schema generation, 3) form generation, 4) catalog rendering
Redesign a catalog rendering system built on a pool of matchable rendering components, possible centered around DCAT.

I have made zero comments about how all of this affects our current commitments, deadlines, timelines.

psychoinformatics-de / datalad-concepts