thedatahub / Datahub

Datahub - A standards compliant metadata aggregator platform
GNU General Public License v3.0

[Records] A consistent content model across the APIs. #90

Open netsensei opened 5 years ago

netsensei commented 5 years ago

If we want to expand the number of supported content models in the Datahub, we need to design a consistent content model across all APIs.

Detailed description

Current situation

Version 1.0 of the Datahub is based on a simple content model:

A foundational principle of the application is that it needs to stay agnostic about the content model of the data. That is, the application doesn't do transformations of the stored data. Such transformations are by definition opinionated and context-bound. Supporting such transformations would also add hard-to-manage complexity to the application.

Instead, the application is designed to act as an endpoint that packages and publishes data behind an abstract interface, hiding the underlying layers in a larger network of distributed data repositories. Data processing should happen as information gets funnelled through ETL or ELT pipelines.
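As a rough illustration of where that processing belongs, the sketch below shows a downstream ETL step that harvests records from a Datahub instance and maps them outside of the application. The endpoint path, response shape and mapping are assumptions made for the example, not part of the current API.

```python
import requests

# Hypothetical endpoint of a Datahub instance; the real route may differ.
DATAHUB_URL = "https://datahub.example.org/api/v1/data"

def extract(url):
    """Pull records from the Datahub as-is. The Datahub only stores and
    publishes; it never transforms what it returns."""
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    return response.json()  # assumed to be a list of record dicts

def transform(record):
    """Opinionated, context-bound mapping. This logic deliberately lives
    in the pipeline, outside of the Datahub."""
    return {
        "id": record.get("identifier"),
        "title": (record.get("title") or "").strip(),
    }

def load(records, target):
    """Hand the mapped records to a downstream system (stubbed out here)."""
    for record in records:
        print(f"loading {record['id']} into {target}")

if __name__ == "__main__":
    load([transform(r) for r in extract(DATAHUB_URL)], target="aggregator-index")
```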

However, the current simple model has severe drawbacks:

Context

Hashing this out into a set of governance principles, an architectural design and a content model is core to the entire application.

Clearly defining what is and is not possible is important because (a) it determines the flexibility of the application to cater to as many business cases as possible and (b) it allows potential users to assess whether the application is a good fit for their needs.

Finally, this is important because a solid foundation is crucial for defining a clear long-term roadmap for adding new functionality and maintaining existing functionality.

Possible implementation

To be discussed in this issue.

netsensei commented 5 years ago

Principles

The Datahub is a simple data store and nothing more. The sole purpose of the application is storage and publication of data as network resources. There is no processing whatsoever of the data.

Here are a few basic guiding principles:

netsensei commented 5 years ago

Business cases

The Datahub should be able to cater to various business cases in a flexible way. Right now, there are several scenarios on the table:

Note how these cases only cover why organisations disseminate datasets. They don't define governance of the data, or the specific business goals behind consuming the data. Organisations may have underlying reasons to aggregate, or to disseminate mixed datasets. These cases are focussed purely on dissemination and aggregation itself.

We can also expand on the notion of "organisation". The first notion that comes to mind is a legal entity like a museum, a library or an archive (either public or private), or a non-profit organisation that handles data. But an "organisation" could be an ad hoc group of individuals as well, i.e. a collaborative community of (non)professional users of data (researchers, journalists, artists,...). From this perspective, the Datahub as a tool for disseminating metadata is used within a collaborative context with more or less defined / aligned goals.

Finally, a single individual could conceivably use the application as a personal store for various datasets. However, the dominant use case targeted by this project is (collaborative) dissemination of metadata records on the Web.

netsensei commented 5 years ago

Security

Managing data

Part of the API allows for data ingest and data management (not transformation!). Data managers and administrators are able to manage and add data. Unauthenticated users and consumers shouldn't be able to modify stored records.

Access grants are modelled in #76. Authentication is enforced through a user login system, with OAuth for API clients.
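As a sketch of what an authenticated ingest could look like from an API client's point of view, the snippet below assumes an OAuth2 client-credentials token endpoint and a record route; the actual grant type, routes and payload format belong to #76 and the API design, so every name here is a placeholder.

```python
import requests

BASE = "https://datahub.example.org"  # hypothetical instance

# Assumed token endpoint and grant type; the real OAuth flow may differ.
token = requests.post(f"{BASE}/oauth/v2/token", data={
    "grant_type": "client_credentials",
    "client_id": "my-client",
    "client_secret": "my-secret",
}).json()["access_token"]

# Only authenticated data managers and administrators may write records;
# unauthenticated users and consumers stay read-only.
record = "<lido:lido>...</lido:lido>"  # example payload, stored verbatim
response = requests.put(
    f"{BASE}/api/v1/data/example-record-id",  # illustrative route
    data=record,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/xml",
    },
)
response.raise_for_status()
```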

Retrieving data

A basic principle could be "The Datahub is geared towards open access to published records." That is, data stored internally will be made publicly available if the application is publicly accessible. No exceptions. However, restricted dissemination of sensitive information is a valid business case we should consider as part of the scope of the application.

Modelling fine-grained access control to records while keeping the flexibility to cater to a wide range of use cases is an extremely difficult exercise. Unless there's an actual hard demand for complex access rules, implementing them would be based on assumptions. Time and effort are, at this point, better invested in making sure the content model proper is consistent and sound.

Coarse access control could be implemented through the consumer role, though. The basic idea is "all or nothing": when authentication is required, a user is either granted access to all the records, or to none at all. In this case, access could be managed through the consumer role and OAuth. Or it could simply consist of a Basic Authentication configuration at the level of the HTTP server.
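To illustrate the "all or nothing" idea from a consumer's point of view: one credential gates the entire record store, and there is no per-record filtering. Both variants below use placeholder URLs and credentials.

```python
import requests

BASE = "https://datahub.example.org"  # hypothetical, access-restricted instance

# Variant 1: Basic Authentication configured on the HTTP server in front of
# the application. A single credential gates every published record.
resp_basic = requests.get(f"{BASE}/api/v1/data", auth=("consumer", "secret"))

# Variant 2: the consumer role with an OAuth bearer token. The token either
# grants access to all published records or to none of them.
resp_oauth = requests.get(
    f"{BASE}/api/v1/data",
    headers={"Authorization": "Bearer <consumer-token>"},
)

# In both variants a 401/403 means no access at all, while a 200 means access
# to everything that is published; there is no record-level access control.
print(resp_basic.status_code, resp_oauth.status_code)
```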

As a consequence, modelling fine-grained access control across multiple datasets provided by multiple organisations implies installing and managing multiple instances of the application. The number of instances, associated accounts and OAuth clients is governed by high-level access policies.