thedatahub / Datahub

Datahub - A standards compliant metadata aggregator platform
GNU General Public License v3.0

[Records] A consistent content model across the APIs. #90

Open netsensei opened 5 years ago

netsensei commented 5 years ago

If we want to expand the number of supported content models in the Datahub, we need to design a consistent content model across all APIs.

Detailed description

Current situation

Version 1.0 of the Datahub is based on a simple content model:

A foundational principle of the application is that it needs to stay agnostic about the content model of the data. That is, the application doesn't do transformations of the stored data. Such transformations are by definition opinionated and context-bound. Supporting such transformations would also add hard-to-manage complexity to the application.

Instead, the application is designed to act as an endpoint that packages and publishes data behind an abstract interface, hiding the underlying layers in a larger network of distributed data repositories. Data processing should happen as information gets funnelled through ETL or ELT pipelines.
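As a rough illustration of where that processing belongs, the sketch below shows a downstream ETL step that harvests records from a Datahub instance and maps them outside of the application. The endpoint path, response shape and mapping are assumptions made for the example, not part of the current API.

```python
import requests

# Hypothetical endpoint of a Datahub instance; the real route may differ.
DATAHUB_URL = "https://datahub.example.org/api/v1/data"

def extract(url):
    """Pull records from the Datahub as-is. The Datahub only stores and
    publishes; it never transforms what it returns."""
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    return response.json()  # assumed to be a list of record dicts

def transform(record):
    """Opinionated, context-bound mapping. This logic deliberately lives
    in the pipeline, outside of the Datahub."""
    return {
        "id": record.get("identifier"),
        "title": (record.get("title") or "").strip(),
    }

def load(records, target):
    """Hand the mapped records to a downstream system (stubbed out here)."""
    for record in records:
        print(f"loading {record['id']} into {target}")

if __name__ == "__main__":
    load([transform(r) for r in extract(DATAHUB_URL)], target="aggregator-index")
```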

However, the current simple model has severe drawbacks:

Context

Hashing this out into a set of governance principles, an architectural design and a content model is core to the entire application.

Clearly defining what is and is not possible is important because (a) it determines the flexibility of the application to cater to as many business cases as possible and (b) it allows potential users to assess whether the application is a good fit for their needs.

Finally, this is important because a solid foundation is crucial for defining a clear long-term roadmap for adding new functionality and maintaining existing functionality.

Possible implementation

To be discussed in this issue.

netsensei commented 5 years ago

Principles

The Datahub is a simple data store and nothing more. The sole purpose of the application is storage and publication of data as network resources. There is no processing whatsoever of the data.

Here are a few basic guiding principles:

netsensei commented 5 years ago

Business cases

The Datahub should be able to cater to various business cases in a flexible way. Right now, there are several scenarios on the table:

Note how these cases only cover why organisations disseminate datasets. They don't define governance of the data, or the specific business goals behind consuming the data. Organisations may have underlying reasons to aggregate, or to disseminate mixed datasets. These cases are focussed purely on dissemination and aggregation itself.

We can also expand on the notion of "organisation". The first notion that comes to mind is a legal entity like a museum, a library or an archive (either public or private), or a non-profit organisation that handles data. But an "organisation" could be an ad hoc group of individuals as well, i.e. a collaborative community of (non)professional users of data (researchers, journalists, artists,...). From this perspective, the Datahub as a tool for disseminating metadata is used within a collaborative context with more or less defined / aligned goals.

Finally, a single individual could conceivably use the application as a personal store for various datasets. However, the dominant use case targeted by this project is (collaborative) dissemination of metadata records on the Web.

netsensei commented 5 years ago

Security

Managing data

Part of the API allows for data ingest and data management (not transformation!). Data managers and administrators are able to manage and add data. Unauthenticated users and consumers shouldn't be able to modify stored records.

Access grants are modelled in #76. Authentication is enforced through a user login system, with OAuth for API clients.
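As a sketch of what an authenticated ingest could look like from an API client's point of view, the snippet below assumes an OAuth2 client-credentials token endpoint and a record route; the actual grant type, routes and payload format belong to #76 and the API design, so every name here is a placeholder.

```python
import requests

BASE = "https://datahub.example.org"  # hypothetical instance

# Assumed token endpoint and grant type; the real OAuth flow may differ.
token = requests.post(f"{BASE}/oauth/v2/token", data={
    "grant_type": "client_credentials",
    "client_id": "my-client",
    "client_secret": "my-secret",
}).json()["access_token"]

# Only authenticated data managers and administrators may write records;
# unauthenticated users and consumers stay read-only.
record = "<lido:lido>...</lido:lido>"  # example payload, stored verbatim
response = requests.put(
    f"{BASE}/api/v1/data/example-record-id",  # illustrative route
    data=record,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/xml",
    },
)
response.raise_for_status()
```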

Retrieving data

A basic principle could be "The Datahub is geared towards open access to published records." That is, data stored internally will be made publicly available if the application is publicly accessible. No exceptions. However, restricted dissemination of sensitive information is a valid business case we should consider as part of the scope of the application.

Modelling fine-grained access control to records while keeping the flexibility to cater to a wide range of use cases is an extremely difficult exercise. Unless there's an actual hard demand for complex access rules, implementing them would be based on assumptions. Time and effort are, at this point, better invested in making sure the content model proper is consistent and sound.

Coarse access control could be implemented through the consumer role, though. The basic idea is "all or nothing": when authentication is required, a user is either granted access to all the records, or to none at all. In this case, access could be managed through the consumer role and OAuth. Or it could simply consist of a Basic Authentication configuration at the level of the HTTP server.
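To illustrate the "all or nothing" idea from a consumer's point of view: one credential gates the entire record store, and there is no per-record filtering. Both variants below use placeholder URLs and credentials.

```python
import requests

BASE = "https://datahub.example.org"  # hypothetical, access-restricted instance

# Variant 1: Basic Authentication configured on the HTTP server in front of
# the application. A single credential gates every published record.
resp_basic = requests.get(f"{BASE}/api/v1/data", auth=("consumer", "secret"))

# Variant 2: the consumer role with an OAuth bearer token. The token either
# grants access to all published records or to none of them.
resp_oauth = requests.get(
    f"{BASE}/api/v1/data",
    headers={"Authorization": "Bearer <consumer-token>"},
)

# In both variants a 401/403 means no access at all, while a 200 means access
# to everything that is published; there is no record-level access control.
print(resp_basic.status_code, resp_oauth.status_code)
```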

As a consequence, modelling fine-grained access control across multiple datasets provided by multiple organisations implies installing and managing multiple instances of the application. The number of instances, associated accounts and OAuth clients is governed by high-level access policies.