libero / publisher

The starting point for raising issues for Libero Publisher
MIT License

Persist articles in a database #292

Closed thewilkybarkid closed 4 years ago

thewilkybarkid commented 5 years ago

Store articles in a database rather than in memory.

Tasks

Related

thewilkybarkid commented 4 years ago

As it stands we're using JSON-LD everywhere (the theory being that developers are already familiar with JSON, so it lessens the leap to RDF). This includes how we'll persist it: Postgres, with something like a table containing the local ID (probably a UUID) and the JSON-LD itself in a jsonb column. We won't need much querying/filtering, but it feels like this approach will make it hard to do so if we need to. It also means we're not aligned with the RDF/JS group, so we wouldn't be able to take advantage of the various libraries around that (related: I did try out rdflib.js a little while ago but hit https://github.com/linkeddata/rdflib.js/issues/364).
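A minimal sketch of that shape, assuming a single table keyed by a local UUID (the table, column, and function names here are hypothetical, and a `Map` stands in for Postgres):

```typescript
// Hypothetical sketch of the jsonb approach: one row per article, keyed by
// a local UUID, with the JSON-LD document stored whole.
import { randomUUID } from 'node:crypto';

// The table this stands in for might look like:
//   CREATE TABLE articles (uuid uuid PRIMARY KEY, article jsonb NOT NULL);
type JsonLd = Record<string, unknown>;

// In-memory stand-in for the Postgres table, used only to show the shape.
const articles = new Map<string, JsonLd>();

const persist = (article: JsonLd): string => {
  const uuid = randomUUID();
  articles.set(uuid, article);
  return uuid;
};

const id = persist({
  '@type': 'http://schema.org/Article',
  'http://schema.org/name': [{ '@value': 'An example article' }],
});
```

The trade-off the comment describes shows up here: reads and writes by ID are trivial, but anything beyond that means querying inside the jsonb blob.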

A native RDF approach would be to use a graph database with SPARQL query+update (e.g. Amazon Neptune, Apache Jena). The big concern here is a lack of local expertise.

Performance is also uncertain: we're looking at representing document content in RDF using https://github.com/stencila/schema, which could result in large, deep graphs for an article with a lot of content. Related to this is how to provide mechanisms to update the content (a question that exists for plain JSON-LD too).

thewilkybarkid commented 4 years ago

https://github.com/zazuko/hydra-box is an interesting idea, and touches on connecting the Hydra API documentation to its implementation, but I'm not sure it really matches what we need. It is an example of using SPARQL, however.

thewilkybarkid commented 4 years ago

@stephenwf, do you have experience with this or know anyone that does?

giorgiosironi commented 4 years ago

Waiting on the sidelines to see some service provider feedback on this. I can see system administrators being opinionated about running databases.

stephenwf commented 4 years ago

@mattmcgrattan, I believe, may be able to help with regard to SPARQL. I think, though, it's important to distinguish between using RDF as a document representation with simple filtering capabilities (something that Postgres + a JSON field will handle with ease) and an entity-based graph.

Matt can correct me if I'm wrong, but personally I think that a full SPARQL implementation may end up using a lot of memory, holding the raw triples in order to perform queries, and that we'd end up under-utilising it.

A good abstraction over the underlying DB will keep options open for swapping things out if a more data-driven model is needed. I don't think the behaviour of the article-store would change in this case; there would just be another service using the same DB to provide other functionality.

thewilkybarkid commented 4 years ago

In case it's useful, you can try one of the examples on https://hub.stenci.la/open/ and download it as Stencila's JSON-LD, which is probably pretty similar to what we'll be dealing with (though we'll probably have longer content and more metadata too).

thewilkybarkid commented 4 years ago

I wonder whether using RDF/JS (i.e. statements/named graphs) in code, with a DB abstraction that turns it into JSON-LD and stores it in Postgres, is a decent compromise.

Working with plain JSON-LD isn’t particularly fun due to all the possible forms.
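A sketch of what that compromise could look like (all names hypothetical, and the RDF/JS interfaces reduced to a minimum): code only ever deals in quads, and the store boundary is where conversion to JSON-LD and Postgres would happen.

```typescript
// RDF/JS-style terms and quads, reduced to the minimum needed here.
type Term = { termType: 'NamedNode' | 'BlankNode' | 'Literal'; value: string };
type Quad = { subject: Term; predicate: Term; object: Term };

// The store boundary: callers only ever see quads. A Postgres-backed
// implementation would serialise to JSON-LD on put and parse it on get.
interface ArticleStore {
  put(id: string, quads: Quad[]): void;
  get(id: string): Quad[] | undefined;
}

// In-memory stand-in for the Postgres-backed implementation.
class InMemoryArticleStore implements ArticleStore {
  private rows = new Map<string, Quad[]>();
  put(id: string, quads: Quad[]): void {
    this.rows.set(id, quads);
  }
  get(id: string): Quad[] | undefined {
    return this.rows.get(id);
  }
}

const store: ArticleStore = new InMemoryArticleStore();
store.put('article-1', [{
  subject: { termType: 'NamedNode', value: 'http://example.com/article-1' },
  predicate: { termType: 'NamedNode', value: 'http://schema.org/name' },
  object: { termType: 'Literal', value: 'An example article' },
}]);
```

This keeps the "many possible forms" problem of plain JSON-LD behind one interface, so only the store implementation has to care about it.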

alien-mcl commented 4 years ago

I think you should ask yourselves how users could utilize your platform. A full SPARQL implementation is expensive: in most cases it ends up holding everything in memory so that queries are executed fast enough.

I remember we had a hybrid approach in one of our PoC projects where a single entity was held in a single document (its boundaries were defined well enough that it could be stored this way), as a single good old SQL row carrying a Turtle RDF serialization, but the code that fed the storage with data wrote to both SQL and an RDF triple store. SQL was the main data source for day-to-day work with documents, and the RDF/SPARQL endpoint was exposed for more advanced queries.

I believe the SQL side could be replaced with a document database holding JSON documents with a fixed JSON-LD context, with the RDF triple store maintained separately or in DTC-like environments. You could also consider the Virtuoso database, which integrates both SQL and RDF/SPARQL data stores, but when I was able to work with it (version 6, I think) it was pretty unstable.

giorgiosironi commented 4 years ago

I think you should ask yourselves how users could utilize your platform.

I agree on listing the use cases here. This project is a microservice whose main use case so far is to write and read a single article (version?), plus providing a complete listing of all articles. Whether more complex listings and queries fall into this service would inform the implementation; they usually don't, as we implement these in separate services, like a search service that indexes content. The decisions we take here don't necessarily need to be identical elsewhere.

giorgiosironi commented 4 years ago

a DB abstraction turns it to JSON-LD and stores in Postgres

Pubsweet's move from the niche PouchDB to the generic PostgreSQL lists production support, data integrity, tooling, and the size and activity of the community as the reasons.

thewilkybarkid commented 4 years ago

Useful comments from @tpluscode in https://httpapis.slack.com/archives/C1JP575EX/p1574418083179100 too.

This project is a microservice whose main use case so far is to write and read a single article (version?) plus providing a complete listing of all articles.

There will be an unclear amount of processing too (preview/published at least, but asset handling might (partially) happen here).

fixed JSON-LD context

I think this is going to be key. We can't move away from RDF, as we're supporting unknown properties, but having a consistent structure will be essential to being able to do anything useful (without extracting data out into a separate representation).
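As an illustration of what a fixed context buys us (the context and values here are made up): if every stored article is compacted against the same @context, jsonb queries can rely on stable keys, even though unknown RDF properties remain representable.

```typescript
// With a fixed @context, every stored article has the same compacted shape.
const context = {
  '@vocab': 'http://schema.org/',
  name: { '@container': '@set' },
} as const;

const article = {
  '@context': context,
  '@type': 'Article',
  name: ['An example article'],
};

// A jsonb query can then target a known key, e.g. article -> 'name',
// instead of handling every possible JSON-LD form of the same data.
```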

My current thinking is that we have a table with four columns: an auto-increment ID (purely for consistent ordering), an internal UUID, the JSON-LD, and a hash (for ETags).

Filters that I imagine we might need:

(all for access control).
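For the hash column, one option (a sketch, not a decided implementation; the function names are mine) is to hash a canonicalised form of the JSON-LD so that key order doesn't change the ETag. Sorting keys is a simplification here; proper RDF dataset canonicalisation (e.g. URDNA2015) would be more robust but also more involved.

```typescript
// Hash a canonicalised (key-sorted) form of the JSON-LD for use as an ETag.
import { createHash } from 'node:crypto';

const canonicalise = (value: unknown): unknown => {
  if (Array.isArray(value)) {
    return value.map(canonicalise);
  }
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
        .map(([key, child]) => [key, canonicalise(child)]),
    );
  }
  return value;
};

const etagFor = (jsonLd: object): string =>
  createHash('sha256')
    .update(JSON.stringify(canonicalise(jsonLd)))
    .digest('hex');
```

With this, two documents that differ only in key order get the same ETag, which is what a conditional GET/PUT needs.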

giorgiosironi commented 4 years ago

There will be an unclear amount of processing too (preview/published at least, but asset handling might (partially) happen here).

From a (tactical) DDD perspective, as long as it's a single Aggregate, the persistence mechanism can be very agnostic, as it exposes primitives such as loading a single Aggregate into memory and persisting the changes back as an atomic operation.

Thinking with that hat on, relational (or, more generally, ACID) storage may have the advantage of supporting storing state and events to be published in a single transaction.

Filters that I imagine we might need: Publisher, Journal

The level of detail of these is not completely defined due to https://github.com/libero/publisher/labels/multitenancy not being complete, but it's likely architecturally the same: a small set of attributes are associated with each article to link it to the tenant and other strongly isolated containers like the journal.

thewilkybarkid commented 4 years ago

Split the RDF/JS idea into https://github.com/libero/publisher/issues/346.

nlisgo commented 4 years ago

I've moved this back to To do. I did some investigation into an approach to set up the database table and begin storing POST requests in the database. I've updated the DoD to reflect smaller chunks of work that can be split into their own tasks, moved into To do, and referenced here.

giorgiosironi commented 4 years ago

Checking off a few tasks from the description that correspond to a merged PR. Is this open or closed?

giorgiosironi commented 4 years ago

@nlisgo to decide.

thewilkybarkid commented 4 years ago

I've read through the ticket, and this should definitely be closed.