Closed: thewilkybarkid closed this issue 4 years ago.
As it stands we're using JSON-LD everywhere (the theory being that developers are already familiar with JSON, which lessens the leap to RDF). This includes how we'll persist it (Postgres, with something like a table containing the local ID (probably a UUID) and the JSON-LD itself in a jsonb column). We won't need much querying/filtering, but it feels like this will make it hard to do so if we need to. It also means it's not aligned with the RDFJS group, so we wouldn't be able to take advantage of the various libraries around that (related: I did try out rdflib.js a little while ago but hit https://github.com/linkeddata/rdflib.js/issues/364).
A native RDF approach would be to use a graph database with SPARQL query + update (e.g. Amazon Neptune, Apache Jena). The big concern here is a lack of local expertise.
Also, performance is uncertain: we're looking at representing document content in RDF using https://github.com/stencila/schema, which could result in large, deep graphs for an article with a lot of content. Related to this is how to provide mechanisms to update the content (a question that exists for plain JSON-LD too).
https://github.com/zazuko/hydra-box is an interesting idea, and touches on connecting the Hydra API documentation to its implementation, but I'm not sure it really matches what we need. It is an example of using SPARQL, however.
@stephenwf, do you have experience with this or know anyone that does?
Waiting on the sidelines to see some service provider feedback on this. I can see system administrators being opinionated about running databases.
I believe @mattmcgrattan may be able to help with regard to SPARQL. I think, though, it's important to distinguish between using RDF as a document representation with simple filtering capabilities (something that Postgres + a JSON field will handle with ease) and an entity-based graph.
Matt can correct me if I'm wrong, but personally I think that a full SPARQL implementation may end up using a lot of memory, holding the raw triples in order to perform queries, and that we'll end up under-utilising it.
A good abstraction over the underlying DB will keep options open for swapping things out if a more data-driven model is needed. I don't think the behaviour of the article-store would change in this case; there would just be another service using the same DB to provide other functionality.
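As a minimal sketch of the kind of abstraction meant here (the interface and names are assumptions, not the actual article-store API):

```typescript
// Callers never learn whether the backing store is Postgres, a triple
// store, or memory; swapping stores means writing a new implementation.
interface ArticleRepository {
  get(id: string): Promise<object | undefined>; // a JSON-LD document
  set(id: string, article: object): Promise<void>;
  remove(id: string): Promise<void>;
  list(): Promise<ReadonlyArray<object>>;
}

// The simplest possible implementation, useful for tests.
class InMemoryArticleRepository implements ArticleRepository {
  private readonly articles = new Map<string, object>();

  async get(id: string): Promise<object | undefined> {
    return this.articles.get(id);
  }

  async set(id: string, article: object): Promise<void> {
    this.articles.set(id, article);
  }

  async remove(id: string): Promise<void> {
    this.articles.delete(id);
  }

  async list(): Promise<ReadonlyArray<object>> {
    return [...this.articles.values()];
  }
}
```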
In case it's useful, you can try one of the examples on https://hub.stenci.la/open/ and download it as Stencila's JSON-LD, which is probably pretty similar to what we'll be dealing with (though we'll probably have longer content and more metadata too).
I wonder whether using RDFJS (i.e. statements/named graphs) in code, with a DB abstraction that turns it into JSON-LD and stores it in Postgres, is a decent compromise.
Working with plain JSON-LD isn't particularly fun due to all the possible forms it can take.
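A rough sketch of that compromise using the jsonld.js library (the example data and exact options are assumptions):

```typescript
import * as jsonld from 'jsonld';

// Code works with RDF statements; the DB abstraction converts to JSON-LD
// at the storage boundary. The quads are shown here as an N-Quads string,
// which jsonld.js accepts directly.
const nquads = '<https://articles.example/1> <http://schema.org/name> "An article" .';

async function roundTrip(): Promise<void> {
  // RDF -> JSON-LD on the way into Postgres...
  const doc = await jsonld.fromRDF(nquads, { format: 'application/n-quads' });

  // ...and JSON-LD -> RDF on the way back out to the code.
  const quads = await jsonld.toRDF(doc, { format: 'application/n-quads' });

  console.log(doc, quads);
}
```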
I think you should ask yourselves how users could utilize your platform. A full SPARQL implementation is expensive: in most cases it ends up holding everything in memory so that queries execute fast enough.

I remember we had a hybrid approach in one of the PoC projects where a single entity was held in a single document (its boundaries were defined well enough that it could be stored this way) as a single good old SQL row carrying a Turtle RDF serialization, but the code that fed the storage with data worked on both SQL and an RDF triple store. SQL was the main data source for day-by-day work with documents, and the RDF/SPARQL endpoint was exposed for more advanced queries.

I believe the SQL side could be replaced with a document database holding JSON documents with a fixed JSON-LD context, with the RDF triple store maintained separately or in DTC-like environments. You could also consider the Virtuoso database, which integrates both SQL and RDF/SPARQL data stores, but when I was able to work with it (version 6, I think) it was pretty unstable.
I think you should ask yourselves how users could utilize your platform.
I agree on listing the use cases here. This project is a microservice whose main use case so far is to write and read a single article (version?) plus providing a complete listing of all articles. Whether more complex listings and queries fall into this service would inform the implementation; they usually don't, as we implement these in separate services like a search that indexes content. The decisions we take here don't necessarily need to be identical elsewhere.
a DB abstraction that turns it into JSON-LD and stores it in Postgres
Pubsweet's move from the niche PouchDB to a generic PostgreSQL lists production support, data integrity, tooling, and the size and activity of the community as the reasons.
Useful comments from @tpluscode in https://httpapis.slack.com/archives/C1JP575EX/p1574418083179100 too.
This project is a microservice whose main use case so far is to write and read a single article (version?) plus providing a complete listing of all articles.
There will be an unclear amount of processing too (preview/published at least, but asset handling might (partially) happen here).
fixed JSON-LD context
I think this is going to be key. We can't move away from RDF, as we're supporting unknown properties, but having a consistent structure will be essential to being able to do anything useful (without extracting data out into a separate representation).
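For what it's worth, a sketch of what a fixed context buys us, using jsonld.js compaction (this context is a made-up example):

```typescript
import * as jsonld from 'jsonld';

// Assumed context: every stored document is compacted against this, so a
// given property always appears under the same term in the jsonb column.
const fixedContext = {
  schema: 'http://schema.org/',
  name: 'schema:name',
  isPartOf: 'schema:isPartOf',
};

async function normalise(doc: object): Promise<object> {
  // Compaction rewrites whatever form the input takes (expanded, flattened,
  // differently aliased) into the fixed terms; unknown properties survive
  // as absolute IRIs rather than being dropped.
  return jsonld.compact(doc, fixedContext);
}
```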
My current thinking is that we have a table with four columns: an auto-increment ID (purely for consistent ordering), an internal UUID, the JSON-LD, and a hash (for ETags).
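As a rough sketch (table and column names here are illustrative, nothing is decided):

```typescript
import { Client } from 'pg';

// The four columns described above, as a Postgres table.
const createTable = `
  CREATE TABLE IF NOT EXISTS articles (
    sequence BIGSERIAL PRIMARY KEY, -- auto-increment, purely for consistent ordering
    uuid     UUID  NOT NULL UNIQUE, -- internal identifier
    article  JSONB NOT NULL,        -- the JSON-LD itself
    hash     TEXT  NOT NULL         -- content hash, used for ETags
  );
`;

async function migrate(client: Client): Promise<void> {
  await client.query(createTable);
}
```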
Filters that I imagine we might need:
- Publisher
- Journal

(all for access control).
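A sketch of how such a filter could look against the jsonb column, assuming compaction has fixed where the journal reference lives (property names are assumptions):

```typescript
import { Client } from 'pg';

// jsonb containment (@>) can be served by a GIN index on the article column.
async function articlesInJournal(client: Client, journalId: string): Promise<ReadonlyArray<object>> {
  const result = await client.query(
    `SELECT article
       FROM articles
      WHERE article @> $1::jsonb
      ORDER BY sequence`,
    [JSON.stringify({ isPartOf: { '@id': journalId } })],
  );

  return result.rows.map((row) => row.article);
}
```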
There will be an unclear amount of processing too (preview/published at least, but asset handling might (partially) happen here).
From a (tactical) DDD perspective, as long as it operates on a single Aggregate, the persistence mechanism can be very agnostic, as it exposes primitives such as loading a single Aggregate into memory and persisting the changes back as an atomic operation.
Thinking with that hat on, relational (or, more generally, ACID) storage may have the advantage of supporting storing state and events to be published in a single transaction.
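A sketch of that single-transaction advantage, using an outbox-style events table (all names are assumptions):

```typescript
import { Client } from 'pg';

// Persist the aggregate's new state and the events to be published in one
// transaction; a separate process later reads and publishes the events.
async function save(client: Client, uuid: string, article: object, events: ReadonlyArray<object>): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query('UPDATE articles SET article = $2 WHERE uuid = $1', [
      uuid,
      JSON.stringify(article),
    ]);

    for (const event of events) {
      await client.query('INSERT INTO events (payload) VALUES ($1)', [
        JSON.stringify(event),
      ]);
    }

    await client.query('COMMIT');
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  }
}
```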
Filters that I imagine we might need: Publisher, Journal
The level of detail of these is not completely defined, since https://github.com/libero/publisher/labels/multitenancy is not complete, but it's likely architecturally the same: a small set of attributes is associated with each article to link it to the tenant and to other strongly isolated containers like the journal.
Split the RDF/JS idea into https://github.com/libero/publisher/issues/346.
I've moved this back to 'to do'. I did some investigation into an approach to set up the database table and to begin storing POST requests in the database. I've updated the DoD to reflect smaller chunks of work that can be split into their own tasks, moved into 'to do', and referenced here.
Checking off a few tasks in the description that correspond to a merged PR. Is this open or closed?
@nlisgo to decide.
I've read through the ticket, and this should definitely be closed.
Store articles in a database rather than in memory.
Tasks
Related
Blocked by #290 (add article collection). Relates to #291 (creation of an article).