Use a triple store?

michielbdejong commented 5 years ago

TL;DR: Let's only use a triple store as a cache, not as the main source of truth.

Several people have asked me why this module uses opaque blobs for both Linked Data Platform Non-RDF Source (LDP-NR) and Linked Data Platform RDF Source (LDP-RS) resources.

So far, the only RDF-aware part that has been implemented on top of the raw data storage is retrieving LDP container listings, and they need to be populated by the AtomicTree data structure anyway. For instance, if there are two resources on the server, /foo/rdf.ttl and /foo/non-rdf.jpg, then the result of GET /foo/ is just the information that /foo/ contains ['rdf.ttl', 'non-rdf.jpg']. So the listing is based on the slash-terminated prefixes of the paths of other resources (/foo/ as a prefix of /foo/rdf.ttl and /foo/non-rdf.jpg), which is unrelated to the actual triples stored in rdf.ttl. Storing rdf.ttl in a triple store doesn't help us there.
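To make the prefix idea concrete, here is a minimal sketch of how a container listing could be derived from resource paths alone. `listContainerMembers` is a hypothetical helper for illustration, not the AtomicTree API, and it assumes the full list of paths is available in memory.

```typescript
// Hypothetical helper: derive the members of a container purely from resource paths.
// No triples are read; only the slash-terminated prefix matters.
function listContainerMembers(allPaths: string[], containerPath: string): string[] {
  if (!containerPath.endsWith("/")) {
    throw new Error("container paths must end with a slash");
  }
  const members = new Set<string>();
  for (const path of allPaths) {
    if (!path.startsWith(containerPath) || path === containerPath) continue;
    const rest = path.slice(containerPath.length);
    const slash = rest.indexOf("/");
    // A nested path such as /foo/bar/baz.ttl contributes the sub-container "bar/".
    members.add(slash === -1 ? rest : rest.slice(0, slash + 1));
  }
  return Array.from(members);
}

const paths = ["/foo/rdf.ttl", "/foo/non-rdf.jpg", "/other/doc.ttl"];
// Lists the two members of /foo/: rdf.ttl and non-rdf.jpg
console.log(listContainerMembers(paths, "/foo/"));
```

Note that the content type of each resource never enters the picture, which is why LDP-RS and LDP-NR blobs can be treated identically here.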

But three other things that will be RDF-aware will be the ResourceUpdater (PATCH), the GlobReader (GET /foo/*) and LDP paging. I think this merits loading LDP-RS resources into an in-memory triple store for the purpose of completing these requests, even if we don't use that triple store as a stateful component in the architecture (storage would still be handled uniformly by the AtomicTree, for both LDP-NR and LDP-RS).
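As a rough illustration of "triple store as a per-request tool, not as state", the sketch below loads a blob, patches it in a throwaway in-memory set, and writes the blob back. Everything here is an assumption made for the example: `BlobStore` stands in for the real storage layer, and the one-triple-per-line format stands in for a real Turtle or JSON-LD parser.

```typescript
// Hypothetical stand-in for the blob storage: path -> opaque body.
type BlobStore = Map<string, string>;

// Assumption: one triple per line (N-Triples-like). A real server would use an RDF parser.
function parseTriples(body: string): Set<string> {
  return new Set(
    body.split("\n").map((line) => line.trim()).filter((line) => line.length > 0)
  );
}

function serializeTriples(triples: Set<string>): string {
  return Array.from(triples).join("\n") + "\n";
}

// Load the blob, patch it in a temporary in-memory set, write the blob back.
function patchResource(
  store: BlobStore,
  path: string,
  toDelete: string[],
  toInsert: string[]
): void {
  const triples = parseTriples(store.get(path) ?? "");
  for (const t of toDelete) triples.delete(t);
  for (const t of toInsert) triples.add(t);
  store.set(path, serializeTriples(triples));
  // The in-memory set is discarded here; only the blob remains as state.
}

const store: BlobStore = new Map([["/foo/rdf.ttl", "<#me> <#likes> <#tea> .\n"]]);
patchResource(store, "/foo/rdf.ttl", ["<#me> <#likes> <#tea> ."], ["<#me> <#likes> <#coffee> ."]);
console.log(store.get("/foo/rdf.ttl"));
```

The storage layer only ever sees `get` and `set` on opaque strings, so the same store can hold non-RDF resources without any special casing.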

Also, for possible future features like shape validation, or at least Turtle validation, we would require Turtle and JSON-LD parsing anyway, so we might as well use an off-the-shelf triple store for that.

A possible future spec feature that could become optional (but as far as I know is not part of the current Solid spec) is server-side SPARQL. I know several people would love to see support for this. I'm hesitant though, because it would encourage single-database design of applications. For instance, if my app lists the opening times of shops near my house, then in order to enable fast queries, I would probably store all triples about shop opening times on my own pod, and then query that using server-side SPARQL. But a much nicer design of such an app would be to have shop opening times on each shop's pod. That's how I would love to see the data web. So to avoid early optimization, I would like to make the design choice that wac-ldp-kit does not foresee being used in combination with a SPARQL endpoint that queries the same data storage.

Of course, you could put a SPARQL service in front of wac-ldp-kit, and wac-ldp-kit exposes an on-change event that would allow you to keep that cache up to date. The only prickly issue there would be access control, but you could make the SPARQL endpoint specific to a WebID, e.g. only available to the pod owner, or something like that. I propose we choose to really consider server-side SPARQL as a built-on optimization and consider LDP access as the only main basis.

Thoughts? Also, apart from that, if anyone is aware of any current or planned spec features that merit using a stateful triple store, then please comment here.

michielbdejong commented 5 years ago

Looks like https://github.com/linkeddata/rdflib.js is the thing to use for GlobReader, ResourceUpdater, LDP Paging, and RDF validation / shape validation.

pmcb55 commented 5 years ago

@michielbdejong You raise many points in the above, and to be honest, some of it seems quite muddled to me. It'll be quicker if we chat I think :)

kjetilk commented 5 years ago

It is really a long story, but the fundamental premise of the discussion should be the realization that we're doing decentralization for social and ideological reasons. From a technical viewpoint, it doesn't make much sense. Within the academic community, it has actually been pretty hard to get people to consider decentralization as a contemporary research topic at all, since it has been researched so much in the past and found to make rather little sense. There is a pretty readable workshop paper that sums it up rather well: "Learning from the History of Distributed Query Processing: A Heretic View on Linked Data Management".

We all agree that we want data decentralized, like the example of each shop's opening time being hosted on their Pods. However, we shouldn't let the ideological standpoint get in the way of pragmatic solutions to the problems that we will inevitably face. Indeed, server-side SPARQL is an optimization, but everything that is needed for decentralization to be a feasible alternative can be categorized as optimizations.

So, if you can answer a quad pattern query over some data set, you can build a SPARQL engine around it. The LDP interface isn't the ideal atom, but it can be done. A more suitable atom is really the Triple Pattern Fragments interface towards a single resource, then the URL of the resource can reasonably be used as a graph name, and so, you have that quad pattern.
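A minimal sketch of what "quad pattern with the resource URL as graph name" could look like follows. The `Quad` shape, both function names, and the example URLs are assumptions made for illustration; this is not the Triple Pattern Fragments protocol itself, only the matching step a SPARQL engine could be built around.

```typescript
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph: string; // here: the URL of the document the triple came from
}

// An omitted position in the pattern acts as a wildcard.
type QuadPattern = Partial<Quad>;

function matchQuads(quads: Quad[], pattern: QuadPattern): Quad[] {
  return quads.filter(
    (q) =>
      (pattern.subject === undefined || q.subject === pattern.subject) &&
      (pattern.predicate === undefined || q.predicate === pattern.predicate) &&
      (pattern.object === undefined || q.object === pattern.object) &&
      (pattern.graph === undefined || q.graph === pattern.graph)
  );
}

// Hypothetical helper: tag every triple of a document with the document URL as graph name.
function quadsFromDocument(url: string, triples: Array<[string, string, string]>): Quad[] {
  return triples.map(([subject, predicate, object]) => ({ subject, predicate, object, graph: url }));
}

// Illustrative data only: two shops, each with opening hours in its own document.
const quads = [
  ...quadsFromDocument("https://bakery.example/hours.ttl", [["#shop", "ex:opens", "07:00"]]),
  ...quadsFromDocument("https://butcher.example/hours.ttl", [["#shop", "ex:opens", "09:00"]]),
];

console.log(matchQuads(quads, { predicate: "ex:opens" }).length); // matches across both documents
console.log(matchQuads(quads, { graph: "https://bakery.example/hours.ttl" })); // one document only
```

Fixing the graph position restricts the match to a single resource, while leaving it open spans every document the engine has access to.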

However, if a single quad pattern query is the only interface, you lose many other optimization possibilities, like the ones we discussed in Pushing complexity down the stack.

As I have said previously, I think caches are absolutely critical to the success of Solid, and they'll be on every level, from browser and app caches, to forward proxies, CDNs, reverse proxies, and close-to-pod caches, so that you can have a SPARQL query executing at least partially without network traffic. That is not to say that SPARQL-without-network traffic is the goal, but the goal is to have options available to query planners so that they have the flexibility to make a good plan. I think that is actually what would enable the "each shop has their opening hours on their own Pod" use case, not the unavailability of a server-side SPARQL endpoint. The plan has to be good enough for the performance requirements of the use case, that's the key. An LDP-only interface is very restrictive on the query planner, and therefore unlikely to cut it.

BTW, I couldn't sleep on the plane home from Boston, so I started to write a query planner that would support SPARQL queries over quad patterns with WAC: https://github.com/kjetilk/p5-web-access-control/blob/master/lib/Web/Access/Control/ It is actually fairly easy to do in the Attean framework. I don't know if it is also easy in Comunica, but I think it looks quite difficult in the Java frameworks. Just to say, it can be done. :-)

michielbdejong commented 5 years ago

Discussed this with @kjetilk and @pmcb55, conclusion:

In the interest of speed, I'll go ahead and build this module with two important assumptions:
- slashes in URL paths denote container membership, so if a document has URL foo/bar, then that tells us that it is a member of container foo/.
- the only indexing dimension explicitly supported by the storage layer is this container tree structure. So if you want to get all the triples from a specific document, that's fast. If you want to know all the members of a specific container, that's also fast. But if you want to know all triples that mention restaurants, then that's slow.

@pmcb55 suggested that as an additional requirement, I could spend 2 days looking at adding a SPARQL store in front of wac-ldp-kit, but I think the conclusion was that I would just keep this in mind, and then @kjetilk may work on this once he has finished the Solid-TestSuite project.

In the light of the second point, I'll make sure that AtomicTree exposes an on-change event, that will allow a triple store to do one of two things:

option A: read in all triples from all LDP-RS documents stored on the AtomicTree (can choose, including or excluding containers?), and then update that cached version when an on-change event happens. This allows the triple store to use its own backend, but does lead to a duplication of data.
option B: only index the triples, but don't copy the actual triples; so only use the triplestore's own storage backend to store the indexing information, and use the AtomicTree instance as the only store of the actual data. I don't know enough about triple stores to know if this would save a lot of disk space, but at least it seems to me that AtomicTree with the on-change event would be flexible enough to allow this.
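Option A could be sketched roughly as follows. `SketchTree` is a hypothetical stand-in for AtomicTree (the real class and its event signature may look quite different), and the cache assumes every body it sees is an LDP-RS document with one triple per line. The initial bulk read of existing documents is left out; only the update path is shown.

```typescript
type ChangeListener = (path: string, body: string | undefined) => void;

// Hypothetical stand-in for AtomicTree: blob storage plus an on-change event.
class SketchTree {
  private blobs = new Map<string, string>();
  private listeners: ChangeListener[] = [];

  onChange(listener: ChangeListener): void {
    this.listeners.push(listener);
  }

  write(path: string, body: string): void {
    this.blobs.set(path, body);
    this.listeners.forEach((listener) => listener(path, body));
  }

  remove(path: string): void {
    this.blobs.delete(path);
    // An undefined body signals deletion to the listeners.
    this.listeners.forEach((listener) => listener(path, undefined));
  }
}

// Option A: the cache keeps its own copy of the triples, refreshed on every change.
class TripleCache {
  private byDocument = new Map<string, string[]>();

  constructor(tree: SketchTree) {
    tree.onChange((path, body) => {
      if (body === undefined) {
        this.byDocument.delete(path);
      } else {
        this.byDocument.set(path, body.split("\n").filter((line) => line.trim() !== ""));
      }
    });
  }

  // Cross-document lookup, which the storage layer itself does not index.
  find(term: string): string[] {
    const hits: string[] = [];
    this.byDocument.forEach((triples) => {
      for (const triple of triples) {
        if (triple.includes(term)) hits.push(triple);
      }
    });
    return hits;
  }
}

const tree = new SketchTree();
const cache = new TripleCache(tree);
tree.write("/places/a.ttl", "<#a> a <#Restaurant> .");
tree.write("/places/b.ttl", "<#b> a <#Bakery> .");
console.log(cache.find("Restaurant"));
```

Because the tree only emits events and never consults the cache, the cache can be dropped and rebuilt without touching the source of truth. Option B would replace the stored triples in `byDocument` with index entries that point back at paths in the tree.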

I'll leave this issue open for another week or so in case more people want to comment.

michielbdejong commented 5 years ago

will create a sister module for server-side pod-wide search, in a separate sprint within the V-Next project

michielbdejong / wac-ldp-kit

Use a triple store? #5