solid / specification

Solid Technical Reports
https://solidproject.org/TR/
MIT License

Querying multiple subjects in one request #162

Open joepio opened 4 years ago

joepio commented 4 years ago

In an RDF-based app I was working on, we started out with a RESTful RDF API where the client used one request to fetch one resource (one GET per subject). This was quite costly, since every HTTP request has some overhead, and a single page view often required fetching many resources. To fix this, we implemented a 'bulk API', which basically boils down to an endpoint where you could POST a body of newline-delimited URLs, and the server would reply with one large N-Triples document containing all the statements for the requested subjects. Not a pretty solution, but it helped make the app faster.
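That bulk round trip can be sketched as follows; the helper names and request shape are hypothetical, reconstructed from the description above:

```python
def build_bulk_body(subject_urls):
    """Serialize the bulk request body: one subject URL per line."""
    return "\n".join(subject_urls)

def parse_ntriples(body):
    """Naive N-Triples reader: one statement per non-empty, non-comment line."""
    return [line.strip()
            for line in body.splitlines()
            if line.strip() and not line.strip().startswith("#")]

# The client would POST build_bulk_body([...]) to the bulk endpoint and run
# parse_ntriples on the response; both steps are shown offline here.
body = build_bulk_body([
    "https://example.org/alice",
    "https://example.org/bob",
])
```

The actual endpoint path and media types weren't specified, so the sketch stops short of the HTTP call itself.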

Currently, most Solid apps work like our old API: one request per subject. In order to provide a snappy UI, we need a more performant way to query pods.

Options

SPARQL endpoint

SPARQL is a very powerful query language that will definitely solve this problem, but it can be hard for new developers and costly on the server. Exposing a SPARQL endpoint is often not a great solution for production apps.

TPF endpoint (+ multiple subjects!)

Triple Pattern Fragments were designed as an easier, low-cost SPARQL alternative. However, the spec does not allow querying multiple subjects in one request. If we allowed multiple subjects in the subject field, this would address the aforementioned use case.

Perhaps there are other options, would love to hear your thoughts.

RubenVerborgh commented 4 years ago

Hi @joepio,

I'll reply in detail to this post, because this might become a reference point for discussions on query interfaces and Solid. I'll try to highlight some of the finer points, and clear up some possible misunderstandings.

In an RDF-based app I was working on, we started out with a RESTful RDF API where the client used one request to fetch one resource (one GET per subject). This was quite costly, since every HTTP request has some overhead, and a single page view often required fetching many resources. To fix this, we implemented a 'bulk API'

The first remark here is that the juxtaposition of REST and query-based APIs as two different categories is not correct. I don't mean to be pedantic, but in the discussions that will follow, it is important to get the terminology sufficiently precise.

What I assume is meant here is a comparison between:

a) document-per-subject interfaces, such as the one that is typically the result of LDP as we apply it in Solid. That is, the resources consumed by a client are documents that contain triples related to a specific concept. The division into documents is server-determined; that is, you receive a URL and you GET it. For instance, https://ruben.verborgh.org/profile/ contains triples related to me.

b) custom-selector-based interfaces, where the client is responsible for creating a specific query, and the response is a list of triples that satisfy that query. For example, https://data.verborgh.org/ruben?subject=&predicate=&object=https%3A%2F%2Fwww.w3.org%2FPeople%2FBerners-Lee%2Fcard%23i gives you triples in my dataset where Tim is the object.

Either of them can be implemented with compliance to the REST architectural style; in fact, both examples I have given comply.
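For the selector-based example in (b), the request is just a URL with query parameters; here is a minimal sketch using Python's standard library (only the dataset URL and the parameter names come from the example above):

```python
from urllib.parse import urlencode

def tpf_url(base, s="", p="", o=""):
    """Build a triple-pattern selector URL; empty fields act as wildcards."""
    return base + "?" + urlencode({"subject": s, "predicate": p, "object": o})

# Reconstructs the example URL: triples where Tim is the object.
url = tpf_url(
    "https://data.verborgh.org/ruben",
    o="https://www.w3.org/People/Berners-Lee/card#i",
)
```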

Not a pretty solution, but it helped make the app faster.

So, what the real problem seems to be here, is access to data about multiple subjects?

POST a body of newline-delimited URLs, and the server would reply with one large N-Triples document containing all the statements for the requested subjects

So here's the thing: under HTTP/2, you should find that simply performing GET requests to the individual URLs gives you the same, if not better, performance (because of caching). However, it is very likely that, if you try this with NSS, performance is bad; NSS has various efficiency problems with handling individual requests, and those costs accumulate.
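In other words, a client on HTTP/2 can fire the individual GETs over one multiplexed connection and merge the responses itself; the merging step is the only new code, sketched here on already-fetched N-Triples bodies (function name hypothetical):

```python
def merge_ntriples(responses):
    """Merge several N-Triples documents into one, dropping duplicate
    statements while keeping first-seen order."""
    seen, merged = set(), []
    for body in responses:
        for line in body.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return "\n".join(merged)
```

This reproduces the bulk-API response client-side, with the added benefit that each individual response remains cacheable.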

But so far, not an interface challenge to me.

SPARQL is a very powerful query language that will definitely solve this problem, but it can be hard for new developers and costly on the server. Exposing a SPARQL endpoint is often not a great solution for production apps.

Couple of things here:

  1. We need to clearly separate the query language from the interface language. They are separate things. (Check in particular this slide.) Whatever we send over the interface is independent of how developers instruct a client. Hence, the hard-to-learn argument need not apply. Furthermore, the original problem above was phrased as an efficiency problem, not a developer problem, so we definitely have to keep those concerns separated.

  2. Regarding server cost: note that the case that myself and others have made against SPARQL endpoints, concerns public SPARQL endpoints. If we are talking about authorized applications and/or users, that is not as strong of a concern. The bigger concern is the complexity of implementing a SPARQL interface over non-database backends, such as a file system. They would typically require a cache or another redundant system for performance reasons.

  3. Regarding a SPARQL endpoint not being a great solution for production apps: again, arguments made in that direction only hold for public endpoints. There is ample evidence of high-performance SPARQL solutions in closed contexts.

Triple Pattern Fragments were designed as an easier, low-cost SPARQL alternative.

To be very specific: TPF is a REST API that affords triple-pattern-based lookups. The idea is that, since triple-pattern-based access is conceptually simple compared to more advanced query functionality, it is easier to build on top of non-database backends. Furthermore, it leverages caching better, which makes it better suited for public interfaces and low-cost scenarios.

However, the spec does not allow for querying multiple subjects in one request.

That by itself, we should be able to mitigate with the same or better performance on HTTP/2, following a similar line of argumentation as above.

If we allowed multiple subjects in the subject field, this would address the aforementioned use case.

From the perspective of SPARQL query evaluation, this would essentially amount to brTPF (http://olafhartig.de/brTPF-ODBASE2016/).
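Roughly, a brTPF-style request attaches a list of bindings to an ordinary triple-pattern request. The sketch below is an illustration only: the `values` parameter name and the VALUES-clause-like serialization are assumptions loosely based on the brTPF paper, not a normative interface.

```python
from urllib.parse import urlencode

def build_values(subjects):
    """Serialize subject bindings in a VALUES-clause-like form (assumed)."""
    return "(?s) " + " ".join(f"(<{s}>)" for s in subjects)

def brtpf_url(base, subjects):
    """Ask for <s> ?p ?o for several subjects in a single request."""
    return base + "?" + urlencode({
        "subject": "?s",
        "predicate": "",
        "object": "",
        "values": build_values(subjects),  # parameter name is an assumption
    })
```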

From an efficiency perspective, we would not necessarily gain much.

I also want to note that, in both cases, we'd be retrieving triples of the form `<subject> ?p ?o`, which will likely differ from the triples listed in the document. So the original and modified proposals don't yield the same results.
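A tiny illustration of that difference, with made-up data: a document about `alice` may also list triples whose subject is someone else, and an `<alice> ?p ?o` pattern will not return those.

```python
# (subject, predicate, object) tuples; all identifiers are hypothetical.
document_triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),  # in the document, but a different subject
]

def match_pattern(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# The <ex:alice> ?p ?o pattern misses the ex:bob triple the document holds.
pattern_result = match_pattern(document_triples, s="ex:alice")
```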

TL;DR: I think that the question here is "fast access to multiple documents" and that the appropriate answer is "HTTP/2" (and a decent server implementation).

namedgraph commented 3 years ago

So to be clear: SPARQL support in Solid is not planned?

RubenVerborgh commented 3 years ago

To be more exact: there are no plans to mandate support for a full SPARQL endpoint on the server side. (That does not mean that SPARQL cannot be used or supported in Solid, since there is also the client side.)

gibsonf1 commented 3 years ago

How about a very simple standard endpoint on Solid, in the format of /search/, that takes a string of URL-encoded characters (similar to typing in a Google search box), and that the Solid server can respond to in whatever format the client requests, such as Turtle?
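As a sketch of what such a request might look like: the /search/ path comes from the comment above, while the URL-encoding helper and the use of an Accept header for content negotiation are assumptions.

```python
from urllib.parse import quote

def search_request(pod_base, query, accept="text/turtle"):
    """Build the URL and headers for the suggested free-text search (hypothetical)."""
    url = pod_base.rstrip("/") + "/search/" + quote(query)
    return url, {"Accept": accept}

url, headers = search_request("https://alice.example", "profile photo")
# url == "https://alice.example/search/profile%20photo"
```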