solid / specification

Solid Technical Reports
https://solidproject.org/TR/

SEARCH and/or GET+Query method #229

Open bblfish opened 9 years ago

bblfish commented 9 years ago

In the section "Reading data using SPARQL" I suggest instead using the SEARCH method (see the recent draft-snell-search-method-00 Internet-Draft, which is currently being discussed on the HTTP mailing list and is gaining momentum).

I have implemented that already in rww-play, as described in this curl interaction:

 $ curl -X SEARCH -k -i -H "Content-Type: application/sparql-query; charset=UTF-8" \
    --cert ../eg/test-localhost.pem:test \
    --data-binary @../eg/couch.sparql https://localhost:8443/2013/couch
HTTP/1.1 200 OK
Content-Type: application/sparql-results+xml
Content-Length: 337

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
    <head>
        <variable name='D'/>
    </head>
    <results>
        <result>
            <binding name='D'>
                <literal datatype='http://www.w3.org/2001/XMLSchema#string'>Comfortable couch in Artist Stables</literal>
            </binding>
        </result>
    </results>
</sparql>

Given that most other WebDAV methods are already implemented (see issue solid/solid-spec#3), this should be an easy addition, and it seems less ad hoc than what is currently being suggested, namely:

GET /data/ HTTP/1.1
Host: example.org
Query: SELECT * WHERE { ?s ?p ?o . }
bblfish commented 9 years ago

Having said that, I have been recently arguing for the GET method on the IETF mailing list.

elf-pavlik commented 9 years ago

@RubenVerborgh how do you see such a Query HTTP header fitting with Linked Data Fragments, and in particular the possibility of using Triple Pattern Fragments there alongside SPARQL?

seeAlso:

bblfish commented 9 years ago

SEARCH or GET with a query would allow any type of query language to be used. GET with a Query body or header would of course be cacheable. I am not sure why SEARCH should not be; that was the question I put to the SEARCH proposal.
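For what it is worth, GET+Query responses could be made cacheable with standard machinery if the server lists the query header in Vary, so that caches key their entries on the header value as well as on the URL. A rough sketch, assuming the Query header from the snippet above and server-chosen cache directives:

GET /data/ HTTP/1.1
Host: example.org
Accept: application/sparql-results+xml
Query: SELECT * WHERE { ?s ?p ?o . }

HTTP/1.1 200 OK
Content-Type: application/sparql-results+xml
Vary: Query
Cache-Control: max-age=60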

elf-pavlik commented 9 years ago

@bblfish wouldn't the use of a URI Template cover many of the common cases? E.g. https://github.com/linkeddata/SoLiD/blob/master/UserStories/PrivateSharing.md#send-notice-to-jane

see: http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/#controls
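For readers unfamiliar with that approach: a Triple Pattern Fragments server advertises an RFC 6570 URI template, and each triple pattern the client is interested in is expanded into its own URL and fetched with a plain GET. A hedged illustration (host, path and the chosen pattern are made up):

Advertised template: http://example.org/data{?subject,predicate,object}

$ curl -H "Accept: text/turtle" \
    "http://example.org/data?predicate=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fname"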

bblfish commented 9 years ago

@elf-pavlik URI Templates increase the number of resources, are bad for caches, have no semantics (so they require a mapping to some semantics), are less flexible, etc. So no. If you want to do things correctly then SEARCH/GET+Query is much better.

Mind you, having said that, it's already in the SoLiD proposal as GET + Header, so I don't think we need to keep this issue open anymore.

bblfish commented 9 years ago

I don't know any efficient HTTP cache that considers the request body for cache keying,

It's up to us to write them. Because of CORS we have to go through proxies anyway, and our local clients need to have caches of remote graphs too. It is much more complex to write caches if each resource has a million different URLs, when each of these is just a partial representation of the same resource. That is just an extension of what HTTP/1.1 Partial Content does, and for which there is even a 206 result code.

Btw, the neat thing about GET+Header or GET+Body is that if the server does not know about it, it returns the full representation of the resource. You can try it out: this actually works on all current servers, which are told to ignore bodies on a GET since the semantics for them are not yet defined.

Furthermore, I need addressability of my resources. In particular, a TPF resource must be able to refer to itself.

The Oracle/IBM/... SEARCH proposal mentioned allows for a Location header for referenceability if needed.

I would say that URI templates do not increase the number of resources.

It clearly does: every URL refers to a resource, and there is no way for caches to know that two URLs are partial representations of the same resource.

That's why we have templates: to restrict the degree of freedom.

You can do that by restricting the query language too.

bblfish commented 9 years ago

Btw the neat thing about GET+Header or GET+Body is that if the server does not know about it the resource returns the full representation.

That seems disastrous actually if that resource is a 500M triple dataset.

That is where the SEARCH method is useful, as it does not have that effect.

The Oracle/IBM/... SEARCH proposal mentioned allows for there to be a Location header for refereability if needed.

How can the cache possibly know that the two SPARQL queries SELECT * { ?a ?b ?c } and SELECT * { ?x ?y ?z } are equivalent without actually parsing SPARQL?

Clearly, if it can't understand SPARQL it can't understand SPARQL; if it can, it can. I am not sure where you are going with that. Also, SPARQL need not be the only query language required. And this does not exclude the current way of doing things.

bblfish commented 9 years ago

How exactly would SEARCH reduce the number of resources compared to GET?

You mean: "how can SEARCH reduce the number of resources compared to GET+URL-templates?"

First note that GET + Query header or body and SEARCH have the same number of resources.

The reason for using a GET+Query header or body is the same as the reason for using HTTP Partial Content: the resource is always the one identified by the URL of the document on which you do the GET.

In the case of template queries the number of URLs is obviously larger: one for each expansion of the template. So instead of one URL you can have millions or more. I don't really see how you can fail to see that, or how you can fail to see that intermediaries may not know that these all map to the same resource.

And if we need to assume the cache understands a query language X, can't we just have the same assumption with GET then as well?

In the case of GET + Query header or body, if the server fails to understand the query it can just send back the full content, and all intermediaries can cache it without loss as usual. If a cache understands how to cache a query, it could potentially build up a partial representation of the remote resource, as with HTTP Partial Content. In the case of GET + URL templates, if the cache fails to understand the template it will cache vastly more information than it actually needed to cache, and it may serve data that is out of date when it should have known that a PUT, PATCH, etc. invalidated it.
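To make the invalidation point concrete: under RFC 7234 a cache that sees a non-error response to an unsafe request such as PATCH invalidates its stored responses for that request's URL, so any GET+Query variants stored under /data/ are dropped together with the full representation, whereas entries for template URLs such as /data/?query=q1 are left untouched. A hedged sketch (URLs and the update are hypothetical):

PATCH /data/ HTTP/1.1
Host: example.org
Content-Type: application/sparql-update

INSERT DATA { <#couch> <http://purl.org/dc/terms/description> "new couch" . }

A conforming cache now invalidates what it holds for http://example.org/data/, but it has no reason to invalidate http://example.org/data/?query=q1, ?query=q2, and so on.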

The query language comes with a well-defined mime type, which makes interpretation of it possible. URL templates do not have mime types. In fact it is not a recommended part of web architecture to look into URLs to guess meanings. But you could clearly do something to map URL templates to SPARQL queries by following links, which means extra HTTP requests, which one always wants to avoid. There is a use case for that. But there is also a use case for SEARCH or GET + Query header or body, one which I find very useful.

SEARCH or GET+Query header or body gives the client more freedom, and is better for caches.

bblfish commented 9 years ago

Note also that it currently does the right thing. If the server does not understand the right headers it returns the full content:

For GET + Query header:

$ curl -H "Query: DESCRIBE <#me>" -H "Accept: text/turtle" \
          -H "Query-Content: application/sparql-query" \
         http://bblfish.net/people/henry/card

or for GET + Body

$ curl -X GET -H "Accept: text/turtle" \
         -H "Content-Type: application/sparql-query" \
         --data-binary "DESCRIBE <#me>" \
        http://bblfish.net/people/henry/card

You always have the same URL, http://bblfish.net/people/henry/card, whatever query you send with the request. This is made explicit in HTTP/1.1 section 4.3:

if the request method does not include defined semantics for an entity-body, then the message-body SHOULD be ignored when handling the request.

I don't know if the same is made explicit for query URLs. Should the query part of a URL always be ignored by the server if it does not understand it?

bblfish commented 9 years ago

SEARCH or GET + Query header or body gives the client more freedom, and is better for caches.

You haven't convinced me of that. It's simply putting the query in a different place (body or header instead of in a URL template), and caches still have to work as hard (even a little more, because they also need to use the mime type in the cache key, not just the URL).

The difference is simple: you create nearly infinitely more URLs in the template case. I suppose you don't see this because you must be thinking of all URLs with attribute values after the query as being the same URL without those attribute-value pairs, i.e. you must be counting the following all as one URL:

But they are different URLs: in fact there are 4 URLs there. And they usually refer to different things (unless the resource describes them all as being owl:sameAs each other). In any case, for the purpose of most caches they refer to different things. In the GET+Query or SEARCH case there is always only one URL which is the object of the request: in the above case that would be http://example.org/card

In the GET+Query or SEARCH case the intermediary cache can simply choose not to cache the response (that is what the 206 Partial Content response code is for, btw). Whereas in the template case the cache has to cache each different representation, because it does not know it is just partial content of the original.

Also, there are queries that you just can't put in a query URL because of URL length limitations.

So

bblfish commented 9 years ago

The same can be achieved with query= like SPARQL does (and if you want to support different languages, just offer different templates, like bla/sparql?query= and bla/other?query=.) Bonus: this way, the supported query languages are explicitly indicated through hypermedia.

You can do that. But that creates many different URLs that are not tied to the original resource. What I am interested in with GET+Query is having all the queries tied to the same URL: the resource that is being queried, for Partial Content functionality.

Any cache can choose to ignore URLs that contain ?.

Can you refer to an RFC for that?

Again, disagree. I still don't believe the number of resources would be any different.

They are; that's basic Web Architecture, and a question of epistemology. A server needs to have special information to deduce that two different URLs refer to the same resource. In the examples given above the URLs furthermore do refer to different resources: different parts of a specific resource, if you wish. I suggest you look up Range Requests first and ask yourself why those have been put in place.
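For comparison, this is what the byte-level analogue looks like under RFC 7233: the client asks for part of the representation, the URL stays the same, and the 206 status tells intermediaries they have received only a part. A minimal sketch (the lengths are invented):

GET /people/henry/card HTTP/1.1
Host: bblfish.net
Range: bytes=0-499

HTTP/1.1 206 Partial Content
Content-Type: text/turtle
Content-Range: bytes 0-499/4212
Content-Length: 500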

bblfish commented 9 years ago

@RubenVerborgh wrote:

And if you say that, thanks to the Location header, we still have addressability, then I'd say you have the same problem all over again, because the URL in the Location header also has infinity variants.

It's best not to ask me but to go to the original document, draft-snell-search-method-00:

In some cases, the search arbiter might choose to respond indirectly to the SEARCH request by returning a 3xx Redirection with a Location header specifying an alternate Request URI from which the search results can be retrieved by using an HTTP GET request.

That is, the Location header is optional. It is a way to bridge both worlds, if you wish.
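A hedged sketch of that indirect pattern (the choice of 303 and the results URL are invented for illustration):

SEARCH /2013/couch HTTP/1.1
Host: localhost:8443
Content-Type: application/sparql-query

SELECT ?D WHERE { ?s ?p ?D . }

HTTP/1.1 303 See Other
Location: https://localhost:8443/2013/couch/results/7f3a

after which the results can be fetched with an ordinary, cacheable GET on the Location URI.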

bblfish commented 9 years ago

So please pinpoint the error in my reasoning above, so that I can see why there would be less different GET+Query requests than different GET+Template requests.

In the GET+Query case, as with SEARCH, you only have one URL to which the requests go; in the template case, an infinite number of URLs. You can see this in the first line:

GET /card HTTP/1.1
Host: example.org
Content-Type: application/sparql-query
Accept: text/turtle
Content-Length: 14

DESCRIBE <#me>

With the response:

HTTP/1.1 206 Partial Content
Content-Type: text/turtle
Content-Length: ...

All other queries using GET for the resource <http://example.org/card> will start with that same line, namely GET /card HTTP/1.1. They may have different bodies clearly, but the resource that is the object of the request is always the same.

In the template cases you have instead the following

GET /card?query=q1 HTTP/1.1
GET /card?query=q2 HTTP/1.1
GET /card?query=q3 HTTP/1.1

etc...

In each case the attribute of the method GET is a different URI: in one case http://example.org/card?query=q1, in the next http://example.org/card?query=q2, in the next http://example.org/card?query=q3. These are different resources. That's all there is to it. Clients and intermediate caches cannot assume these are the same resources. That's why I say it's a question of epistemology, i.e. of knowledge. In the GET+Query case it is always the same URL, namely in the example http://example.org/card. The mappings that you want to make are immaterial to the argument. GET+Query forces the issue that they are the same resource. That is what I want. It NECESSITATES them to be about the same resource, and therefore it necessitates the notion of partial content. That is why I keep referring you to Partial Content in the HTTP RFC.

The template URLs do not have that necessity.

bblfish commented 9 years ago

then our argument simply comes down to a different definition of “resource”.

The question is what it is that is operated upon. In the definition of resource that I take from the HTTP specs, which are definitional, the resource operated upon is the same, independent of the query that is requested of it. That is why Range Requests and the 206 Partial Content HTTP code are part of the standard. So this definition has wide and explicit support in the current HTTP specs, and has a whole spec (RFC 7233) to support the importance of it.

Your definition makes distinctions at the level of the representation, rather than at the level of the resource. It works at the level of the resource only with template URLs, which name each one of the triples, i.e. each one of the representations.

I.e., you could name each of your triples with a template URL:

</card?d=me> names (http://example.org/card, application/sparql+query, DESCRIBE <#me>)
</card?d=her> names (http://example.org/card, application/sparql+query, DESCRIBE <#her>)
...

But then according to your own definition each of the template URLs refers to a different resource: namely a different triple.

I believe that you are inclined to map the response representations returned by the 206 Partial Content to your triple resources, because you are interested in representations. (And indeed the SEARCH proposal's optional Location header provides a bridge between these two concepts.) But what gets lost, if you only consider the representations as being nameable, is the unity of the resource which is acted upon. Hence you lose the caching features and the unifying aspect of the original resource on which the action is directed. HTTP is about acting on resources via the transfer of representations.

Another way of understanding this is via the motto: SEARCH or GET+Query is to GET what PATCH is to PUT. PATCH does not do anything different from PUT: it updates a resource, but more efficiently. Similarly, SEARCH or GET+Query just fetches a resource more efficiently. (And this efficiency requirement is what led to RFC 7233.) Template URLs potentially get you the same representation, but at the cost of also creating new resources, each with a new URL, one <URL,mimetype,query> triple for each representation. So it goes beyond just a simple GET. And this is what we want to avoid (though of course we don't make the other approach illegal).
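The analogy can be spelled out on the wire. PATCH sends just the change to the same URL that a PUT would replace wholesale; here is a hedged sketch using SPARQL Update as one possible patch format:

PATCH /people/henry/card HTTP/1.1
Host: bblfish.net
Content-Type: application/sparql-update

INSERT DATA { <#me> <http://xmlns.com/foaf/0.1/nick> "bblfish" . }

And SEARCH (or GET+Query) asks for just the part of the same URL that a plain GET would return in full:

SEARCH /people/henry/card HTTP/1.1
Host: bblfish.net
Content-Type: application/sparql-query

DESCRIBE <#me>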

bblfish commented 9 years ago

just like the resource addressed in a SOAP service is also always the same for each request, regardless of the actual interaction.

Except it is very different from SOAP, since in SOAP the message contained information about what methods were to be run on which resource. This is not the case here, the proof being RFC 7233. Are you saying that RFC 7233 is a SOAP-like protocol? Do you think that it would pass muster with Roy Fielding, who was one of the editors? Come on. Please take some time to reconsider.

bblfish commented 9 years ago

When doing a (http://example.org/card, application/sparql+query, DESCRIBE <#me>) request, the resource the user is interested in is "the #me part of card", not "card". This is why I find it strange to identify the latter instead of the former.

That is not true in my case. I am writing client apps that sometimes need small portions of a graph to do the rendering quickly. Perhaps an app just needs the name and email of a person in a foaf profile. It does not want a different resource, just a part of the resource it would have gotten in full.

Sorry, the topic of this issue is SEARCH and GET. If @elf-pavlik wants to open a different topic, please do so in a different issue. You have fewer options in the template system, which is why a somewhat smaller number of results can come back. But that's a different issue.

I brought up RFC 7233 because it fundamentally contradicts your argument about the RESTfulness of the proposed protocol. In short, if you wish to argue against GET+Query you are arguing against RFC 7233. The discussion channel for that is the http-wg mailing list.

elf-pavlik commented 9 years ago

Very interesting discussion @RubenVerborgh & @bblfish, thanks for taking the time to understand each other better and clarify various subtle differences. I guess the difference comes from the types of data sets you two tend to work with. For public, open-knowledge data sets like DBpedia, Wikidata, etc., URI Templates seem to make a lot of sense. In the case of social networking and resources with access control, SEARCH or GET+Query might provide simpler ways of setting ACLs, especially if resources are persisted as separate files on a file system rather than in a triple/quad store. I'm not sure whether a system which uses a triple/quad store could check the ACLs of queries as easily as a system which stores data fragmented across a file system (tree).

bblfish commented 9 years ago

My argument was not at the level of whether information is stored in a graph store or as files. HTTP is defined in terms of resources, which are identified by URIs. How the information is stored is immaterial to HTTP, as it is a communication protocol.

Template URLs are not excluded by GET+Query. Of course, if one implements GET+Query one would also not tend to implement template URLs, as the former will, depending on the query language, tend to be a superset of the templating option. All queries in a template can be expressed in GET+Query, but usually not vice versa.
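To illustrate the superset claim (URLs and data are hypothetical): a triple-pattern template lookup has a direct GET+Query equivalent, but a query that joins two patterns only has the GET+Query/SEARCH form.

Template form, limited to one pattern:

GET /data?predicate=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fknows HTTP/1.1
Host: example.org

GET+Query form, which can also express the join the template cannot:

GET /data HTTP/1.1
Host: example.org
Content-Type: application/sparql-query

SELECT ?friendName WHERE {
  <#me> <http://xmlns.com/foaf/0.1/knows> ?friend .
  ?friend <http://xmlns.com/foaf/0.1/name> ?friendName .
}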

What GET+Query gives the client is the secure knowledge that it is getting a part of the resource that it wanted, and not another resource. The templating option, given that it creates a different URI for each <uri,mime-type,query> triple, does not provide the same transparency in this regard. In the GET+Query case clients and intermediaries know a priori that they are dealing with the same resource, as in the case of RFC 7233; in the other case it needs to be deduced a posteriori: it is something that needs to be established empirically. The client or cache needs to do extra work to determine whether or not the two URIs refer to different parts of the same resource. This extra proof requires time and is prone to error.

RFC 7233 allows one to page through a resource without creating new URIs for each page. It is done for exactly the same reason as the proposed GET+Query. Without it there could be an infinite number of pages for each original resource, which caches would be keeping around, sometimes ignorant of the fact that these are just pages of the original resource, and that they would have lost nothing by merging these pages together into one resource to save space.

The use case for GET+Query is, for example, a JS client that wants to get parts of a resource in order to quickly display some information in the browser. But if it gets more information using another GET+Query in a later part of the Single Page Application, it will want to be able to merge the two pieces of information into the same local named graph, rather than having a number of different named graphs, each with partial information, hanging around. This is exactly the same use case as RFC 7233, and that is why this is just an extension, but at the semantic layer rather than at the representation or binary layer that the currently defined implementations of RFC 7233 promote.

In terms of <uri,mime-type,query> triples identifying each representation of a query (call them RQ3) we can say:

This does not mean that template URLs don't have their place. But for the use cases we are implementing in SoLiD the disadvantages outweigh the advantages.

elf-pavlik commented 9 years ago

RFC 7233 allows one to page through a resource without creating new URIs for each page. It is done for exactly the same reason as the proposed GET+Query. Without it there could be an infinite number of pages for each original resource, which caches would be keeping around, sometimes ignorant of the fact that these are just pages of the original resource, and that they would have lost nothing by merging these pages together into one resource to save space.

Let's take the example of this container, https://twitter.com/timberners_lee/followers, which contains 206K resources. How would we page it with: 1) URI Templates, 2) GET+Query?

Does it make sense to cache pages if the list of followers grows often? In the case of IRC or mailing-list archives it makes a lot of sense to cache if we page them by day/month.

But if each new addition (e.g. to the followers list) changes the paging, then I don't see such a big gain in caching it, at least not for longer than some minutes/hours.

seeAlso:

bblfish commented 9 years ago

How would we page https://twitter.com/timberners_lee/followers with: 1) URI Templates 2) GET+Query

In the GET+Query case, or with SEARCH, the client can decide in the query how to page the resources. This ability of the client to decide how to page resources is why these methods are so much more interesting to clients than simple paging specs like LDP Paging or even Activity Streams 2.0 paging, where the server decides on the paging order irrespective of what information the client actually needs. In fact that is why the LDP Paging spec has not had much adoption, and why a number of LDP Working Group members (among them Oracle and IBM) are pushing for SEARCH. If the data served by Twitter allowed the client to use SEARCH or GET+Query, using whatever query language is appropriate, then the client could for example request just the followers of Tim Berners-Lee that signed up in the last few weeks, or who live in England, or ... (depending on the detail of the information made available there). They could also potentially ask for the results to be returned in alphabetical order, or in other sort orders. The client could request only the amount of information needed given the size of its display. All of this would be useful even if the data updates very quickly and even if caches keep being invalidated.
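A hedged sketch of what client-controlled paging could look like against such a container (the host, vocabulary and query are invented; the server would need to understand SPARQL's ORDER BY, LIMIT and OFFSET):

SEARCH /timberners_lee/followers HTTP/1.1
Host: twitter.example
Content-Type: application/sparql-query
Accept: application/sparql-results+xml

SELECT ?follower ?since WHERE {
  <> <http://example.org/ns#follower> ?follower .
  ?follower <http://example.org/ns#followingSince> ?since .
}
ORDER BY DESC(?since)
LIMIT 20 OFFSET 40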

The same could be done with URI templates if they allowed full-blown query languages to be placed in the URL (which in any case is ugly). If they don't allow full-blown query languages, then the client is limited to the queries allowed by the templating language, and, as I pointed out in my previous response, finding out what those limitations are would come at the price of a number of HTTP requests, which is costly.

The templating answer requires the server either to specify Cache-Control: no-cache, or else caches might cache many, many times more information than they need to.

In short: the HTTP GET+Query solution is more flexible, requires fewer URIs, is better for caches, requires fewer HTTP requests, degrades gracefully, and does not exclude the template answer.

bblfish commented 9 years ago

I noticed that the previous solution with GET and the Query header has been removed from the SoLiD spec. I think it should be there as an optional feature.

It certainly makes sense for LDPRs that are not LDPCs. For SEARCH queries on LDPCs, which could be very powerful, there is a stronger case for a proposal such as @RubenVerborgh's of a template query which limits the queries to a specific subset.

csarven commented 2 years ago

https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-safe-method-w-body-02