w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

Support for search #48

Closed azaroth42 closed 8 years ago

azaroth42 commented 9 years ago

The protocol should support search and retrieval of annotations according to user/client specified criteria (a query).

This is a tracker issue for progress, which will involve at least the following steps:

azaroth42 commented 9 years ago

One example of an Open Annotation based annotation search specification: http://search.iiif.io/api/search/0.9/

akuckartz commented 9 years ago

If the data is provided using Triple Pattern Fragments (http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/) then no additional specification is needed because SPARQL can be used on the client.

azaroth42 commented 9 years ago

Unless I've missed something, TPF is still just a single author draft associated with a community group and has no formal standing. So we can't normatively refer to it.

And requiring SPARQL in a browser app ... if we were working that deep in the RDF stack, we wouldn't be having all these discussions! :)

azaroth42 commented 8 years ago

Proposal: Review the IIIF Search API (http://search.iiif.io/api/search/0.9/) down to section 3.4.1 as it follows the same patterns as ActivityStreams Paged Collections. If we can bootstrap from that, then we have a starting point. If not, we should discuss whether we can reasonably deliver anything in the search space before the end of the charter.

iherman commented 8 years ago

@azaroth42 (in response to your proposal).

I am not yet in position to say yay or nay to your original question, just musing here to, hopefully, add arguments to the issue for further discussions.

  1. If we go with the IIIF Search API as a starting position, I believe that the main (only) part we would take over is the equivalent of section 3.2. Indeed, I would expect that the returned data should be exactly the same as the response in our protocol spec, i.e., what is described in the annotation containers, rather than what is in IIIF. Maybe the only other thing that I would think about taking over is the search:Hit object, so to say, i.e., additional information in the returned annotation that gives, essentially, the selector that led to that particular response. Other than that, we should not have a different response to a query than what we already have for a direct request. (We may have to add some additional parameters in the response header via Link, I am not sure.)
  2. Looking at section 3.2 and, in particular, the table describing the query arguments, I would
    1. Add the role argument alongside the motivation (well, whatever the name of that thing will be as an outcome of issue #112)
    2. I am not sure that the user and the box arguments have a role to play for us

If we adopt this, there are questions that we will have to answer, though. Some that come to my mind:

  1. One aspect is not clear to me in IIIF. The spec suggests that the value of the q argument could be a regular expression (there is a b* example somewhere down the line). Which would be fine, but this is not what the spec says:

    A space separated list of search terms. The search terms may be either words (to search for within textual bodies) or URIs (to search identities of annotation body resources).

    What is exactly the situation in IIIF? (B.t.w., why space separated? Shouldn't it be comma separated for a URI?) What would be the useful thing for us, i.e., would we want regex or something else?

  2. What happens if the annotation includes several bodies? Do we return an annotation with only the selected body, or do we return the full annotation? In some sense, is the original annotation the indivisible entity, or can a server produce a bona fide annotation by restricting it to the selected body?
  3. Don't we need a similar query on the target rather than the body? I.e., instead of a q parameter, having something like t for targets, searching either for the target's URI or for a word in case a selector is used.
  4. On a more general level, would that be the search facility for the annotation protocol? I think even if we decide to go ahead, it should be formulated as one of the possible search formulations, and that implementations are free to adopt other, more powerful facilities. While I agree with your response to the comment of @akuckartz, we may want to use a more general framework, saying
    • implementations may define/use other search facilities (although they must implement what we define as the basis)
    • the responses of all the search facilities should follow the annotation containers part of the spec.
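
A minimal sketch of how a client might assemble such a query, assuming a hypothetical search endpoint at http://example.org/annotations/search. The q and motivation names come from the IIIF table discussed above; t is only the target parameter floated in question 3 and is not defined by any spec.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base, terms=None, motivation=None, target=None):
    """Assemble a GET search URL from optional criteria.

    All parameter names are illustrative; "t" in particular is hypothetical.
    """
    params = {}
    if terms:
        # Space-separated words or URIs, as in the IIIF wording quoted above.
        params["q"] = " ".join(terms)
    if motivation:
        params["motivation"] = motivation
    if target:
        params["t"] = target
    # urlencode percent-encodes URIs and turns spaces into "+".
    return base + "?" + urlencode(params)


url = build_search_url(
    "http://example.org/annotations/search",  # assumed endpoint
    terms=["bird", "http://example.org/body1"],
    motivation="commenting",
    target="http://example.org/page1",
)

# Decoding the query string recovers the original criteria.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Because the terms are space separated, a URI used as a term survives the round trip only if it contains no unencoded spaces, which is one practical consequence of the separator question raised above.
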

iherman commented 8 years ago

@azaroth42 will provide a straw version

The WG will consider a separate document defining a non-exclusive search interface to be published at least as a Note and potentially part of Protocol

First version would come around 15th of January, '16

Telco: http://www.w3.org/2015/12/16-annotation-irc#T16-57-30

gsergiu commented 8 years ago

Hi all,

I'm working on implementing a search API based on Lucene/Solr (http://lucene.apache.org/solr/), which is a kind of de facto standard for text/metadata search.

  1. I think it is important to take a look at the Solr query syntax, the advanced search functionality (e.g. faceted search, query filtering, query correction suggestions, etc.) and the Solr admin functionality in order to gather proper requirements for the Search Annotations API.
  2. There are two (or more) types of search scenarios that I would expect to be supported by the search API: a) free-text search (i.e. like a basic Google search); b) metadata-based search (i.e. like Amazon search: http://www.amazon.com/s/ref=sr_nr_p_n_shipping_option-_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A17%2Ck%3Aannotations%2Cp_n_feature_twelve_browse-bin%3A10159408011%2Cp_n_feature_nine_browse-bin%3A3291437011%2Cp_n_shipping_option-bin%3A3242350011&keywords=annotations&ie=UTF8&qid=1451924254&rnid=2944662011 )

  3. If it is about the design of the API interface, I find the Freebase MQL to be the easiest-to-use JSON-based query language: https://developers.google.com/freebase/v1/mql-overview?hl=en . Unfortunately, Freebase development was discontinued in favour of the Wikidata project and the MQL interface is not accessible anymore. Still, I find the idea of using JSON input and JSON output, where the properties of the query language are the same as those of the model, very good.
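
As a rough sketch of the two scenarios in point 2, the following builds Solr-style request parameters. The q, fq, start and rows parameters and the `*:*` match-all query are standard Solr; the field name motivation and the helper solr_params are assumptions made up for illustration, not part of any annotation schema.

```python
from urllib.parse import urlencode, parse_qsl


def solr_params(text=None, filters=None, start=0, rows=10):
    """Build a Solr-style query string (field names are illustrative)."""
    # Free-text search goes into q; with no text, match every document.
    params = [("q", text if text else "*:*")]
    for field, value in (filters or {}).items():
        # Metadata-based search: one filter query (fq) per field.
        params.append(("fq", f'{field}:"{value}"'))
    # Paging window over the result list.
    params += [("start", start), ("rows", rows)]
    return urlencode(params)


# a) free-text search
free_text = solr_params(text="bird migration")

# b) metadata-based search on a hypothetical "motivation" field
metadata = solr_params(filters={"motivation": "tagging"}, rows=20)

print(parse_qsl(free_text))
print(parse_qsl(metadata))
```
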

Br, Sergiu

gsergiu commented 8 years ago

@iherman Some feedback from my experience with (metadata) search engines:

  1. Search APIs typically return a preview (i.e. a view in the model-view-controller design) of the annotations and not the full objects. Some solutions allow the user to select which properties are returned (e.g. SQL fashion), others use predefined profiles (e.g. minimal/standard/full). The total number of results must be present in each response, and also the "pagination" information (e.g. Solr uses the start and rows parameters).
  2. I think that all properties of the annotations must be searchable/indexed, and searched according to their data types (e.g. creation date, creator, tag labels, even the generator).
  3. Regarding the definition of the "q" parameter: this is actually a string serialization of a query, which might be a simple text query or a kind of metadata-based query (in which case the query must often be URL-encoded). See also the Lucene query syntax: http://wiki.apache.org/solr/SolrQuerySyntax
  4. Regarding the concern about multiple bodies in the search response, I think this can be solved through the use of search profiles, as mentioned in 1.
  5. About searching in the target: yes, implementing search by target is a must. Given 3. above, the q parameter is the serialization of the full query, while target is only one property of the query. In JSON we could have a query like this: "q" : { "creator" : "myself", "target" : "http://mywebsite.com/myobject" } Submitting queries in JSON format with the POST method would be quite good for developers. Still, in many cases GET is preferred, hence the ugly string serializations like the Amazon example in the previous post.
  6. I think that the specification of the search API is bound to the API technology. If the API is REST + JSON-LD, the input should be JSON(-LD). If the API is RDF/XML based, it might use XQuery or something like that (which can be translated internally into a SPARQL query if needed). The query is the input of the search, as the annotation is the input of the create annotation method. I think that the two APIs should be consistent.
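
To illustrate point 5, here is a small sketch of the same JSON query sent either as a POST body or as a URL-encoded q parameter on GET. The query object is the one from point 5; the helper names as_post_body and as_get_query, and the reuse of Solr's start and rows for paging, are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode, parse_qs

# The example query from point 5 above.
query = {"creator": "myself", "target": "http://mywebsite.com/myobject"}


def as_post_body(query):
    """Serialize the query as a JSON document for a POST request."""
    return json.dumps({"q": query}, sort_keys=True)


def as_get_query(query, start=0, rows=10):
    """Serialize the same query as a URL-encoded string for a GET request."""
    # Compact JSON keeps the encoded string as short as possible.
    compact = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return urlencode({"q": compact, "start": start, "rows": rows})


post_body = as_post_body(query)
get_query = as_get_query(query)

print(post_body)
print(get_query)

# Both forms decode back to the same query object.
round_trip = json.loads(parse_qs(get_query)["q"][0])
```

The POST body stays readable, while the GET form shows why string serializations get ugly: every brace, quote, colon and slash in the JSON has to be percent-encoded.
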

BR,

Sergiu