spaziocodice / SolRDF

An RDF plugin for Solr
Apache License 2.0

Hybrid query returning incorrect "numFound" #108

Open chetrebman opened 8 years ago

chetrebman commented 8 years ago

THE DATA

http://coalliance.org/id/1 http://coalliance.org/siteCode/site_A "MK_1".
http://coalliance.org/id/2 http://coalliance.org/siteCode/site_A "MK_2".
http://coalliance.org/id/3 http://coalliance.org/siteCode/site_A "MK_3".

http://coalliance.org/id/6 http://coalliance.org/siteCode/site_B "MK_1".
http://coalliance.org/id/4 http://coalliance.org/siteCode/site_B "MK_4".
http://coalliance.org/id/5 http://coalliance.org/siteCode/site_B "MK_5".

THE QUERY

SELECT ?o WHERE { ?s http://coalliance.org/siteCode/site_A ?o. ?s2 http://coalliance.org/siteCode/site_B ?o }

THE SOLR RESPONSE

<result name="response" numFound="4" start="0" maxScore="1.0">
  <head>
    <variable name="o"/>
  </head>
  <results>
    <result>
      <binding name="o">
        <literal>MK_1</literal>
      </binding>
    </result>
  </results>

chetrebman commented 8 years ago

Here is another query with possibly the same issue:

SELECT ?object WHERE { ?subject http://coalliance.org/siteCode/site_A ?object. MINUS {?subject2 http://coalliance.org/siteCode/site_B ?object} }

MK_2 MK_3

agazzarini commented 8 years ago

Hi @chetrebman, sorry for the long absence. I can confirm that this is definitely a bug, and the bad news is that fixing it is not a trivial task. It is closely related to issue #96; in other words, it has to do with a Solr-optimized implementation of the SPARQL plan execution.

Technically: as you described, the numFound attribute reports a wrong number because it is computed as a union of all the docsets produced during query evaluation. The evaluation consists of several steps, which are translated into several Solr queries. The current implementation provides only the primitives for working with triples (i.e. add, remove and query). While a SPARQL query is being processed, each time an underlying Solr query is executed (as a result of executing some part of the algebra plan), the resulting docset (the set of matching document identifiers) is collected by adding it to the previous docset. This kind of "collection" operation (i.e. the union) is not valid in general: sometimes the incoming docset should replace the previous one, sometimes an intersection is needed, and sometimes a union is the right thing to do. Unfortunately the current implementation cannot know which "collection" operation is needed, so it (wrongly) always executes a union.
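To make the union effect concrete, here is a minimal Python sketch. It is not SolRDF code: it is a toy in-memory model in which each triple is one "document", `search` is a hypothetical stand-in for a single underlying Solr query, and the join is assumed to be evaluated as one lookup per left-hand binding. Under those assumptions, accumulating every docset by union gives 4 for the data in this issue, while the query has a single solution.

```python
# Toy model, NOT SolRDF internals: each triple is one "document".
# Subject/predicate names are shortened from the URIs in the issue.
TRIPLES = [
    ("id/1", "site_A", "MK_1"),
    ("id/2", "site_A", "MK_2"),
    ("id/3", "site_A", "MK_3"),
    ("id/6", "site_B", "MK_1"),
    ("id/4", "site_B", "MK_4"),
    ("id/5", "site_B", "MK_5"),
]


def search(predicate, obj=None):
    """Hypothetical stand-in for one underlying query.

    Returns the docset: the set of ids (list positions) of matching triples.
    """
    return {
        doc_id
        for doc_id, (_, p, o) in enumerate(TRIPLES)
        if p == predicate and (obj is None or o == obj)
    }


def run_join():
    """Evaluate { ?s site_A ?o . ?s2 site_B ?o } with one lookup per binding."""
    collected = set()   # docsets accumulated by union, as described above
    solutions = []      # the actual query solutions (bindings of ?o)

    left = search("site_A")
    collected |= left   # union with the first docset

    for doc_id in sorted(left):
        o = TRIPLES[doc_id][2]
        right = search("site_B", o)
        collected |= right                   # always a union, even for a join
        solutions.extend(o for _ in right)   # one solution per matching pair

    return len(collected), solutions


num_found, solutions = run_join()
print(num_found, solutions)  # 4 ['MK_1']
```

The three site_A documents plus the one site_B document that matches "MK_1" add up to 4, which is the count of documents touched and not the count of solutions.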

In other words, the number "4" in your example means that, in order to execute that SPARQL query, the processor executed n queries and worked with a total of 4 matching documents. So while this is a correct measure, it is not useful: the number you (and I) would like to see is the total count of query solutions.

I'm still fighting with issue #96 and thinking about how to resolve this.

Andrea

agazzarini commented 8 years ago

On the same topic, I'm pasting an exchange with another user.

"Hi, How to add Facet option in curl query, with example of bsbm-generated-dataset.nt, I tried curl "http://127.0.0.1:8080/solr/store/sparql" --data-urlencode "q=SELECT ?product ?label WHERE { ?product ?p ?label.} ORDER BY ?label LIMIT 10 &facet=true&facet.field=product" -H "Accept: application/sparql-results+json" But it did not work."

"A first premise: as you can read here [2], what I called "Hybrid mode" has been temporarily disabled in the current version of SolRDF, so everything below relates to SolRDF 1.0 (which is in a dedicated branch and runs on top of Solr 4.x).

A second premise: it's not your case (read below), but faceting is not working in SolRDF, mainly because of the issue related to the SPARQL algebra (I don't remember the exact number).

Having said that, "it's not your case" because your SPARQL query contains just one triple pattern, and in this case (only) you can get some facets back from SolRDF 1.0. However, looking at your example, things will not work as you might expect: you don't have a "product" field, just s(ubject), p(redicate) and o(bject) fields, so for plain field faceting you can only use one of those.

I suggest you have a look at my first post about SPOC faceting [1] and at the SolRDF Wiki [2] as well. There, especially in the wiki, you can find the several kinds of faceting that "should" be available and how they "should" work, with examples and command lines. "Should" means, remember, that at the moment things work only if you have one simple triple pattern in your SPARQL."
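As a rough illustration of the advice above, the Python sketch below builds the request parameters with faceting on one of the existing fields (`p` here). The endpoint URL and the query are taken from the question; `facet` and `facet.field` are the standard Solr faceting parameters; the helper name `build_facet_params` is made up for this example. It also passes the facet options as separate parameters: in the curl command from the question they sit inside the `--data-urlencode "q=..."` value, so they are encoded as part of the SPARQL text. Whether facets actually come back still depends on the SolRDF version and on the single-triple-pattern limitation described above, so treat this as a sketch and not as a verified recipe.

```python
from urllib.parse import urlencode, parse_qs

# Endpoint as given in the question above.
ENDPOINT = "http://127.0.0.1:8080/solr/store/sparql"

SPARQL = (
    "SELECT ?product ?label "
    "WHERE { ?product ?p ?label . } "
    "ORDER BY ?label LIMIT 10"
)


def build_facet_params(sparql, facet_field):
    """Hypothetical helper: encode q and the facet options as separate parameters."""
    return urlencode(
        {
            "q": sparql,
            "facet": "true",
            "facet.field": facet_field,  # must be one of s, p or o
        }
    )


body = build_facet_params(SPARQL, "p")
print(ENDPOINT + "?" + body)

# Decoding shows that the facet options are real parameters, not part of q.
decoded = parse_qs(body)
print(decoded["facet.field"])
```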