openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

COUNT with rdfs:subClassOf* statement may be incorrect #762

Open AndreaWesterinen opened 6 years ago

AndreaWesterinen commented 6 years ago

Processing a query with a SELECT statement that includes COUNT(?s) or COUNT(*) and a sub-class check (for the counted variable) may return a count much larger than the number of matching individuals.

For example, compare the results of the following query:

PREFIX event: <http://example.com/ontology/odps/Event#>
SELECT (COUNT(?s) AS ?count) ?type
WHERE {
      ?s a ?type . ?type rdfs:subClassOf* event:Event
} GROUP BY ?type

with those obtained when the SELECT clause is changed to SELECT (COUNT(DISTINCT ?s) AS ?count) ?type. Most COUNT values are consistent across the two queries, except when the type of the individual has multiple super-classes.

In our data set, the type, event:Wedding, has two super-classes - event:Ceremony and event:PersonalLifeEvent. For this type, the first query returns a count of 136, whereas the second query returns 68 (a difference of a factor of 2). Doing a manual count for verification purposes, there were indeed 68 Wedding individuals in the test data store.

It appears that an individual is counted once for each path "down" from event:Event. Since there are two paths, there are 2x the number of individuals. If there are more than 2 super-classes, then the multiplier will be larger.

It seems that the query results are not flattened to only count an individual once. Changing the statement to COUNT(distinct ?s) or COUNT(distinct *) forces that flattening.
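The multiplier described above can be sketched outside the database. The following is a minimal Python model, not Virtuoso's implementation: it assumes that event:Ceremony and event:PersonalLifeEvent are each direct subclasses of event:Event (the report only states that event:Wedding has these two super-classes), and it counts the subClassOf chains from a class up to event:Event.

```python
from functools import lru_cache

# Assumed hierarchy: class -> direct superclasses. Only the two superclasses
# of Wedding come from the report; their link to Event is an assumption.
SUPERCLASSES = {
    "Wedding": ["Ceremony", "PersonalLifeEvent"],
    "Ceremony": ["Event"],
    "PersonalLifeEvent": ["Event"],
    "Event": [],
}


@lru_cache(maxsize=None)
def count_paths(cls: str, root: str = "Event") -> int:
    """Number of distinct subClassOf chains leading from cls up to root."""
    if cls == root:
        return 1  # reached the root: exactly one (empty) remaining chain
    return sum(count_paths(parent, root) for parent in SUPERCLASSES[cls])


individuals = 68  # Wedding individuals reported in the issue
multiplier = count_paths("Wedding")
print(multiplier, individuals * multiplier)
```

Under these assumptions the multiplier for Wedding is 2, which matches the reported 136 versus 68. A class with three chains up to event:Event would give a multiplier of 3 in this model.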

TallTed commented 6 years ago

If I understand your writeup correctly, then these results are as expected, and DISTINCT is the correct way to address this, for the data as you've described it.

The COUNT() function counts the rows in your result set, not the distinct individuals in it.

If you query without the COUNT(), i.e. —

SELECT ?s ?type

— I believe you'll find 136 rows with event:Wedding, while if you query without the COUNT() and with the DISTINCT, i.e. —

SELECT DISTINCT ?s ?type

— I believe you'll find 68 rows with event:Wedding.
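The row-level view can be modeled in a few lines of Python. This is only a sketch of the behavior described here, with hypothetical individual names and the assumption that each individual appears once per superclass chain; it is not output from Virtuoso.

```python
from collections import Counter

# Hypothetical data: 68 wedding individuals and the two superclass chains
# through which a per-path evaluation is assumed to reach event:Event.
weddings = [f"wedding{i}" for i in range(68)]
paths_to_event = ["via Ceremony", "via PersonalLifeEvent"]

# One result row per (individual, path). The path is not projected,
# so the projected (?s, ?type) rows repeat.
rows = [(s, "event:Wedding") for s in weddings for _ in paths_to_event]

# SELECT (COUNT(?s) AS ?count) ?type ... GROUP BY ?type
count_all = Counter(t for _, t in rows)
# SELECT (COUNT(DISTINCT ?s) AS ?count) ?type ... GROUP BY ?type
count_distinct = Counter(t for _, t in set(rows))

print(count_all["event:Wedding"], count_distinct["event:Wedding"])
```

In this model the plain count is 136 and the distinct count is 68, because deduplicating the projected rows removes the per-path repetition.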

You might also consider tweaking your ontology, as class inheritance is usually one super-class to multiple sub-classes, and I think you're likely to run into other unexpected issues with the inverted tree you've described. Here, I wonder whether event:Wedding might be a subclass of event:Ceremony while event:Marriage might be a better subclass of event:PersonalLifeEvent? Alternatively, event:Ceremony might be a subclass of event:PersonalLifeEvent -- and event:Wedding thus only directly a subclass of event:Ceremony ... There are other possibilities.

AndreaWesterinen commented 6 years ago

How does that make sense? There are 68 individuals of type event:Wedding.

If I do the query without the subClassOf* event:Event statement, there are 68 individuals.

Only in very restricted ontologies should an individual be limited to a single superclass. If Virtuoso is designed with that assumption, then this is a serious limitation.

TallTed commented 6 years ago

Entities are not restricted to a single class or superclass, neither by Virtuoso nor in general. I did not intend to imply such.

The reason for the apparent multiplication is that property paths are syntactic sugar, and make complex things appear simple.

Your query is asking for every ?s ?type pair for which a chain { ?type rdfs:subClassOf ?type1 . ?type1 rdfs:subClassOf ?type2 . ... ?typeN rdfs:subClassOf event:Event } exists, where the chain may contain any number of intermediate classes, including none. Each different path (that is, each path where there is any difference in any of the ?typeN in the chain) is a different solution. DISTINCT is the appropriate way to do the "flattening", as you put it, when multiple paths produce the same ?s ?type pair.

Why do you think this behavior is incorrect?

TallTed commented 6 years ago

@AndreaWesterinen - Did my latest update answer your remaining questions/concerns on this matter? Please let us know, either way, so we can take appropriate next steps.

AndreaWesterinen commented 6 years ago

@TallTed The definition of COUNT in SPARQL 1.1 is:

xsd:integer Count(multiset M)
N = Flatten(M)
remove error elements from N
Count(M) = card[N]

The cardinality of my ?s is 68. And, I have tried this query against other data stores (such as Stardog) and they all return 68. I still believe that this is a bug.

Andrea
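The Count definition quoted above can be written out as a short Python sketch. It models a multiset as a Python list and an error element as an instance of a hypothetical EvalError class. It assumes a reading of the SPARQL 1.1 text in which Flatten collapses a multiset of lists into one multiset and does not remove duplicates; that reading should be checked against the specification.

```python
from typing import Any, List


class EvalError:
    """Stand-in for a SPARQL expression error (hypothetical name)."""


def flatten(m: List[List[Any]]) -> List[Any]:
    # Collapse a multiset of lists into one multiset; duplicates are kept.
    return [x for lst in m for x in lst]


def count(m: List[List[Any]]) -> int:
    n = flatten(m)                                      # N = Flatten(M)
    n = [x for x in n if not isinstance(x, EvalError)]  # remove error elements
    return len(n)                                       # Count(M) = card[N]


def count_distinct(m: List[List[Any]]) -> int:
    # DISTINCT is modeled as deduplicating the lists before counting.
    unique = [list(t) for t in dict.fromkeys(tuple(lst) for lst in m)]
    return count(unique)


# Hypothetical input: two individuals, each contributed twice.
m = [["w1"], ["w1"], ["w2"], ["w2"]]
print(count(m), count_distinct(m))
```

Under this reading the aggregate itself returns whatever cardinality the input multiset has, so the sketch does not settle the disagreement: the result depends on how many solutions the rdfs:subClassOf* pattern contributes per individual before COUNT is applied.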