Open AndreaWesterinen opened 6 years ago
If I understand your writeup correctly, then these results are as expected, and DISTINCT is the correct way to address this, for the data as you've described it.
The COUNT() function evaluates the rows in your result set.
If you query without the COUNT(), i.e., SELECT ?s ?type, I believe you'll find 136 rows with event:Wedding, while if you query without the COUNT() and with the DISTINCT, i.e., SELECT DISTINCT ?s ?type, I believe you'll find 68 rows with event:Wedding.
You might also consider tweaking your ontology, as class inheritance is usually one super-class to multiple sub-classes, and I think you're likely to run into other unexpected issues with the inverted tree you've described. Here, I wonder whether event:Wedding might be a subclass of event:Ceremony while event:Marriage might be a better subclass of event:PersonalLifeEvent? Alternatively, event:Ceremony might be a subclass of event:PersonalLifeEvent, and event:Wedding thus only directly a subclass of event:Ceremony... There are other possibilities.
How does that make sense? There are 68 individuals of type event:Wedding.
If I do the query without the subClassOf* event:Event statement, there are 68 individuals.
Only in very restricted ontologies should an individual be limited to a single superclass. If Virtuoso is designed with that assumption, then this is a serious limitation.
Entities are not restricted to a single class or superclass, neither by Virtuoso nor in general. I did not intend to imply otherwise.
The reason for the apparent multiplication is that property paths are syntactic sugar, and make complex things appear simple.
Your query is asking for every ?s ?type pair where there is a { ?type rdfs:subClassOf [ ?type1 . ?type1 rdfs:subClassOf [ ?type2 . ?type2 rdfs:subClassOf [ ... ?typeN . ?typeN rdfs:subClassOf ] ] ] event:Event } property path. Each different path (that is, each path where there is any difference in any of the ?typeN in the chain) is a different solution. DISTINCT is the appropriate way to do the "flattening", as you put it, when multiple paths produce the same ?s ?type pair.
Why do you think this behavior is incorrect?
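To make the per-path reading above concrete, here is a minimal Python sketch of it. It is only a toy model of the counting described in this comment, not Virtuoso's implementation and not a statement of what the SPARQL specification requires. The hierarchy mirrors the one described in this thread; the `ex:weddingN` identifiers and the helper name `paths_to` are hypothetical.

```python
# Hypothetical toy hierarchy mirroring the classes named in this thread.
SUBCLASS_OF = {
    "event:Wedding": ["event:Ceremony", "event:PersonalLifeEvent"],
    "event:Ceremony": ["event:Event"],
    "event:PersonalLifeEvent": ["event:Event"],
    "event:Event": [],
}


def paths_to(start, target):
    """Return every superclass chain leading from start up to target."""
    if start == target:
        return [[start]]
    found = []
    for parent in SUBCLASS_OF.get(start, []):
        for tail in paths_to(parent, target):
            found.append([start] + tail)
    return found


# 68 made-up individuals of type event:Wedding (identifiers are assumptions).
weddings = [f"ex:wedding{i}" for i in range(68)]

paths = paths_to("event:Wedding", "event:Event")

# One (?s, ?type) row per individual per path: the per-path reading.
per_path_rows = [(s, "event:Wedding") for s in weddings for _ in paths]

# Collapsing identical rows models what DISTINCT does.
distinct_rows = set(per_path_rows)

print(len(paths), len(per_path_rows), len(distinct_rows))
```

Under these assumptions there are two chains from event:Wedding up to event:Event, so the per-path reading yields twice as many rows as there are individuals, and deduplicating brings the number back to one row per individual.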
@AndreaWesterinen - Did my latest update answer your remaining questions/concerns on this matter? Please let us know, either way, so we can take appropriate next steps.
@TallTed The definition of COUNT in SPARQL 1.1 is:
xsd:integer Count(multiset M)
N = Flatten(M)
remove error elements from N
Count(M) = card[N]
The cardinality of my ?s is 68. And, I have tried this query against other data stores (such as Stardog) and they all return 68. I still believe that this is a bug.
Andrea
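As a rough illustration of the COUNT definition quoted above, the following Python sketch models Flatten and card over a multiset represented as a list of lists. The function names and the `ERROR` sentinel are hypothetical stand-ins, not part of any SPARQL library. The sketch only shows that COUNT returns the size of whatever multiset it is given, so the 136 versus 68 disagreement comes down to whether the group contains one row per path or one row per individual. It does not settle which of those the engine should produce.

```python
ERROR = object()  # hypothetical sentinel standing in for a SPARQL error value


def flatten(multiset_of_lists):
    """Collapse a multiset of lists into one multiset, keeping duplicates."""
    return [item for row in multiset_of_lists for item in row]


def count(multiset_of_lists):
    """Count(M): flatten M, drop error elements, return the cardinality."""
    n = [item for item in flatten(multiset_of_lists) if item is not ERROR]
    return len(n)


# 68 made-up subjects (identifiers are assumptions for illustration).
subjects = [f"ex:wedding{i}" for i in range(68)]

# Group contents if each individual appears once.
one_row_per_individual = [[s] for s in subjects]

# Group contents if each individual appears once per superclass path (2 here).
one_row_per_path = [[s] for s in subjects for _ in range(2)]

count_individuals = count(one_row_per_individual)
count_paths = count(one_row_per_path)

# Error elements are removed before counting; duplicates are not.
count_with_error = count([["a"], [ERROR], ["a"]])

print(count_individuals, count_paths, count_with_error)
```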
Processing a query with a SELECT statement that includes COUNT(?s) or COUNT(*) and a sub-class check (for the counted variable) may return a much larger count value than exists.
For example, comparing results when the following query is executed ...
versus when the SELECT statement is changed to SELECT (COUNT(distinct ?s) as ?count) ?type ...
results in most COUNT values being consistent across the two queries, except when the type of the individual has multiple super-classes.
In our data set, the type event:Wedding has two super-classes, event:Ceremony and event:PersonalLifeEvent. For this type, the first query returns a count of 136, whereas the second query returns 68 (a difference of a factor of 2). A manual count for verification purposes confirmed that there were indeed 68 Wedding individuals in the test data store.
It appears that an individual is counted once for each path "down" from event:Event. Since there are two paths, there are 2x the number of individuals. If there are more than 2 super-classes, then the multiplier will be larger.
It seems that the query results are not flattened to only count an individual once. Changing the statement to COUNT(distinct ?s) or COUNT(distinct *) forces that flattening.