Open kasei opened 5 years ago
As an aside, I'll note that the examples in the quoted text from §15.1 use invalid syntax for the language tags. That should probably be fixed, too.
(The invalid language tags are already mentioned on our errata page.)
Why not have the spec define a simple total ordering (e.g., alphabetically by language tag, and then by Unicode codepoint within each language tag), and state that processors MAY deviate from this ordering where doing so improves i18n?
Dydra provides a collation-based sorting. It uses language tags only, no extensions required. @lisp has probably much more to say about this.
. . . and state that processors MAY deviate from this ordering where doing so improves i18n?
How would such a deviation improved i18n? Can you give an example? If we can remove functional variability between implementations I would much prefer it.
@dbooth-boston Well, just saying “i18n” is a bit narrow actually. I should have said “MAY deviate where it better matches user expectations” or something like that. Examples include:
@en
together with @en-uk
Some processors want to go all in here and support language-specific collations and so on, but for other implementers this would be a very heavy burden. Hence my proposal of simple sort-by-language-tag followed by sort-by-lexical-form-codepoints, while allowing processors to do something fancier if they want.
Perhaps the collating order should be controlled by locale environment variables, just as it is for other standard tools such as sort.
Re. collations, I think as much as possible should be reused from XPath and XSLT: Sorting Using Collations
Why not have the spec define a simple total ordering (e.g., alphabetically by language tag, and then by Unicode codepoint within each language tag), and state that processors MAY deviate from this ordering where doing so improves i18n?
My concern here is twofold:
Perhaps the collating order should be controlled by locale environment variables, just as it is for other standard tools such as sort.
I'm primarily concerned here with making sure that paging works predictably. Adding requirements to support collation would, as @cygri notes, be a significant burden on some implementations.
@kasei
I think using MAY language would lead to a situation where users might unintentionally rely on the widely-supported codepoint ordering, and then be very surprised when they run across an implementation that does it another way. In the worst case, that surprise might take the form of things breaking.
Another possibility would be to strictly mandate codepoint ordering. Implementations that want to offer collation support can do so by using a pseudo-function:
ORDER BY ex:collation("en_EN", ?name)
(ex:collation
would not be a real function, but a marker that requests a particular collation by mildly abusing function syntax. One could think of it as being similar to DESC(?x)
which looks like a function but isn't.)
@namedgraph , it looks like W3C has only defined two collation URIs: codepoint and html-ascii-case-insensitive. And the latter does not define a sorting order -- it only defines case-insensitive equality: ". . . any use of this collation for sorting strings produces implementation-defined results." Are there more standard XPath collation URIs defined elsewhere? If so, where and what?
Regarding codepoint ordering versus full collation support, I do not have a strong preference either way, as long as there is a reasonably smooth future path for collation support, since that is probably where we will eventually want to go, in order to be consistent with other standard tools such as sort.
SIDE COMMENT: Personally, I have always hated the fact that sort by default uses the locale instead of codepoint, because it means that unless I am careful to remember to explicitly set the locale in my scripts, the sort results are unpredictable from one installation to another. Grrr.
Regarding codepoint ordering versus full collation support, I do not have a strong preference either way, as long as there is a reasonably smooth future path for collation support, since that is probably where we will eventually want to go, in order to be consistent with other standard tools such as sort.
This seems sensible. I personally think attempting to address full collation is too much for 1.2 (and possibly too much of an implementation burden in general), but that we shouldn't preclude its possible inclusion in the future.
I also want to be sensitive to the fact that as an English speaker I might not be fully aware of all the ways in which enforcing codepoint ordering might be a problem for other people with other datasets (e.g. CJK data).
doing this half-way makes little sense. that is, either
or the retitle the issue to leave out "usability" and "predictability".
This seems sensible. I personally think attempting to address full collation is too much for 1.2 (and possibly too much of an implementation burden in general), but that we shouldn't preclude its possible inclusion in the future.
we use a very thin wrapper over ucol_strcollUTF8. if a recommendation were to require wider support, that would entail more work. for that limited support, if one can use that library and tolerates that dependency, the test itself is about 20 lines of code more than treating the strings as incomparable.
or the retitle the issue to leave out "usability" and "predictability".
@lisp The primary issue I'm trying to improve here is paging, and that is currently impossible to use predictably. I believe that a small change to the language of the spec could make a big usability improvement here, by allowing users to use paging on data whose sorting is currently undefined (e.g. all language-tagged literals).
We could certainly talk about adding full collation, and that would subsume this issue, but that was not the point of this issue.
we use a very thin wrapper over ucol_strcollUTF8 ... for that limited support, if one can use that library and tolerates that dependency, the test itself is about 20 lines of code more than treating the strings as incomparable.
The implementation experience is good to know about, but ICU is a significant dependency and may not be appropriate for all implementations.
@lisp The primary issue I'm trying to improve here is paging, and that is currently impossible to use predictably. I believe that a small change to the language of the spec could make a big usability improvement here, by allowing users to use paging on data whose sorting is currently undefined (e.g. all language-tagged literals).
i would resist ratifying a limited change. were it the recommendation, an accurate sort would then either be nonconformant, which i would not recognize as an advantage, or require declaration/configuration, which would be source of its own problems.
@lisp as described, wouldn't your implementation already conform with the proposed language?
not as i understand https://www.w3.org/TR/xpath-functions-31/#codepoint-collation .
@lisp The proposal here is to add text about providing a consistent ordering, not necessarily any particular ordering. As described, I think your implementation already does that.
Codepoint collation was mentioned as a possible alternative that might address the same use case, but at a higher cost (both implementation cost and by running the risk of working against implementations that already use some other collation).
What would be a sensible order when numbers and dates are mixed with strings? Graphdb Workbench has a feature "click on col header to sort" and I think we put numbers before dates before strings or some such
What would be a sensible order when numbers and dates are mixed with strings?
as long as it is consistent across implementations, is one order better than another. in any case, it should cover all types which are eventually added via #93.
about providing a consistent ordering, not necessarily any particular ordering
if consistent is to be across implementations then the string element matters.
if consistent is to be across implementations then the string element matters.
My concern isn't consistency across implementations, but about consistency across requests to the same implementation.
by that criteria, the implementation could sort according to internal term identifiers without regard to the lexical forms.
by that criteria, the implementation could sort according to internal term identifiers without regard to the lexical forms.
So long as that sorting also respected the partial ordering specified by SPARQL, yes, I'd be fine with that. In reality, it's nearly impossible to assign internal IDs so that they also respect SPARQL ordering, but you can get part-way there by partitioning different term types using high bits of the ID integer.
the partial order specified by sparql follows from type only. by that standard, criteria which completely ignore the nature of the respective value domain should be a reasonable approach for sparql?
the partial order specified by sparql follows from type only. by that standard, criteria which completely ignore the nature of the respective value domain should be a reasonable approach for sparql?
No, because the value space of numeric terms has to be respected.
that is "the value space of numeric terms has to be respected", but for string terms the order with respect to other types should follow the partial order in 15.1, while that within their value domain could be according to any stable surrogate values?
If I understand you correctly, then yes, that sounds right. 15.1 must be respected which includes the partial order of term type and then some specific datatype ordering such as numerics and xsd:string (anything in the operator mapping table). Beyond that, internal identifiers could be used to provide a fallback comparable value for cases where 15.1 is silent.
I didn't read the above very carefully. Could someone confirm or deny that given a variety of numbers, they'll all sort in a global order? eg
"1"^^xsd:integer
"1"^^xsd:int
"1.1"^^xsd:decimal
"1.11"^^xsd:float
"2"^^xsd:int
if your question is whether the '<' operator defines the relation between
"1"^^xsd:integer "1"^^xsd:int
i do not understand 17.3 to define their order.
@VladimirAlexiev That is a higher hurdle than predictability/"consistent order" from a specific endpoint.
The motivation of paging only needs the same ordering at a given endpoint, not a specific ordering nor the same ordering everywhere.
Paging maybe better done other, more complicated ways such as being part of the protocol, e.g. #49 (which notes the inefficiency of repeat requests with moving slice) or #84 (a chance to design a protocol for specifically for SPARQL).
Consistent ordering, either with or without streaming, provides the ability to page results.
So the spec issue could be as little as "Implementations SHOULD order RDF terms into the same order for all requests".
FWIW: Jena sorts into a consistent order - by value (spec required), then by term kind: "Blank nodes < URIs < Literals" and then by RDF term detail (literals by lexical form and then datatype then language tag). It gets quite arbitrary but it is the same each time.
There is no single "right" answer for all use cases once you get passed values. (e.g. a single fixed order vs collation by language can't have both at the same time).
if the sentiment is to refocus this topic to limit it to the "paging" use case, then please rewrite the description to eliminate, for example, the issue, that
difficult to portably work with any RDF data that heavily uses language-tagged literals
if the issue is limited to reproducible paging, it may not even be the case that any form of sorting is involved.
if the sentiment is to refocus this topic to limit it to the "paging" use case, then please rewrite the description to eliminate, for example, the issue, that
difficult to portably work with any RDF data that heavily uses language-tagged literals
@lisp – I included the example of language-tagged literals because it perfectly illustrates the problem with paging. If your data is not comparable w.r.t. SPARQL (as in the case of all language-tagged literals), then paging cannot be used reliably, because there are simply no guarantees that the order will be consistent.
where incomparable terms lead to order variations between repeated, identical query expressions, incomparability alone is not the cause. will the result not be consistent, if the respective combination operators are deterministic and any sort is stable?
where incomparable terms lead to order variations between repeated, identical query expressions, incomparability alone is not the cause. will the result not be consistent, if the respective combination operators are deterministic and any sort is stable?
I'm afraid I don't understand this. Can you expand or rephrase?
the issue cites
challenges for accessing data predictably (e.g. when paging results with LIMIT+OFFSET)
and describes that the incomparability of language-tagged literals "means that it is difficult to portably work with any RDF data" which uses them.
i would expect that, if the initial statement pattern match/scan operators and all combination operators are deterministic, then an unsorted query should be repeatable. if any sort operators are stable, incomparable values should not move in results and the sorted query should also be repeatable. in other words, comparability is not itself the issue.
i suggest that, if something is to be changed about sorting, it should be to make the result more useful, which includes at least to define the relation among all possible string values, if not to provide the kind of complete ordering described wrt jena, above.
i would expect that, if the initial statement pattern match/scan operators and all combination operators are deterministic, then an unsorted query should be repeatable.
Operators are not always going to be deterministic in the order they produce results. For example, changes to indexes and randomized in-memory structures are two reasons that query results might be produced in different orders.
if any sort operators are stable, incomparable values should not move in results and the sorted query should also be repeatable. in other words, comparability is not itself the issue.
SPARQL doesn't say anything about sort stability being required.
i suggest that, if something is to be changed about sorting, it should be to make the result more useful, which includes at least to define the relation among all possible string values, if not to provide the kind of complete ordering described wrt jena, above.
If by "all possible string values," you are including language-tagged strings, I would think that is too large and fraught a task for this CG. This is an issue which will be implementation/deployment specific. However, it could be proposed as a separate issue (which if accepted would subsume this one) so that it could be discussed on its own merits.
Operators are not always going to be deterministic in the order they produce results. For example, changes to indexes and randomized in-memory structures are two reasons that query results might be produced in different orders.
yes. by which assessment you are proposing to change one operator to correct an issue with others.
The original issue was written about queries with ORDER BY
clauses, which may unexpectedly produce incompletely and inconsistently ordered solution sets, even from the same instance due to the incompletely specified sorting in the original spec. Improving on this seems worthy.
That said, this is one of the areas which has long been problematic in the SQL RDBMS world, both across engines and across instances of the same engine, which may be set to different collations among other LOCALE
-associated settings. It will be important to consider and understand the issues that have come up there, in order to apply the lessons learned here -- instead of repeating mistakes.
That said, the last few comments here seem to be asking that the order of the solution set for a query that lacks an ORDER BY
clause should be identical every time it's run on the same instance, whether within seconds, minutes, weeks, or longer.
Solution discovery and hence delivery order may change for a number of reasons, in both short and long time frames. Default solution output order is intentionally undefined, and I believe it should remain so. Forcing an ordering in the way suggested may be a VERY heavy load, and in fact, requires that a partial solution set never be delivered except after the full solution set is found. The choice to add an ORDER BY
clause and its attendant load is (or should be) an informed choice by the user, not an imposition by the SPARQL standard.
Forcing an ordering in the way suggested may be a VERY heavy load, and in fact, requires that a partial solution set never be delivered except after the full solution set is found.
please reiterate: which "way suggested"?
@lisp
A DBMS may discover solutions of a query in any order, which order of discovery may change between executions of the same query due to many factors including (but not limited to) --
In other words -- the discovery order is not inherently repeatable.
Because there is no viable way to force the discovery of solutions to always occur in the same order, there is no way to force the delivery of those solutions to always occur in the same order -- except by declaring that every query processor must always apply an ORDER BY
-- even if that ORDER BY
is internally defined by the processor, even on a per-query basis -- before delivering the solutions -- which by necessity means before delivering any solutions.
where "that ORDER BY [which] is internally defined by the processor" causes solutions to always occur in the same order, even where the terms which are subject to comparison between two respective solutions happen to be the same term?
@lisp - I do not understand your comment.
this issue discusses solution ordering guarantees and how to achieve them. the initial description reports deficiencies in the current specification for ORDER BY, in particular with respect to rdf:langString terms, and proposes that minimal changes would suffice to permit paging. a later comment suggests that, as it is impossible to "force the delivery of those solutions to always occur in the same order", in order to provide a guarantee, "every query processor must always apply an ORDER BY".
i called into question the value of a change which improves the partial order, but does not make it complete. i also question whether a complete order would resolve the issue.
consider in the following queries, what ordering guarantees are expected, why, and what role does sort play in them?
select ?string ?number
where {
values (?string ?number) {("a" 1) ("a" 2)}
} order by ?string
https://dydra.com/james/foaf/@query#ex-with-order (the sort is not stable)
select ?string ?number
where {
values (?string ?number) {("a" 1) ("a" 2)}
}
https://dydra.com/james/foaf/@query#ex-without-order (delivery should repeat)
whereby, a VALUES clause is not the only source for which a repeatable discovery order is viable.
@lisp - I don't think I understand your point.
I think your sample queries remind us that ORDER BY
only has impact to the extent that the ORDER BY
clause includes the variables in the SELECT
-- i.e., a query that SELECTs
variables that are not included in the ORDER BY
clause will not necessarily have its solutions delivered in predictable nor consistent order.
I think your queries also remind us that solution discovery order does not necessarily impact the order of solution delivery.
But how does this relate to the basic point of this issue, which remains a wish to improve (if not complete) the specification of relative ordering within ORDER BY
?
the values bound to ?string in the example are equal. the sort operators of which i am aware require a consistent precedence relation between respective values in order to produce a predictable result. under those conditions, the equal values will be treated the same as incomparable values. the effect is that, in the general case, that is given equal values, changes to the ORDER BY operator will not resolve the problem at issue.
@kasei - I wonder whether your concern might be (partially) addressed by advising users who need to care about such things that they might consider using constructs like --
ORDER BY ASC ( LANG ( ?langtaggedvar ) )
ASC ( STR ( ?langtaggedvar ) )
-- or --
ORDER BY ASC ( STR ( ?langtaggedvar ) )
ASC ( LANG ( ?langtaggedvar ) )
Clearly, this doesn't get into internationalized collation ordering ... and at first I thought perhaps that could be addressed by a new optional argument to the ASC
/DESC
, as --
ORDER BY ASC ( STR ( ?langtaggedvar ), collation_identifier )
-- and then I went looking for previous work, and oh my what a mess.
@lisp -- Are you arguing with my points, or agreeing?
@kasei's original post could have been clearer in the issue description (the Why?), by noting that it was focused only on ORDER BY
-- but I think this is made clear by the Proposed solution --
ORDER BY
over values with a (currently) undefined order SHOULD cause results to have consistent ordering
Note -- the above says nothing about values outside the ORDER BY
, on which it appears you are focused.
Why?
There are several cases where the current spec does not provide a total ordering over RDF terms, and therefore causes challenges for accessing data predictably (e.g. when paging results with LIMIT+OFFSET). SPARQL 1.1 §15.1 says, in part:
The second point here is especially interesting, as it means that it is difficult to portably work with any RDF data that heavily uses language-tagged literals.
Previous work
Many implementations already seem to produce a consistent ordering over data for which SPARQL ordering is undefined.
Proposed solution
I believe that the SPARQL spec should add text stating that
ORDER BY
over values with a (currently) undefined order SHOULD cause results to have consistent ordering, even if that order is not explicitly defined by SPARQL. This will allow clients to useLIMIT
/OFFSET
paging over such data. This might also be paired with a Service Description Feature indicating support for such consistent sorting.Possible (partial) alternatives include:
fn:compare
on which SPARQL ordering depends, and they are identified by URIs)ORDER BY CONSISTENT ?name
) without requiring any particular orderingConsiderations for backward compatibility
This is a suggestion to include SHOULD normative language about ordering data in cases where currently no ordering is defined. This should not have any effect on backwards compatibility.