Open kasei opened 5 years ago
You can achieve this already with a little more syntax by using a sub SELECT
with the LIMIT
on to return the subjects, then not setting a limit on the outer query.
@RickMoynihan for specific cases that's true, but I believe window functions can more naturally express some of these use cases and definitely increases expressivity for others. I used this specific example only because it was the one used in the original LimitPerResource feature request for SPARQL 1.1.
(Updated the example to use ROW_NUMBER
instead of RANK
because this example doesn't use partition ordering. However, RANK
is an example of a function that would be hard to approximate using just SPARQL 1.1 due to its handling of ties.)
window functions can more naturally express some of these use cases and definitely increases expressivity for others
please describe such cases.
@dydra I believe the example above is a prime example where it is more naturally expressed using a window function; having to duplicate the BGP in this query simply to pull out a distinct list of posts would be syntactically much more verbose, and this is a fairly straightforward example. SPARQL 1.1 saw precedent for adding features that in part allowed simplifying existing queries that were unnecessarily complex (property paths, negation, etc.). While I think window functions' increased expressivity alone make them worthwhile to discuss for inclusion, the simplification of some types of queries would also represent a big improvement.
Also, as mentioned, computing a moving average is an example that cannot be addressed with SPARQL 1.1. Another common window function I've used in production and would love to see come to SPARQL is quantile computation.
@dydra @RickMoynihan
Apologies. In my haste to compile and submit issues today I seem to have botched the LimitPerResource example as well as my (obviously untested) query showing how to apply window functions to it. This was compounded by just skimming the original feature request page for SPARQL 1.1, which didn't really cover the real use cases as I have understood them in the intervening years.
My current understanding of the "limit per resource" use case would be this: you want to limit the number of results you get from the query for each distinct value of some set of variables. For the SIOC posts example that started this off, this would be something like "for each post, I'm interested in getting data for at most 3 authors." I hope it will be clear why this is a more interesting and complicated example. For this, window functions can be used by partitioning the results by the ?post
variable, assigning a rank within each partition ordered by author bound to a new variable ?rank
, and then filtering all results such that ?rank <= 3
.
I hope that helps to clarify how window functions could address this specific use case. I'll try to update the example in the issue text to more accurately represent this use case. As I said, though, I think their value goes well beyond this use case, and can provide some nice features that we currently don't have.
@kasei Fully support. Can you craft more examples of the uses you mentioned, eg moving averages?
@VladimirAlexiev
Again, using the strawman syntax I've implemented, here are couple of examples:
# Moving average over the past 4 quarters
SELECT ?quarter ?value (AVG(?value) OVER (ORDER BY ?quarter ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ?movingAverage) WHERE {
VALUES (?quarter ?value) {
("2017Q1" 1) # 1.0 = avg(1) ; only 1 Q of data
("2017Q2" 2) # 1.5 = avg(1, 2) ; only 2 Q of data
("2017Q3" 1) # 1.33 = avg(1, 2, 1) ; only 3 Q of data
("2017Q4" 2) # 1.5 = avg(1, 2, 1, 2)
("2018Q1" 3) # 2.0 = avg(2, 1, 2, 3)
("2018Q2" 2) # 2.0 = avg(1, 2, 3, 2)
("2018Q3" 2) # 2.25 = avg(2, 3, 2, 2)
("2018Q4" 4) # 2.75 = avg(3, 2, 2, 4)
("2019Q1" 5) # 3.25 = avg(2, 2, 4, 5)
}
}
ORDER BY ?quarter
# 3 photos from each country
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?image ?country WHERE {
?image a foaf:Image ;
dcterms:coverage [ foaf:name ?country ; dcterms:type "Country" ] ;
.
}
HAVING (ROW_NUMBER() OVER (PARTITION BY ?country) <= 3)
ORDER BY ?country
I can't guarantee it's uptime, but these can be run at this endpoint.
I was the source of LimitPerResource way back before the SPARQL 1.1 work started, and it was a part of the initial work of the WG, but didn't make it in, something I have suffered ever since.
The initial use case was that we had a database of articles, where each article could be categorized into a number of categories and have any number of authors, and it had a bunch of optional fields. All this could yield a cyclic graph with a bit of heterogeneity when we CONSTRUCT
ed results, and what we found (this was before subqueries too), was that while SPARQL 1.0 would often work over heterogeneous data, but apart from OPTIONAL
, wasn't very well adapted to it. Subqueries helped quite a lot, but I noticed when moving from industry to academia, that students didn't think of subqueries when they were faced with this problem:
you don't care about the number of solutions to query, you care about the number of "things"
So, the subquery route to achieving such a simple requirement quickly turned out to be very developer unfriendly. We had a hard time implementing paging of results as pure LIMIT
and OFFSET
couldn't not be used. This is a very common use case, and so I think we really need to fix this going forward.
Then the question is if we should fix this by some surface syntax, but then, @kasei came up with this idea of a windowing function. I find it intriguing, as I would assume it has use cases well beyond mine, event stream processing, as implemented by several extensions, would seem to benefit as well by the introduction of this algebra.
Thus, a windowing function shouldn't be considered just for my original case, even though I think that is an important one, but should be considered more generally.
one way to read this is that some properties of the execution process - such as ordinal in a given solution sequence, should be available as first-class values and that the aggregation operation should be extended with operations which use them.
@lisp Yes, that could solve some (but not all) of the cases for which window functions are useful. However, window functions go beyond the expressivity of just having access to an ordinal number. For example, this can be seen in cases where the value can represent ties (RANK
vs. ROW_NUMBER
), where the window frame is smaller than the partition group (e.g. while computing a moving average), or where the computed value uses both the ordinal of the current row and information from the entire partition (e.g. the NTILE
function). Some of these might be able to be computed with syntactically cumbersome repetition using just aggregates and existing graph operators, but not all of them, and certainly not in as concise a fashion.
ok. are there any execution properties which do not devolve to an ordinal in some solution sequence?
@lisp
(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)
Window functions can do anything they want on the rows in the window frame, so I imagine there's a lot of potential for ones that don't devolve to ordinals. Amongst the common window functions, I'd think that anything that dealt with ties would not be equivalent to ordinal-based approaches. This includes things like RANK
which might produce a sequence like (1, 1, 1, 4, 4, 6) (as opposed to ROW_NUMBER
which would just be 1, 2, 3, 4, 5, 6).
Ties can also impact the rows that comprise the window frame. Instead of defining the window frame as a range of row offsets, it can be defined as operating over groupings of rows that are deemed equivalent based on the window ordering. (Hard to explain in a quick comment here; this can be looked up as the difference between using ROWS
and RANGE
.) I've not worked with use cases that needed this functionality but I suspect that this would be incredibly difficult or impossible to do based just on row ordinals.
On 2019-04-08, at 03:55:05, Gregory Todd Williams notifications@github.com wrote:
@lisp
(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)
(neither am i, but i suppose, that may have to change. that is why i am trying to deconstruct them.)
Window functions can do anything they want on the rows in the window frame, so I imagine there's a lot of potential for ones that don't devolve to ordinals. Amongst the common window functions, I'd think that anything that dealt with ties would not be equivalent to ordinal-based approaches. This includes things like RANK which might produce a sequence like (1, 1, 1, 4, 4, 6) (as opposed to ROW_NUMBER which would just be 1, 2, 3, 4, 5, 6).
that is described (elsewhere) as though it devolves to a sequence number, just not directly.
Ties can also impact the rows that comprise the window frame.
please describe what you do with ties.
Instead of defining the window frame as a range of row offsets, it can be defined as operating over groupings of rows that are deemed equivalent based on the window ordering.
see, above, my comprehension of rank.
(Hard to explain in a quick comment here; this can be looked up as the difference between using ROWS and RANGE.) I've not worked with use cases that needed this functionality but I suspect that this would be incredibly difficult or impossible to do based just on row ordinals.
everything which i have found from sql and nosql sources describes (a) value(s) which devolve(s) to (a) position(s) in the equivalent of (a) solution sequence(s).
possible alternative could be:
@dydra
that is described (elsewhere) as though it devolves to a sequence number, just not directly.
everything which i have found from sql and nosql sources describes (a) value(s) which devolve(s) to (a) position(s) in the equivalent of (a) solution sequence(s).
Could you point me at some of these descriptions? I'm not clear how some of these can be reduced to just sequence numbers. For example, using RANK() on data with ties:
SELECT ?row (RANK() OVER (PARTITION BY ?partition ORDER BY ?value) AS ?rank) WHERE {
VALUES (?row ?partition ?value) {
(1 1 1)
(2 1 2)
(3 1 2)
(4 1 2)
(5 2 3)
(6 2 3)
}
}
[run]%20WHERE%20%7B%0A%09VALUES%20(?row%20?partition%20?value)%20%7B%0A%09%09(1%201%201)%0A%09%09(2%201%202)%0A%09%09(3%201%202)%0A%09%09(4%201%202)%0A%09%09(5%202%203)%0A%09%09(6%202%203)%0A%09%7D%0A%7D%0A)
The resulting ?rank
row ordinals with corresponding ?rank
values are:
row | rank |
---|---|
1 | 1 |
1 | 2 |
3 | 2 |
4 | 2 |
5 | 1 |
6 | 1 |
I'm not sure how that could be reduced to just the ordinals, but maybe you have some ideas.
A more complicated example (and for which I don't have an implementation) would be using window framing using RANGE
instead of ROWS
. I've found the SQLite Window Functions Introduction to be one of the most accessible descriptions of the complexity involved here.
Perhaps we can discuss specific examples? Using a strawman syntax for accessing sequence numbers, can you show or think of an approach to implementing some of the examples above without window functions? My instinct is that many queries (like moving average, or photos per country) might be possible but will quickly become syntactically cumbersome when making small changes (e.g. changing the number of photos per country from 3 to 10), to such an extent that they become either unmanageable or so complex that query runtime becomes prohibitive (or both).
Could you point me at some of these descriptions?
starting simply with a search for the difference btw rank and row number, various descriptions read very much as when the results reflect positions in some sequence. as does the sqllite description. this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.
@lisp I found that second link rather unhelpful. I couldn't reconcile the description with my understanding of RANK
or the actual results shown in the example. Did you take a look at the SQLite documentation by any chance? It's not perfect, but I found its definitions and description of peer groups to be much better than most other documentation.
note, please, above, "as does the sqllite description".
Apologies, I had missed the SQLite mention in your last response.
this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.
Perhaps I've lost track of what your concerns are. I think ordinal numbers are used in a few places in window functions and in the examples we've discussed. They are the result of ROW_NUMBER
, and are required for computing things like RANK
and DENSE_RANK
. They are also required for defining a window frame (either directly in cases like ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
, or indirectly after peer groups have been computed).
However, I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set. I don't believe this is the case, at least not in the general case and/or without tremendous complexity in the query to achieve similar semantics. Window function syntax allows concise declaration of the intended function, window, and optional framing without having to jump through hoops to do the same thing (e.g. without framing, many window queries could probably be alternatively computed by using aggregate subqueries to compute the window partitioning and adding extra joins to get the partition key into each solution. I'm not sure just how far this approach could take you in approximating larger subsets of window functions' expressivity, though.)
I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set.
that was way back at the start and has long since been supplanted by my response to
(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)
(neither am i, but i suppose, that may have to change. that is why i am trying to deconstruct them.)
in connection with which intent, i conjectured (in my response on 08.04) as to alternatives to ordinals. in addition to which list, i wonder about operators which incorporate specific alternative preceding or succeeding solutions.
Hello guys, I’ve been working on a few articles that I have published under agnos.ai on Medium.com that I would like to share with you. I think that the great thing about these articles is that they use a RDF triplestore that was migrated from a well-known sample relational database, which allows people to run SQL and SPARQL queries side by side. This can be very useful for learning SPARQL, when you already know SQL, or to validate new feature proposals for SPARQL 1.2, for example, where you can show what you want in SQL, and then validate future implementations by running SPARQL against a equivalent triplestore. https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077 Let me know what you think and any feedback is very welcome. Thank you. Marcelo.
Sounds to me like Scrollable Cursors (see ODBC, JDBC, SQL, etc.) are being reinvented...
Hard to handle in a RESTian world, but quite valuable if specified and implemented well!
SPARQL should add support for window functions. This would increase expressivity and address some existing use cases such as "limit per resource".
Why
Window functions would allow computing values that are unavailable in SPARQL 1.1 queries:
These can be used to address use cases such as limiting the result set to a specific number of results for each resource ("limit per resource"). For example, consider a query to retrieve information about web posts:
Given that a post can have any number of titles and authors, we might wish to restrict our query to only providing information about at most 3 authors for any individual post. This isn't easily done using standard SPARQL, but can be addressed using window functions.
Previous work
Proposed solution
Using a
RANK
window function, we can filter the result set of the example query above with aHAVING
clause:This will take the the result set from matching the basic graph pattern, and partition it into groups based on the value of
?post
. Within each partition, rows will be sorted by?author
, and then assigned an increasing integer rank. Finally, these rows will be filtered to keep only those with a rank less than or equal to2
. The final result set will be the concatenation of rows in each partition.Beyond this use case, existing aggregates (e.g.
AVG
andSUM
) can be used with windows to support things like moving averages and running totals.Considerations for backward compatibility
None.