w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/
Other
124 stars 19 forks source link

Support window functions #47

Open kasei opened 5 years ago

kasei commented 5 years ago

SPARQL should add support for window functions. This would increase expressivity and address some existing use cases such as "limit per resource".

Why

Window functions would allow computing values that are unavailable in SPARQL 1.1 queries:

These can be used to address use cases such as limiting the result set to a specific number of results for each resource ("limit per resource"). For example, consider a query to retrieve information about web posts:

SELECT ?post ?title ?author ?date WHERE {
    ?post a sioc:Post ;
        dc:title ?title ;
        sioc:has_creator ?author
}

Given that a post can have any number of titles and authors, we might wish to restrict our query to only providing information about at most 3 authors for any individual post. This isn't easily done using standard SPARQL, but can be addressed using window functions.

Previous work

Proposed solution

Using a RANK window function, we can filter the result set of the example query above with a HAVING clause:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?post ?title ?author ?date WHERE {
    ?post a sioc:Post ;
        dc:title ?title ;
        sioc:has_creator ?author
}
HAVING (RANK() OVER (PARTITION BY ?post ORDER BY ?author) <= 2)

This will take the the result set from matching the basic graph pattern, and partition it into groups based on the value of ?post. Within each partition, rows will be sorted by ?author, and then assigned an increasing integer rank. Finally, these rows will be filtered to keep only those with a rank less than or equal to 2. The final result set will be the concatenation of rows in each partition.

Beyond this use case, existing aggregates (e.g. AVG and SUM) can be used with windows to support things like moving averages and running totals.

Considerations for backward compatibility

None.

RickMoynihan commented 5 years ago

You can achieve this already with a little more syntax by using a sub SELECT with the LIMIT on to return the subjects, then not setting a limit on the outer query.

kasei commented 5 years ago

@RickMoynihan for specific cases that's true, but I believe window functions can more naturally express some of these use cases and definitely increases expressivity for others. I used this specific example only because it was the one used in the original LimitPerResource feature request for SPARQL 1.1.

kasei commented 5 years ago

(Updated the example to use ROW_NUMBER instead of RANK because this example doesn't use partition ordering. However, RANK is an example of a function that would be hard to approximate using just SPARQL 1.1 due to its handling of ties.)

dydra commented 5 years ago

window functions can more naturally express some of these use cases and definitely increases expressivity for others

please describe such cases.

kasei commented 5 years ago

@dydra I believe the example above is a prime example where it is more naturally expressed using a window function; having to duplicate the BGP in this query simply to pull out a distinct list of posts would be syntactically much more verbose, and this is a fairly straightforward example. SPARQL 1.1 saw precedent for adding features that in part allowed simplifying existing queries that were unnecessarily complex (property paths, negation, etc.). While I think window functions' increased expressivity alone make them worthwhile to discuss for inclusion, the simplification of some types of queries would also represent a big improvement.

Also, as mentioned, computing a moving average is an example that cannot be addressed with SPARQL 1.1. Another common window function I've used in production and would love to see come to SPARQL is quantile computation.

kasei commented 5 years ago

@dydra @RickMoynihan

Apologies. In my haste to compile and submit issues today I seem to have botched the LimitPerResource example as well as my (obviously untested) query showing how to apply window functions to it. This was compounded by just skimming the original feature request page for SPARQL 1.1, which didn't really cover the real use cases as I have understood them in the intervening years.

My current understanding of the "limit per resource" use case would be this: you want to limit the number of results you get from the query for each distinct value of some set of variables. For the SIOC posts example that started this off, this would be something like "for each post, I'm interested in getting data for at most 3 authors." I hope it will be clear why this is a more interesting and complicated example. For this, window functions can be used by partitioning the results by the ?post variable, assigning a rank within each partition ordered by author bound to a new variable ?rank, and then filtering all results such that ?rank <= 3.

I hope that helps to clarify how window functions could address this specific use case. I'll try to update the example in the issue text to more accurately represent this use case. As I said, though, I think their value goes well beyond this use case, and can provide some nice features that we currently don't have.

VladimirAlexiev commented 5 years ago

@kasei Fully support. Can you craft more examples of the uses you mentioned, eg moving averages?

kasei commented 5 years ago

@VladimirAlexiev

Again, using the strawman syntax I've implemented, here are couple of examples:

# Moving average over the past 4 quarters
SELECT ?quarter ?value (AVG(?value) OVER (ORDER BY ?quarter ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ?movingAverage) WHERE {
    VALUES (?quarter ?value) {
        ("2017Q1" 1)        # 1.0   = avg(1) ; only 1 Q of data
        ("2017Q2" 2)        # 1.5   = avg(1, 2) ; only 2 Q of data
        ("2017Q3" 1)        # 1.33  = avg(1, 2, 1) ; only 3 Q of data
        ("2017Q4" 2)        # 1.5   = avg(1, 2, 1, 2)
        ("2018Q1" 3)        # 2.0   = avg(2, 1, 2, 3)
        ("2018Q2" 2)        # 2.0   = avg(1, 2, 3, 2)
        ("2018Q3" 2)        # 2.25  = avg(2, 3, 2, 2)
        ("2018Q4" 4)        # 2.75  = avg(3, 2, 2, 4)
        ("2019Q1" 5)        # 3.25  = avg(2, 2, 4, 5)
    }
}
ORDER BY ?quarter
# 3 photos from each country
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?image ?country WHERE {
    ?image a foaf:Image ;
        dcterms:coverage [ foaf:name ?country ; dcterms:type "Country" ] ;
        .
}
HAVING (ROW_NUMBER() OVER (PARTITION BY ?country) <= 3)
ORDER BY ?country

I can't guarantee it's uptime, but these can be run at this endpoint.

kjetilk commented 5 years ago

I was the source of LimitPerResource way back before the SPARQL 1.1 work started, and it was a part of the initial work of the WG, but didn't make it in, something I have suffered ever since.

The initial use case was that we had a database of articles, where each article could be categorized into a number of categories and have any number of authors, and it had a bunch of optional fields. All this could yield a cyclic graph with a bit of heterogeneity when we CONSTRUCTed results, and what we found (this was before subqueries too), was that while SPARQL 1.0 would often work over heterogeneous data, but apart from OPTIONAL, wasn't very well adapted to it. Subqueries helped quite a lot, but I noticed when moving from industry to academia, that students didn't think of subqueries when they were faced with this problem:

you don't care about the number of solutions to query, you care about the number of "things"

So, the subquery route to achieving such a simple requirement quickly turned out to be very developer unfriendly. We had a hard time implementing paging of results as pure LIMIT and OFFSET couldn't not be used. This is a very common use case, and so I think we really need to fix this going forward.

Then the question is if we should fix this by some surface syntax, but then, @kasei came up with this idea of a windowing function. I find it intriguing, as I would assume it has use cases well beyond mine, event stream processing, as implemented by several extensions, would seem to benefit as well by the introduction of this algebra.

Thus, a windowing function shouldn't be considered just for my original case, even though I think that is an important one, but should be considered more generally.

lisp commented 5 years ago

one way to read this is that some properties of the execution process - such as ordinal in a given solution sequence, should be available as first-class values and that the aggregation operation should be extended with operations which use them.

kasei commented 5 years ago

@lisp Yes, that could solve some (but not all) of the cases for which window functions are useful. However, window functions go beyond the expressivity of just having access to an ordinal number. For example, this can be seen in cases where the value can represent ties (RANK vs. ROW_NUMBER), where the window frame is smaller than the partition group (e.g. while computing a moving average), or where the computed value uses both the ordinal of the current row and information from the entire partition (e.g. the NTILE function). Some of these might be able to be computed with syntactically cumbersome repetition using just aggregates and existing graph operators, but not all of them, and certainly not in as concise a fashion.

lisp commented 5 years ago

ok. are there any execution properties which do not devolve to an ordinal in some solution sequence?

kasei commented 5 years ago

@lisp

(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)

Window functions can do anything they want on the rows in the window frame, so I imagine there's a lot of potential for ones that don't devolve to ordinals. Amongst the common window functions, I'd think that anything that dealt with ties would not be equivalent to ordinal-based approaches. This includes things like RANK which might produce a sequence like (1, 1, 1, 4, 4, 6) (as opposed to ROW_NUMBER which would just be 1, 2, 3, 4, 5, 6).

Ties can also impact the rows that comprise the window frame. Instead of defining the window frame as a range of row offsets, it can be defined as operating over groupings of rows that are deemed equivalent based on the window ordering. (Hard to explain in a quick comment here; this can be looked up as the difference between using ROWS and RANGE.) I've not worked with use cases that needed this functionality but I suspect that this would be incredibly difficult or impossible to do based just on row ordinals.

dydra commented 5 years ago

On 2019-04-08, at 03:55:05, Gregory Todd Williams notifications@github.com wrote:

@lisp

(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)

(neither am i, but i suppose, that may have to change. that is why i am trying to deconstruct them.)

Window functions can do anything they want on the rows in the window frame, so I imagine there's a lot of potential for ones that don't devolve to ordinals. Amongst the common window functions, I'd think that anything that dealt with ties would not be equivalent to ordinal-based approaches. This includes things like RANK which might produce a sequence like (1, 1, 1, 4, 4, 6) (as opposed to ROW_NUMBER which would just be 1, 2, 3, 4, 5, 6).

that is described (elsewhere) as though it devolves to a sequence number, just not directly.

Ties can also impact the rows that comprise the window frame.

please describe what you do with ties.

Instead of defining the window frame as a range of row offsets, it can be defined as operating over groupings of rows that are deemed equivalent based on the window ordering.

see, above, my comprehension of rank.

(Hard to explain in a quick comment here; this can be looked up as the difference between using ROWS and RANGE.) I've not worked with use cases that needed this functionality but I suspect that this would be incredibly difficult or impossible to do based just on row ordinals.

everything which i have found from sql and nosql sources describes (a) value(s) which devolve(s) to (a) position(s) in the equivalent of (a) solution sequence(s).

possible alternative could be:

kasei commented 5 years ago

@dydra

that is described (elsewhere) as though it devolves to a sequence number, just not directly.

everything which i have found from sql and nosql sources describes (a) value(s) which devolve(s) to (a) position(s) in the equivalent of (a) solution sequence(s).

Could you point me at some of these descriptions? I'm not clear how some of these can be reduced to just sequence numbers. For example, using RANK() on data with ties:

SELECT ?row (RANK() OVER (PARTITION BY ?partition ORDER BY ?value) AS ?rank) WHERE {
    VALUES (?row ?partition ?value) {
        (1 1 1)
        (2 1 2)
        (3 1 2)
        (4 1 2)
        (5 2 3)
        (6 2 3)
    }
}

[run]%20WHERE%20%7B%0A%09VALUES%20(?row%20?partition%20?value)%20%7B%0A%09%09(1%201%201)%0A%09%09(2%201%202)%0A%09%09(3%201%202)%0A%09%09(4%201%202)%0A%09%09(5%202%203)%0A%09%09(6%202%203)%0A%09%7D%0A%7D%0A)

The resulting ?rank row ordinals with corresponding ?rank values are:

row rank
1 1
1 2
3 2
4 2
5 1
6 1

I'm not sure how that could be reduced to just the ordinals, but maybe you have some ideas.

A more complicated example (and for which I don't have an implementation) would be using window framing using RANGE instead of ROWS. I've found the SQLite Window Functions Introduction to be one of the most accessible descriptions of the complexity involved here.

Perhaps we can discuss specific examples? Using a strawman syntax for accessing sequence numbers, can you show or think of an approach to implementing some of the examples above without window functions? My instinct is that many queries (like moving average, or photos per country) might be possible but will quickly become syntactically cumbersome when making small changes (e.g. changing the number of photos per country from 3 to 10), to such an extent that they become either unmanageable or so complex that query runtime becomes prohibitive (or both).

lisp commented 5 years ago

Could you point me at some of these descriptions?

starting simply with a search for the difference btw rank and row number, various descriptions read very much as when the results reflect positions in some sequence. as does the sqllite description. this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.

kasei commented 5 years ago

@lisp I found that second link rather unhelpful. I couldn't reconcile the description with my understanding of RANK or the actual results shown in the example. Did you take a look at the SQLite documentation by any chance? It's not perfect, but I found its definitions and description of peer groups to be much better than most other documentation.

lisp commented 5 years ago

note, please, above, "as does the sqllite description".

kasei commented 5 years ago

Apologies, I had missed the SQLite mention in your last response.

this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.

Perhaps I've lost track of what your concerns are. I think ordinal numbers are used in a few places in window functions and in the examples we've discussed. They are the result of ROW_NUMBER, and are required for computing things like RANK and DENSE_RANK. They are also required for defining a window frame (either directly in cases like ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, or indirectly after peer groups have been computed).

However, I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set. I don't believe this is the case, at least not in the general case and/or without tremendous complexity in the query to achieve similar semantics. Window function syntax allows concise declaration of the intended function, window, and optional framing without having to jump through hoops to do the same thing (e.g. without framing, many window queries could probably be alternatively computed by using aggregate subqueries to compute the window partitioning and adding extra joins to get the partition key into each solution. I'm not sure just how far this approach could take you in approximating larger subsets of window functions' expressivity, though.)

lisp commented 5 years ago

I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set.

that was way back at the start and has long since been supplanted by my response to

(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)

(neither am i, but i suppose, that may have to change. that is why i am trying to deconstruct them.)

in connection with which intent, i conjectured (in my response on 08.04) as to alternatives to ordinals. in addition to which list, i wonder about operators which incorporate specific alternative preceding or succeeding solutions.

mbarbieri77 commented 3 years ago

Hello guys, I’ve been working on a few articles that I have published under agnos.ai on Medium.com that I would like to share with you. I think that the great thing about these articles is that they use a RDF triplestore that was migrated from a well-known sample relational database, which allows people to run SQL and SPARQL queries side by side. This can be very useful for learning SPARQL, when you already know SQL, or to validate new feature proposals for SPARQL 1.2, for example, where you can show what you want in SQL, and then validate future implementations by running SPARQL against a equivalent triplestore. https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077 Let me know what you think and any feedback is very welcome. Thank you. Marcelo.

TallTed commented 3 years ago

Sounds to me like Scrollable Cursors (see ODBC, JDBC, SQL, etc.) are being reinvented...

Hard to handle in a RESTian world, but quite valuable if specified and implemented well!