w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Keyword for functions that can produce multiple bindings #6

Open HolgerKnublauch opened 5 years ago

HolgerKnublauch commented 5 years ago

Currently, BIND assignments can produce at most one binding for a single variable.

Many SPARQL implementations have a notion of "special" or "magic" predicates that can be used to bind multiple variables at once, and to produce multiple rows at once. Examples include Jena's built-in magic properties/property functions (e.g. https://jena.apache.org/documentation/query/extension.html#property-functions) and various free-text search implementations that take a search string as input and produce all subjects that have matching values. Further, some implementations even use the SERVICE keyword as a hack, e.g. SERVICE wikibase:label. From our own modest experience there are tons of use cases for such multi-binding functions.

Even if SPARQL 1.2 doesn't include any specific built-in multi-binding functions, it would be great to have a proper syntax and an extension point so that these special functions can be consistently identified, supported by syntax-directed editors etc.

I struggle to find a proper keyword, so as a placeholder I am using ITERATE in this proposal:

SELECT * WHERE { ... ITERATE (ex:myFunction(?inputA, ?inputB, ...) AS ?outputA, ?outputB, ...) . }

The above would produce an iterator of bindings evaluating whatever is implemented by ex:myFunction. The right hand side would have unbound variables that get bound in each iteration. It's similar to BIND otherwise.
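The intended evaluation could be sketched in Python (a model only; `iterate` and `split_indexed` are hypothetical stand-ins for the proposed keyword and for `ex:myFunction`):

```python
def iterate(solutions, func, in_vars, out_vars):
    """Sketch of the proposed ITERATE: for each incoming solution,
    call func on the values bound to the input variables; func yields
    zero or more tuples, and each tuple extends the solution with
    bindings for the output variables."""
    for sol in solutions:
        args = [sol[v] for v in in_vars]
        for outputs in func(*args):
            yield {**sol, **dict(zip(out_vars, outputs))}

# A hypothetical multi-binding function standing in for ex:myFunction:
# split a string and yield (index, segment) pairs, i.e. two output
# variables per produced row.
def split_indexed(s):
    for i, seg in enumerate(s.split(", ")):
        yield (i, seg)

rows = list(iterate([{"s": "a, b"}], split_indexed, ["s"], ["i", "seg"]))
# One input solution fans out into one row per segment.
```

As with BIND, the input solution's bindings survive into every produced row; the output variables must be unbound on entry.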

As related work, SHACL-AF includes a mechanism to declare new SPARQL functions that wrap ASK or SELECT queries. Obviously this could be extended to turn SELECT queries into reusable "stored procedures" that can then be reused in other queries using the new keyword.

cygri commented 5 years ago

A very common use case for this is splitting a string on a delimiter. In Jena this can be done with the apf:strSplit property function:

?segment apf:strSplit (?s ", ")

This works, but the property function syntax is pretty obscure and is really an abuse of triple pattern syntax.

So let us consider other syntax options. Extending the BIND keyword for this looks much clearer:

BIND (split(?s, ", ") AS ?segment)

The idea is to extend BIND to allow producing multiple values, leading to multiple result solutions per input solution. One could even imagine something like:

BIND (concat("segment ", split(?s, ", ")) AS ?segment)

split returns multiple values, and the concat function would be called for each result, again leading to multiple result solutions per input solution. (It's less clear how that proposal would work so nicely if the output of the function is multiple variables like in Holger's example.)
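One way to read that proposal is that every expression evaluates to a list of values, with ordinary functions such as concat mapped over all combinations of their arguments' values. A Python sketch of that flat-map reading (the expression encoding is an illustration, not an actual engine API):

```python
from itertools import product

def eval_multi(expr, sol):
    """Evaluate an expression to a LIST of values: constants and
    variables yield one value, split() yields several, and ordinary
    functions like concat are applied to every combination of their
    arguments' values (the flat-map reading discussed above)."""
    kind = expr[0]
    if kind == "const":
        return [expr[1]]
    if kind == "var":
        return [sol[expr[1]]]
    if kind == "split":
        strings = eval_multi(expr[1], sol)
        sep = expr[2]
        return [seg for s in strings for seg in s.split(sep)]
    if kind == "concat":
        arg_lists = [eval_multi(a, sol) for a in expr[1:]]
        return ["".join(combo) for combo in product(*arg_lists)]
    raise ValueError(kind)

# BIND (concat("segment ", split(?s, ", ")) AS ?segment)
expr = ("concat", ("const", "segment "), ("split", ("var", "s"), ", "))
values = eval_multi(expr, {"s": "a, b"})
# One solution per split result: ["segment a", "segment b"]
```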

rvesse commented 5 years ago

+1000

This is one of my big bugbears with the current syntax and the abuse of it by magic properties. You definitely want to list out all the resulting variables so that it is clear what the output columns would be, although you probably also want some way to have implicit outputs.

A lot of existing "magic" properties are used to join extra info or filter the existing result set, so the input variables are implicitly included in the output variables in many cases.

HolgerKnublauch commented 5 years ago

I am aware that some magic properties are supposed to work "in all directions", i.e. regardless of which combination on left/right side is present. I don't think this (complex) contract is often needed in practice, so I'd suggest we focus on a clear separation between input and output.

jindrichmynarz commented 5 years ago

Would allowing BIND to produce multiple bindings (e.g., via a SPARQL extension function) break the current evaluation model?

cygri commented 5 years ago

Would allowing BIND to produce multiple bindings (e.g., via a SPARQL extension function) break the current evaluation model?

In current SPARQL, the argument to BIND is an expression, and expressions return exactly one value (or an error). Extension functions are one kind of expression, so they cannot return multiple values. Allowing multiple return values from an expression would be a change to the evaluation model. It also raises the question what would happen in other contexts where expressions are used, like in FILTER, in SELECT ... (expr AS ?var) ..., or in ORDER BY.

rvesse commented 5 years ago

I think you would definitely need a different keyword to distinguish the existing row-in row-out semantics of BIND from this use case

cygri commented 5 years ago

I think you would definitely need a different keyword to distinguish the existing row-in row-out semantics of BIND from this use case

Why? If the semantics of BIND were changed to multiple-solutions-out-per-solution-in, then the current one-solution-out-per-solution-in semantics would continue to work as an edge case. Expressions that return zero or more results (as all SPARQL 1.1 expressions do) would continue to behave as they do today. So it's a clean extension even without a new keyword.

I'm more worried about other uses of expressions like FILTER, where it's less obvious how to handle expressions with multiple results.

joernhees commented 5 years ago

i agree with the reasoning about reusing BIND (as long as we're not in the projection part anyhow).

Wrt. tuple assignments: how about something like expr AS (?v1 ?v2) (so additional parentheses denote an explicit tuple assignment, and AS ?v1 could become syntactic sugar for AS (?v1))?

dydra commented 5 years ago

I'm more worried about other uses of expressions like FILTER, where it's less obvious how to handle expressions with multiple results.

would that not be just a matter of the respective function signatures? they would need to agree between argument results and function parameters. if the effective boolean value were extended to define a truth value for result sets, that would close it.

dydra commented 5 years ago

allegro provides one example as to how to approach this "signature" aspect of this issue. it permits a list of terms as both subject and object of a functional predicate term. they call them "magic" properties.

cygri commented 5 years ago

I am losing confidence in my proposal to extend BIND for this. It gets confusing because these solution sequence returning functions are quite different from normal SPARQL functions, and I don't think we want to imply that they can be used anywhere that normal SPARQL functions/expressions can be used.

Then how about extending a different existing keyword?

VALUES ?item OF split(?items, ", ")
VALUES (?ns ?local) OF splitIRI(?s)
VALUES (?col1 ?col2 ?col3) OF my:readCSV(<data.csv>)

So the result is a solution sequence—just like with the current VALUES clause—but the bound values are provided by the function, rather than listed literally in the clause.

dydra commented 5 years ago

how about

{ (?col1 ?col2 ?col3) my:readCSV (<data.csv>) 
... }

?

cygri commented 5 years ago

@dydra Did you read the issue? That is exactly what everyone is doing at the moment, and the point of this issue is that we want to get away from that abuse of triple pattern syntax.

dydra commented 5 years ago

i did and i do not find it to be abuse.

cygri commented 5 years ago

@dydra Property functions a.k.a. magic properties require selectively ignoring the SPARQL spec.

To add, for example, an apf:strSplit-like property function to the SPARQL spec, the handling of basic graph patterns in the spec would first have to be changed to explain that certain parts of certain BGPs are exempt from normal processing and are instead interpreted as property functions.

The syntax is inconvenient as it doesn’t allow expressions as arguments.

There are conceptual problems with property functions too. Are the inputs on the left side and the outputs on the right, or vice versa? Which arguments must be fixed and which may be left variable? How does a SPARQL engine know whether to execute other parts of the graph pattern first in order to produce bound values for some of the variables?

The ITERATE, BIND and VALUES proposals have none of these problems.

afs commented 5 years ago

Would someone care to create and populate a page on the wiki that records what existing approaches there are?

Virtual graphs and use of SERVICE for accessing custom functionality fall under the multiple bindings concept.

From an implementers POV, it would be very useful to know which are the input arguments (must be bound) and which are the output arguments (which must be bound by the call).

dydra commented 5 years ago

Property functions a.k.a. magic properties require selectively ignoring the SPARQL spec. The ITERATE, BIND and VALUES proposals have none of these problems.

if i understand that argument, it suggests that, to change the semantics of function invocation to permit multiple value return and/or a single invocation to yield a solution sequence is without problems, while to define the semantics for a form such as

prefix fn: <http://unifying-functions.org/>
select *
where {
  (?r1 ?r2) fn:split (?someString ?arg1)
}

or, in other words,

(select
 (bgp
   {??2 fn:split ??4}
   {??4 rdf:first ?someString}
   {??4 rdf:rest ??3}
   {??3 rdf:first ?arg1}
   {??3 rdf:rest rdf:nil}
   {??2 rdf:first ?r1}
   {??2 rdf:rest ??1}
   {??1 rdf:first ?r2}
   {??1 rdf:rest rdf:nil})
 (?arg1 ?r1 ?r2 ?someString))

would somehow conflict with 18.7.

while multiple values are very useful and would serve the language well, were they to be supported, the changes which would be required to Expression to permit solution sequences as results (and thereby as arguments) appear at first glance to be much more complex than a definition for bgp-based invocation which conforms to 18.7.

cygri commented 5 years ago

@dydra I didn't say that the ITERATE, BIND and VALUES proposals are without problems. I said they don't share several specific problems of property functions.

I didn't say that property functions conflict with section 18.7, and if they do, I wouldn't know! I said that adding property functions to SPARQL would require changes to the handling of BGPs. 18.7 is one way of doing that.

I am not advocating a change to SPARQL functions as used in expressions. I did so previously, but as I said further up in this thread, I no longer think it's a good idea.

I am (at least today) advocating this form:

VALUES ?item OF split(?items, ", ")

The split here is a “solution sequence producer” for lack of a better term. Its arguments are normal expressions, its output is a solution sequence, and it can only occur in a VALUES OF clause. Like aggregates, it shares the syntax of function calls, but is not a normal SPARQL function.

dydra commented 5 years ago

ok. i remain concerned. for example, what does

VALUES ?item OF concat('=', split(?items, ", "), '?')

yield?

cygri commented 5 years ago

The “function call” of a VALUES OF clause must be a “solution sequence producer”, so concat, being a normal function, cannot be used there.

That is also the only place where an SSP is allowed, so split cannot be an argument to a normal function.

dydra commented 5 years ago

ok, no problem.

jindrichmynarz commented 5 years ago

I like the way @cygri's proposal is going. What if we treated the VALUES DataBlock as a data structure literal for solution sequences? Can we allow substituting it with functions that produce solution sequences? Then, perhaps the OF symbol can be dropped.

cygri commented 5 years ago

@jindrichmynarz So that would be:

VALUES ?item split(?list, ", ")
VALUES (?col1 ?col2) my:readCSV(<file.csv>)

I added the OF after the variable list just to make it read more intuitively. From a grammar/definition point of view, dropping it would be no problem.

VALUES with a solution sequence producing function is a bit different from the existing inline-data form. The former may have variables in the arguments, so needs to be evaluated against the mappings produced by preceding clauses. The latter is essentially a literal; its result doesn't depend on anything outside the clause.
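The distinction can be sketched in Python: an inline data block is a producer that ignores the incoming solution, while a solution-sequence-producing function must be re-evaluated per solution because its arguments may mention variables (all names are illustrative):

```python
def extend(solutions, producer):
    """Join each incoming solution with the rows its producer yields,
    keeping the original bindings (a lateral-join-like evaluation)."""
    return [{**sol, **row} for sol in solutions for row in producer(sol)]

# Inline VALUES ?bar { "x" "y" }: a literal table, the same for any
# incoming solution.
inline = lambda sol: [{"bar": v} for v in ("x", "y")]

# VALUES ?item OF split(?items, ", "): the argument mentions ?items,
# so the produced table depends on each incoming solution.
def split_of(sol):
    return [{"item": seg} for seg in sol["items"].split(", ")]

lit = extend([{"n": 1}], inline)              # 2 rows, input-independent
dep = extend([{"items": "a, b"}], split_of)   # rows depend on ?items
```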

afs commented 5 years ago

This is a high-level outline to test how to proceed.

There is a connection to issue #10 (Generalize SERVICE for non-SPARQL endpoints) and others.

There could be a single mechanism for extensions named by URI that takes arguments and returns a result table.

There are important details to work on, such as whether and how the callout declares its "in" and "out" parameters, and whether the "in" parameters must be bound.

There may be specific syntax for important cases to make the mechanism easier to use for those cases and aid engines providing only certain common cases.

6 - Keyword for functions that can produce multiple bindings

10 - Generalize SERVICE for non-SPARQL endpoints

40 - Standardize free text search of RDF data

44 - Named solution sets

[R2RML](https://www.w3.org/TR/r2rml/), CSVW

dydra commented 5 years ago

in this illustration

VALUES ?item split(?list, ", ")

the specific example is without context. were it placed in some query context, how would that query be parsed such that ?list is in the scope of some binding?

cygri commented 5 years ago

@dydra I imagine it would be like BIND. Variables from the preceding graph patterns in the same group would be in scope.

Thinking out loud: Another useful form might be VALUES * OF my:xyz(...). In that form, it is not the query writer who picks the variable names, but the function makes the choice. Use cases include:

VladimirAlexiev commented 5 years ago

Referencing #14, could the syntax be something like

values (?company ?industry) of apf:strSplitParallel ((?companies,",") (?industries,","))
cygri commented 5 years ago

@VladimirAlexiev Why the nesting in the arguments? It is complicated, and at least for this use case not needed. I think SPARQL's existing function call syntax is sufficient:

VALUES (?company ?industry) OF my:strSplitParallel(?companies, ",", ?industries, ",")

Must have an even number of arguments. Arguments 1, 3, etc. are the strings to split. 2, 4, etc. are the separators for the preceding argument.
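Under that convention, the parallel split pairs up the segments positionally. A hypothetical sketch of what my:strSplitParallel could compute:

```python
def str_split_parallel(*args):
    """Hypothetical my:strSplitParallel: an even number of arguments,
    alternating (string, separator); each string is split with its
    separator and the resulting lists are zipped positionally into
    output rows."""
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of arguments")
    columns = [s.split(sep) for s, sep in zip(args[0::2], args[1::2])]
    return list(zip(*columns))   # one row per position

rows = str_split_parallel("ACME,Globex", ",", "Tech,Energy", ",")
# -> [("ACME", "Tech"), ("Globex", "Energy")]
```

Note that zip truncates to the shortest list; whether mismatched segment counts should instead be an error is a design choice this sketch leaves open.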

afs commented 5 years ago

Implementation experience - knowing the signature of the function helps optimization. For this feature, the undeclared output variables would make things like putting filters in the optimal place in the execution harder - that's just one example and there are other possible optimizations.

There are 3 cases from my implementation experience:

rdfs:seeAlso #44.

afs commented 5 years ago

Sometimes multi-functions take a lot of arguments, some of which are optional. One trick is that the argument list is variable length, but that does not completely work because it forces the order of the input variables and that might be confusing. Also, to set a later one all the earlier ones have to be defined to get the position right. One way is to allow UNDEF (not the best word in this context), but the length and alignment of the list of arguments still has to be right.

Named arguments are more readable:

 OF ex:algorithm(?fixed1, ?fixed2, { "a": ?opt1, "b": ?opt2 })

The parsing details need further work and depend on how complicated to make the parsing technology required.

cygri commented 5 years ago

@afs I was thinking that the list of variable names in the VALUES * OF func(args) form would only be allowed to depend on the unevaluated arguments, and not on anything in the solutions.

So, a mechanism for defining multi-functions (e.g., a Java interface MultiFunction) could make it a two-stage process where a method Var[] build(Expr[]) is called at query parse/build time, and the list of variable names is fixed after that. So the query planner would know what variables are bound in the results of the function.

Such an interface for implementing multi-functions could also require implementations to announce whether each variable is always bound or might be unbound, if that is useful information for the query optimiser. That would be an implementation issue.
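That two-stage contract could be sketched as follows (Python rather than Java, and all names are illustrative; the point is only that build() sees the unevaluated arguments and fixes the signature before any data is touched):

```python
class MultiFunction:
    def build(self, arg_exprs):
        """Called once at query build time with the UNevaluated
        argument expressions; returns the fixed list of output
        variable names, so the planner knows the signature."""
        raise NotImplementedError

    def execute(self, args):
        """Called per solution with the evaluated arguments; yields
        one tuple of values per produced row."""
        raise NotImplementedError

class SplitIRI(MultiFunction):
    """Hypothetical splitIRI from the earlier examples: its output
    variables are fixed regardless of the arguments, so VALUES *
    would always be allowed for it."""
    def build(self, arg_exprs):
        return ["ns", "local"]

    def execute(self, args):
        iri = args[0]
        sep = "#" if "#" in iri else "/"
        ns, _, local = iri.rpartition(sep)
        yield (ns + sep, local)

f = SplitIRI()
signature = f.build([("var", "s")])   # known before any data is seen
```

Under this model, a readCSV implementation's build() would only succeed when its file-name argument is a constant, since it must read the header row at build time.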

So, the following would not be allowed, because the list of variables (in the first row of the CSV file) depends on the evaluated argument, and hence is not yet known at query build time:

VALUES * OF my:readCSV(?csvFile)

But the following would be allowed; the file name is known at query build time, and the function implementation can read the first line at that time to return the variable list:

VALUES * OF my:readCSV(<data.csv>)

The following would be allowed, because it's not VALUES *, the variables are explicitly listed:

VALUES (?col1 ?col2 ?col3) OF my:readCSV(?file)

Named arguments in addition to positional arguments are worth considering, but if a syntax is introduced for them, then I don't see why that shouldn't also be allowed on normal functions and aggregates. As they are orthogonal to the kind of function, they should get their own issue.

afs commented 5 years ago

So is this understanding right?

There is a phase during SPARQL execution where multi-bind functions declare their result variables, so the SPARQL engine has access to a function signature for each multi-bind function instance. The information provided to the dynamic signature request includes the input arguments as syntactically declared (based on the example of VALUES * OF wikibase:label("en", ?p, ?w)).

That would need some extra conditions:

VALUES * OF m:function()   # Returns ?V
BIND(expr AS ?V)

but at first pass looks workable.

Incidentally, that would relate to SELECT expressions, including aggregates, without needing an AS.

cygri commented 5 years ago

@afs That's correct. The spec would have to say that the output variables of VALUES * OF myFunction, as determined by myFunction, are in scope. That would take care of the right behaviour for BIND and SELECT *.

HolgerKnublauch commented 3 years ago

FWIW for TopBraid we are adding a declarative mechanism for defining such functions: http://datashapes.org/multifunctions.html

As shown it can be mapped to (Jena) property functions, yet some official syntax would be much cleaner in my opinion.

lisp commented 3 years ago

given the invoking form

SELECT *
WHERE {
    schema:DayOfWeek ex:namedSuperClasses ( ?superClass ?label ) .
}

why is the declaration necessary?

is it not sufficient to just establish that syntax for multiple value returns?

HolgerKnublauch commented 3 years ago

I don't understand your question. How could a SPARQL processor know what it needs to do without a declaration mechanism first? Hard-code ex:namedSuperClasses? All that a normal SPARQL engine will see is that ex:namedSuperClasses is a predicate, so it will look up matches in the graph and know nothing about the special meaning.

lisp commented 3 years ago

How could a SPARQL processor know what it needs to do without a declaration mechanism first?

it knows how many return values to accommodate because it has the invocation form.

it needs a means with which to determine whether the term in the predicate position is intended to be unified with the graph or invoked as a function. that is independent of the mechanism through which it knows the actual return value cardinality and integrates those values into solutions.

the generation of the return values is a matter for the function definition.
how the sparql processor arranges to handle return values is independent.

see. http://clhs.lisp.se/Body/m_multip.htm

the only matter for sparql-1.2 is to define that if it is known that a term in the predicate position is a function, both a single object term and a list of object terms are valid and in the latter case, the processor must prepare for multiple return values. the decision follows from the syntax only and is independent of the mechanism which defines the function.

given the example, how is it intended that a processor interpret the following?

SELECT *
WHERE {
    schema:DayOfWeek ex:namedSuperClasses ( ?superClass ) .
}

and

SELECT *
WHERE {
    schema:DayOfWeek ex:namedSuperClasses ?superClass .
}

should they not be valid expressions?

afs commented 3 years ago

Apache Jena property functions are multi-value returns, including zero. They are supposed to be like triple matching but also overload () lists. They yield multiple rows in the results at that point of execution.

There is a consequence of this. The implementation must cope with variables in subject or object position and on return all variables must be bound (in order to be "triple match like").

A proper multiple returns facility needs to distinguish "in" and "out" variables. Distinguishing "required" and "optional" would also be helpful as would named arguments to make stored procedures with many arguments easier to work with.

It could be by prior declaration or the syntax at the call point could provide this. Fully dynamic (no declaration either way, possible execution errors) is possible but would limit error checking and query optimization involving reordering near the stored procedure; I don't see advantages to fully dynamic.

lisp commented 3 years ago

but would limit error checking and query optimization involving reordering near the stored procedure

of the expressed considerations, this one would concern me.

what information could be declared which would not be covered by that which is available in the lexical context of the invocation?

lisp commented 3 years ago

A proper multiple returns facility needs to distinguish "in" and "out" variables.

an alternative is to unify: whatever values are bound in the context are passed. this can be statically evident, but need not be.

HolgerKnublauch commented 2 years ago

To be clear, the proposed design here is not to formalize magic properties. From my perspective, magic properties are a convenient hack that didn't break the SPARQL 1.0 syntax and there are some arguments in favor of mapping them to triple matches because when there is only one value on the left and right then they COULD be implemented by normal triples (e.g. as inferences). But I don't think the syntax is good, nor is the expectation that such functions can be executed in both directions always realistic. Thus, the proposal for Multi-Functions is meant as input to a future SPARQL syntax that doesn't use magic properties but a dedicated keyword. We use magic properties for the time being because there is no alternative that works with SPARQL 1.1 syntax.

And yes I agree with Andy that these shouldn't be fully dynamic. In our framework, the multi-functions are installed ahead of time from dedicated graphs that the SPARQL engine knows about. This means that syntax checking (including sh:optional, sh:class, sh:datatype etc) could be done at edit time and parse time. It also shouldn't be required for the data graph to include the dash:MultiFunction declarations before they can be used.

frensjan commented 2 years ago

To add a use case: I'm adding some trickery in an RDF4J query optimizer pipeline to support stuff like:

BIND ( generate:range(1, 5) AS ?foo )

VALUES ?bar {
    "x"
    "y"
    "z"
}

BIND ( concat( "foo/", str(?foo), "/", ?bar ) AS ?res )

to yield output:

foo | bar | res
1   |   x | foo/1/x
1   |   y | foo/1/y
1   |   z | foo/1/z
2   |   x | foo/2/x
...
5   |   z | foo/5/z

Ideally also the following would work:

VALUES ?bar { ... }
BIND ( concat( "foo/", str( generate:range(1, 5) ), "/", ?bar ) AS ?res )

I guess this illustrates the multi output / flat map type of behaviour I was after.
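That flat-map behaviour, with generate:range stood in by Python's range (a sketch of the expected semantics, not RDF4J code):

```python
from itertools import product

foos = range(1, 6)            # generate:range(1, 5), end inclusive
bars = ["x", "y", "z"]        # the VALUES ?bar block

# Cross product of the two multi-valued clauses, with the final BIND
# evaluated once per combination.
rows = [(foo, bar, f"foo/{foo}/{bar}") for foo, bar in product(foos, bars)]
# 15 rows in total, from (1, "x", "foo/1/x") to (5, "z", "foo/5/z")
```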

I guess this is a different use case from yielding multiple outputs from a function to then be captured into multiple variables. That case has more similarity with tuple unpacking / destructuring in e.g. Python, C# and JavaScript. For that use case a new syntax would be good.