I would propose a slightly different approach. You can have a look at this branch, where I quickly hacked together some interfaces. It lacks some details but should cover the core concept: a `TermFilter` instance has a `test` method that can be used to check whether a `Term` fulfills the criteria of the filter. There are two reasons why I would favor that approach: filters can be simple plain objects, such as:

```js
{
  type: 'customWhitespaceBeginEnd',
  test: term => term.value === term.value.trim()
}
```

and the fallback logic is trivial: if a filter is given to your `match` method but you don't have any optimized code for the test, you can simply check the filter; if you have optimized code in your `match` method, you can also handle it there.

@jacoscaz I very much like this idea! But I feel like the current interface is a bit too restrictive and may not be future-proof enough for other types of filters.
As I see it, this PR is a first step towards pushing filters (in this case range-based filters) into the Source itself. As such, I would propose to have a look at the SPARQL algebra, as it gives us a very good example of what kinds of filters are possible and how to define them declaratively. The SPARQL algebra uses declarative expressions to achieve filtering (see here the algebra typings that Comunica uses). For example, a range-based filter could be represented as an `OperationExpression`.
To make feature-detection of filter-based matching easier, I would propose to add an optional method to the `Source` interface. This could, for example, look like this:

```ts
interface Source {
  matchExpression?(subject: Term, predicate: Term, object: Term, graph: Term, expression: Expression): Stream;
}
```
I'm wondering if now would be a good time to also start thinking about (not necessarily defining) how ordering could be requested as well? Because that may influence this PR as well. Perhaps we can also add a parameter for this to `matchExpression`?
@bergos I would actually vote against such a `TermFilter`, because it complicates query optimization.
IMO, users (e.g. query engines) of Sources should be able to detect whether or not a Source has a dedicated/efficient implementation of a given filter. If such an implementation is available, a user could make use of it. If no such implementation is present, then a user could decide to call the regular `.match` function instead and perform any kind of filtering manually (which may potentially have certain kinds of custom optimizations).
Furthermore, the availability of such a dedicated filter implementation could even influence the query plan within query engines. So it is really critical that this information is explicitly available.
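To make this concrete, here is a minimal sketch of the consumer-side logic described above, assuming a hypothetical optional `matchExpression` method (names illustrative only, not spec text):

```js
// Hedged sketch: `source`, `expression` and the quad-pattern terms are
// assumed to exist in scope; `matchExpression` is the optional method
// under discussion.
let quads;
if (typeof source.matchExpression === 'function') {
  // Dedicated implementation available: push the filter down to the source.
  quads = source.matchExpression(subject, predicate, object, graph, expression);
} else {
  // No dedicated implementation: fall back to a plain match() and
  // filter the resulting stream manually downstream.
  quads = source.match(subject, predicate, object, graph);
}
```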
@bergos thank you for your example. I now realize I had misunderstood what you have written on Gitter - apologies. Nonetheless, here's my take on what you suggest.
Changing the `Source` interface to accept `TermFilter`s instead of extending it into a different interface would, IMHO, naturally lead to the reasonable expectation that any spec-compliant implementation of `Source` would come with baked-in support for `TermFilter`s.
On one side, the viability of this would depend on having a shared reference implementation of a filtering iterator that developers of non-optimized `Source` implementations would be able to use to quickly add support for in-memory filtering.
On the other side, it would still hide whether support for filtering at the persistence layer (optimized) is present within any given `Source` implementation, making it impossible to implement informed query planning at a higher level. Incidentally, this is also why something like a `getApproximateCount()` method is needed.
All in all, I don't think in-memory filtering should be treated as an equal alternative to persistence-level filtering.
@rubensworks I like your idea of adding a different method to Source. From the perspective of letting consumers know whether support for persistence-level filtering is present within a given `Source` implementation, I think both this approach and extending `Source` into a new interface work equally well and I would be ok with either.
With regards to what you propose, I like the general direction in which you're going and I definitely agree that sorting belongs in this discussion. However, I would be wary of adding too much complexity to an interface that is otherwise wonderfully simple. In the case of the `expression` parameter, how would I go about representing which expressions are supported and which are not? I agree that explicit support for `Range` and `Term` params alone might be restrictive but, IMHO, it would allow the spec to maintain the current level of simplicity - which I think is a feature, not a bug.
All this is strictly my opinion. I am happy to push the conversation forward. Thank you both for pitching in!
EDIT: grammar fixes
I think both this approach and extending Source into a new interface work equally well and I would be ok with either.
Sure, that's true. I would also be fine with either option.
I agree that explicit support for Range and Term params alone might be restrictive but, IMHO, it would allow the spec to maintain the current level of simplicity.
True, for this specific feature, just doing this would be the simplest thing. The problem is that there are many of these simple things that are possible to add. The complexity here does not lie in the separate features, but how these features should be handled and combined. So if we were just to add this range filter to the spec, and nothing more in the future, then I'd fully agree with your current solution. However, I don't think this approach will scale to many features, which is why I think we should introduce a robust framework in which other features could be added later on, and I think SPARQL algebra could be a good basis/inspiration for this. We should not reinvent the wheel :-)
Let me give a simple example:
```js
const source = new MySource();
source.matchExpression(namedNode('...'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```
Spec-wise, this would require defining a list of the allowed operators, and what kind of args they allow. Just like in SPARQL, this would easily allow us to create more complex nested expressions, where each arg could become another operator expression.
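For instance, a sketch of how such nesting could compose, reusing the shapes from the example above (the `&&` operator name is an assumption, not spec text):

```js
// FILTER(?o >= 2 && ?o <= 4) expressed as a single nested expression tree,
// where each arg of the outer operator is itself an operator expression.
const xsdDecimal = namedNode('http://www.w3.org/2001/XMLSchema#decimal');
const expression = {
  operator: '&&',
  args: [
    { operator: '>=', args: [variable('o'), literal('2', xsdDecimal)] },
    { operator: '<=', args: [variable('o'), literal('4', xsdDecimal)] },
  ],
};
```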
@rubensworks I intuitively agree and also defer to your expertise. I would suggest keeping the list of operators as small as possible in order to be able to make support for all operators mandatory, at least in the beginning. If an operator is there, all implementors of either the additional interface or the additional method should support it.
Let's wait a day or so for others to have a chance to pitch in and then I'll update the PR.
I think it would be best if people express their expectations of this API. Then we can define requirements and derive proposals for an API. Some thoughts after the first feedback: currently, the spec only defines the `.match()` method, with no info on whether it's optimized or not.

About the feedback on my proposal: as I said, it was hacked together very quickly, but with some slight changes, while keeping the core concept, it can handle all of the mentioned requirements.
@rubensworks
users (e.g. query engines) of Sources should be able to detect whether or not a Source has a dedicated/efficient implementation of a given filter. If such an implementation is available, a user could make use of it. If no such implementation is present, then a user could decide to call the regular .match function instead and perform any kind of filtering manually (which may potentially have certain kinds of custom optimizations). Furthermore, the availability of such a dedicated filter implementation could even influence the query plan within query engines. So it is really critical that this information is explicitly available.
That would be possible by adding a method to check if the given arguments will use dedicated filter code. I fear more that your proposal will require implementing every kind of filter, but maybe one implementation only wants to handle time series and therefore only filters ranges of datetimes.
@jacoscaz
Changing the Source interface to accept TermFilters instead of extending it into a different interface would, IMHO, naturally lead to the reasonable expectation that any spec-compliant implementation of Source would come with baked-in support for TermFilters.
I was just lazy. I'm also for a new interface and maybe even a different method.
On one side, the viability of this would depend on having a shared reference implementation of a filtering iterator that developers of non-optimized Source implementations would be able to use to quickly add support for in-memory filtering.
It's actually the other way around. With the filter logic in the `.test` method, the code is almost the same as with the current interface for the `Term` objects. It should be simple enough that no shared code is required. Of course, you are free to do so.
On the other side, it would still hide whether support for filtering at the persistence layer (optimized) is present within any given Source implementation, making it impossible to implement informed query planning at a higher level. Incidentally, this is also why something like a getApproximateCount() method is needed.
Could be added with an additional method.
All in all, I don't think in-memory filtering should be treated as an equal alternative to persistence-level filtering.
But that's already the case with the current definition of the `.match` method.
Would it help to bring back the idea of low-level and high-level interfaces? I have the impression that what @rubensworks proposes looks like a lower-level interface while @jacoscaz's looks a little more high-level.
```js
source.matchExpression(null, null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```

seems to map roughly to

```js
store.match(null, null, {
  lte: factory.literal("2", "http://www.w3.org/2001/XMLSchema#decimal"),
}, null);
```
If it's allowed to support only some kinds of filters, how would you like the feature detection to work? This can be very tricky, e.g. one implementation may focus only on time-series data and support range filters only on specific properties.
I think we need an extensible and discoverable interface; some stores may want to implement very specific features, e.g. geospatial queries.
@elf-pavlik
Would it help to bring back the idea of low-level and high-level interfaces? I have the impression that what @rubensworks proposes looks like a lower-level interface while @jacoscaz's looks a little more high-level.
I agree and, considering this PR is for a low-level spec, I am inclined to disregard my own proposal in favor of something more foundational like what @rubensworks is proposing.
I think we need an extensible and discoverable interface; some stores may want to implement very specific features, e.g. geospatial queries.
I agree, and I think we could reconcile this with the above by adding a dedicated method to detect support for a given operator, as per @bergos' suggestion.
I would also simplify @rubensworks's `matchExpression` example by removing support for variables, which as far as I understand might be better dealt with in a higher-level spec, and simply passing the filter definition as if using a `Term` instance:
```js
source.matchExpression(null, null, {
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
}, null);
```
Then, I would add something like:
```js
source.supportsMatchExpressionOperator({
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```

which would return `true` or `false`.
EDITED: dropped `variable()` from the examples.
@bergos @rubensworks @elf-pavlik would the following make everyone happy?
A `matchExpression` method, similar to `match` but adding support for `Filter`-type arguments.
```js
source.matchExpression(null, null, {
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
}, null);
```
A `supportsMatchExpression` method that returns `true` or `false` according to whether the implementing `Source` instance can optimize a query at the persistence layer given `Filter`-type arguments.
```js
source.supportsMatchExpression({
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```
The way I am looking at it, the two of these combined get us the following:

- support for further extensions and complex terms (concern of @rubensworks) without going all the way towards SPARQL algebra (i.e. variable support), which could then be comfortably implemented at a higher level;
- explicit, per-expression detection of whether persistence-level filtering support is present within a given Source implementation.
What do you all think?
EDITED: I forgot - to accurately represent what is going on, I would also foresee the development of a shared `Transform` stream, possibly using @RubenVerborgh's `asynciterator`, implementing all documented operators in-memory. This would allow developers to easily but explicitly add in-memory support for `matchExpression` queries should their `Source` implementations of choice not support an operator they need.
@jacoscaz I very much like the `supportsMatchExpression`!
It indeed allows implementors to only support certain operators.
It feels very similar to Comunica's internal architecture for signaling operator support, so this would fit in perfectly on my end.
The only concern I have here is if we don't need the S/P/O/G parameters in here anymore. This is related to my second comment.
On the one hand, I like your more compact proposal. On the other hand, I think this may be too restrictive to handle certain use cases. For example, the following would be possible to represent in my previous proposal, but not in this new compact proposal:
```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
});
```
So I don't really agree with the following comment:
supports further extensions and complex terms (concern of @rubensworks) without going all the way towards SPARQL algebra (i.e. variable support), which could then be comfortably implemented at a higher-level.
As the example above shows, the compact proposal is not as expressive as the previous proposal. So the variable-based proposal cannot be implemented based on the compact proposal.
As I see it, the more compact representation would be a very valuable higher-level representation that could be implemented internally using the previous (low-level) proposal.
Is my concern clear to you?
I would also foresee the development of a shared Transform stream
Perhaps we can discuss this in a separate issue/PR that would be a dependency of this PR?
@rubensworks
It feels very similar to Comunica's internal architecture for signaling operator support, so this would fit in perfectly on my end.
Happy to hear this! Another place where we're starting to have consensus!
Is my concern clear to you?
Yes, and thank you for taking the time to exemplify your argument - I honestly appreciate it. I think I understand the extent of the limitation that my "compact proposal" brings. To sum it up, it does not allow multi-term filtering expressions within the same query.
That was intentional on my part, in order to try and keep the `matchExpression` method as simple to grasp as possible. I don't think I have ever seen support for multi-term expressions at the persistence/optimized level in JS-land, not even in https://github.com/levelgraph/levelgraph if I remember correctly, whereas I have seen it implemented in-memory multiple times (Comunica included). That's why I am looking at that feature as something that could be implemented in-memory by higher-level query engines and therefore something I'd be willing to sacrifice in favor of simplifying the API.
Complexity, I think, is one of the major pain points in RDF-land and I might very well be exceedingly biased against it.
All this said, I'd be happy to align with the less-concise version of `matchExpression` if you believe that multi-term filter expressions are likely to be easily implemented. Again, I have come to this world much later than you have.
Perhaps we can discuss this in a separate issue/PR that would be a dependency of this PR?
Yes, although I think that is a discussion that we should enter only if we generally agree on the idea and on its complementary nature to what we're discussing in here.
Upon reflection, the proposed `Transform` stream should probably be a bit smarter. It should be able to delegate supported operations to the `Source` it is applied to while executing unsupported operations in-memory.
To sum it up, it does not allow multi-term filtering expressions within the same query.
Exactly.
I don't think I have ever seen support for multi-term expressions at the persistence/optimized level in JS-land
Indeed, I have not seen that yet either in JS.
All this said, I'd be happy to align with the less-concise version of matchExpression if you believe that multi-term filter expressions are likely to be easily implemented.
I agree with your assumption that multi-term filter expressions are not going to be supported often by sources. However, that does not mean that none of them will support it. IMO, the purpose of the RDFJS low-level APIs should be to enable implementations with a high level of (potential) expressivity, without making too many assumptions.
In the world of SPARQL query optimization, pushing down filters (in their most general form) to the storage layer is a very typical form of optimization. So I think we should remain as open as possible there.
I'm not sure if Apache Jena supports pushing down filters in ARQ (I would assume it does). If so, perhaps we can have a look at their API to get some inspiration on whether or not we are on the right track.
@rubensworks
In the world of SPARQL query optimization, pushing down filters (in their most general form) to the storage layer is a very typical form of optimization. So I think we should remain as open as possible there.
I'm still afraid of what could be perceived as an overly-complex API (wrongfully, but that's often the case) but I appreciate your point and I'm happy to go along with your proposal.
Two open questions for me:
1) Shouldn't the `supportsMatchExpression` method accept the same arguments as the `matchExpression` method?
2) In case of multiple expressions, only one of which is not supported by an implementation of `Source`, how should we model the error so that a query engine could rework the query to do the unsupported part in-memory?
I'm still afraid of what could be perceived as an overly-complex API
Let's let this rest a bit to hear what other people think about this.
Shouldn't the supportsMatchExpression method accept the same arguments as the matchExpression method?
I would say yes.
In case of multiple expressions, only one of which is not supported by an implementation of Source, how should we model the error so that a query-engine could rework the query to do the unsupported part in-memory?
Instead of throwing an error, `supportsMatchExpression` could also just return a boolean. In that case, a query engine could iteratively lower the complexity of expressions as follows (for example):
```js
if (source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
})) {
  return source.matchExpression(variable('s'), null, variable('o'), null, {
    operator: '<=',
    args: [
      variable('o'),
      variable('s')
    ]
  });
} else if (source.supportsMatchExpression(variable('s'), null, null, null, {
  operator: '<=',
  args: [
    literal('0'), // Just a dummy literal
    variable('s')
  ]
})) {
  const objects = source.match(null, null, variable('o'), null);
  return objects.map(object => source.matchExpression(variable('s'), null, object, null, {
    operator: '<=',
    args: [
      object,
      variable('s')
    ]
  }));
} else {
  const all = source.match(null, null, null, null);
  // Use full in-memory filtering
}
```
Instead of throwing an error, supportsMatchExpression could also just return a boolean. In that case, a query engine could iteratively lower the complexity of expressions as follows [...]
I guess this brings in another question. Let's say we have a composite query such as what follows:
```js
source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})
```
This could fail in 3 ways: the `AND` operator, the multi-term `>=` comparison, or the `REGEXP` operator might each be unsupported.
I think `supportsMatchExpression` should give some sort of indication as to what part of the query is not supported. How else would a query engine (or anything else) be able to figure out what to do in-memory and what to delegate to the `Source` implementation?
Do we consider that expressions can have other expressions as arguments? In that case, I think support for each of those expressions may need to be checked separately first. As I understand @rubensworks' example, a query engine could simply have all the alternative querying paths prepared and then just check support for them in a specific order. Reconstructing queries based on errors from support detection seems a bit more complex. @jacoscaz, do you have a code snippet which could showcase how you would imagine using those error messages?
@elf-pavlik
Reconstructing queries based on errors from support detection seems a bit more complex.
Yes, although I believe it would still be less complex than working out exactly what is supported by the `Source` implementation based on a `boolean` result that only points to whether the implementation supports a query in its entirety.
I'll go back to my last example:
```js
source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})
```
For this query, I've hypothesized three failure modes above.
If I were constrained to using the `supportsMatchExpression` method and I were working with an implementation of `Source` that does not support multi-term expressions but does support everything else, I would appreciate something like the following:
```js
> console.log(source.supportsMatchExpression(/* ... */))
{
  supported: false,
  reasons: [
    { feature: MULTI_TERM_EXPRESSION, path: args[0].args[1] }
  ]
}
```
The `reasons` array, which could perhaps be optional, would save me a ton of effort in figuring out how to work around what is missing.
The `reasons` array, which could perhaps be optional, would save me a ton of effort in figuring out how to work around what is missing.
Do you mean figuring it out as a developer trying queries manually, or would you want to write code which, based on those reasons, would try different queries?
@elf-pavlik both, although a developer would presumably look at the documentation for the given `Source` implementation and easily figure it out. My concern is with anything that might want to optimize queries across multiple implementations of `Source`, such as query engines.
I don't have experience with writing query engines. The simple example provided by @rubensworks, trying specific queries in order, looks quite approachable. I'd like to see a comparable snippet where the code tries to create a query based on the reason from a failing support check.
On the other hand, if the reason could stay optional, I don't see a big difference between returning a `boolean` vs. an `object` with a `supported` property. I think the difference may come from having the reason required or optional. In that case, a query engine which can adjust based on those reasons would still need to handle cases where it doesn't get a reason in the result of a support check.
@elf-pavlik I was trying to write such a snippet and, in the process of doing so, figured I am probably wrong.
I was under the impression that @rubensworks' query might have been too easy to progressively simplify, as it uses a common operator (`>=`) and single-level expressions (no logical operators such as `AND` and `OR`). Nested and more complex expressions, I thought, would be much harder to deal with, as one would not know where to start from.
However, assuming simplification can always be done from the outside to the inside, peeling back each layer of the query, perhaps even the most complex of queries could simply be dissected into supported sub-queries joined by in-memory implementations dealing with the unsupported outer expressions.
Incidentally, I think this lays out a really elegant architecture for a recursive function that walks the expression tree and stops the recursion whenever it finds a supported expression.
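For illustration, a rough sketch of that recursive walk (all helper names here are hypothetical, not part of any proposal):

```js
// Walk the expression tree top-down; delegate any subtree the source
// supports, evaluate the unsupported outer operators in-memory.
function plan(source, s, p, o, g, expression) {
  if (source.supportsMatchExpression(s, p, o, g, expression)) {
    // Whole subtree supported: stop recursing and push it down.
    return () => source.matchExpression(s, p, o, g, expression);
  }
  // Recurse into nested expressions (args that are expressions, not terms).
  const subPlans = (expression.args || [])
    .filter(arg => arg && arg.operator)
    .map(arg => plan(source, s, p, o, g, arg));
  // evaluateInMemory is a hypothetical in-memory join/filter step.
  return () => evaluateInMemory(expression, subPlans);
}
```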
@rubensworks - seems like you already had this proposal in your head!
I still favor having a `TermFilter` interface with a `test` method.
You are mainly talking about the use case of a query engine with only known types of operations.
The plain object structure would support custom operations, but it would require changing the code of the match methods. If that is not possible, a fallback must be implemented. That might turn every usage of the interface into some kind of query engine. It will get very complex for use cases where different `Store` implementations with different supported operations are used. The mentioned Transform also cannot solve the problem, because it will only support a predefined set of operations and no custom filters.
The `TermFilter` interface would not only reduce the complexity for these use cases, but it would also fit better with the existing `Term` interface, which has an `equals` method.
I thought it might be good to try the concept, that's why I wrote a quick proof of concept which you can find here.
Besides the names, the properties of the `TermFilter` objects are the same as in the discussion in the last comments, so there should be no drawback for the query engine use case. A factory is used to create the `TermFilter` instances, which makes the code even more readable than plain objects. See the readme for more details.
@bergos
First of all, thank you for your example.
You are mainly talking about the use case of a query engine with only known types of operations. The plain object structure would support custom operations, but it would require to change the code of the match methods.
True, although I think this would happen anyway in your `_acceleratedMatchFilter` function. At one point or another, optimizing support for a given filter at the persistence level depends on having custom code for that filter. At least as far as I understand.
If that is not possible a fallback must be implemented. That might turn every usage of the interface into some kind of query engine.
This is very true, and part of the reason why I think we need different methods (i.e. we should not pollute `match`) and we might also need a shared implementation of one such query engine, unless we were to add a list of mandatory operators that implementors must support (such as what happens when a database aims to comply with a given standard). I prefer the former alternative as it allows implementors to only optimize for a subset of operators while also being able to quickly add non-optimized support for all of them.
The mentioned Transform can also not solve the problem because it will only support a predefined set of operations and no custom filters.
I am not sure I follow. As a developer who is thinking about how to write a query, I need a-priori knowledge of which operators are supported by the database I am working with. As a developer of the database itself, I need to choose which operators to optimize for from a list of known operators.
I guess what I am trying to say is that custom filters are bound to be handled in-memory as databases cannot optimize for filters they are not aware of. If we are to support optimization, we need to produce a list of available operators / expressions for which implementors might choose to optimize.
All this said, I am strictly focusing on the `stream` spec and I think that what you have exemplified fits very well within the `dataset` spec. The latter, although still defined as a low-level spec, seems to me to be a little further from the metal in terms of abstraction. `dataset` seems to focus on working with quads that have already been retrieved from whatever source one is dealing with.
@jacoscaz
assuming simplification can always be done from the outside to the inside, peeling back each layer of the query. I think this lays out a really elegant architecture for a recursive function that walks the expression tree and stops the recursion whenever it finds a supported expression.
Yep, that's what I had in mind, you just explained it much better than I did 🙂.
@bergos @jacoscaz
I think we need different methods (i.e. we should not pollute match)
👍
I think that what you have exemplified fits very well within the dataset spec.
I agree with this. The `TermFilter` is a useful abstraction, but IMO it fits better at a higher level, such as the Dataset spec.
By letting sources execute optimized filters, and requiring consumers to perform non-optimized in-memory filters themselves, this would lead to a cleaner separation of concerns IMO.
@jacoscaz @rubensworks Can you please explain where there are limitations in my proposal that make it unusable for your use case?
fits better on a higher level, such as the Dataset spec.
Even if you would like to implement only accelerated filters in `Source`, I don't think it's a good idea to define two different interfaces for it. We should keep the APIs consistent.
@bergos I don't see them as limitations but more as bad ergonomics.
As filters are created per-term, one issue I would struggle with is how to easily represent multi-term expressions, which I think, and I've come to agree with @rubensworks on this, must be supported at the lowest possible level of the RDF/JS spec (i.e. the one closest to the metal).
Your definition of `TermFilter` and the way I would find myself using it almost makes it look as if, no matter what filter you use, there will never be any difference in terms of performance. Correct me if I am wrong, but this, to me, seems intentional given what you have written in your example:

```js
matchFilter (subject, predicate, object, graph) {
  if (this.isAcceleratedMatchFilter(subject, predicate, object, graph)) {
    return this._acceleratedMatchFilter(subject, predicate, object, graph)
  }
  // ... fall back to non-accelerated filtering ...
}
```
Considering the low-level, close-to-the-metal nature of the `Source` spec, I do not think this is the right approach. There should be an explicit difference and a clear separation of concerns between what can be handled at the persistence level and what can be handled in-memory. In my mind, the goal of the `Source` spec is to correctly convey what can be done with a given implementation of that interface, including which operators are supported, so that I can write my queries accordingly (or structure my indexes accordingly, which `quadstore` will soon allow you to do).
I look at `Source` as something equivalent to the SPARQL standard or the SQL standard, where I am always able to find a clear list of supported expressions/operators, with implementors of those standards often warning that they individually support only a subset of those operators.
That said, I find `TermFilter` to be very useful at the `Dataset` level, where I expect developers not to care as much about the performance aspects of what they are writing.
One issue I would struggle with is how to easily represent multi-term expressions
Do you mean comparing two terms of the same quad? It's even easier than in your proposal:

```js
store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))
```
Your definition of TermFilter and the way I would find myself using it almost makes it look as if, no matter what filter you use, there will never be any difference in terms of performance.
No, you can ask beforehand with `isAcceleratedMatchFilter` whether it would run accelerated. If it doesn't run accelerated, it's the same as `.match`.
There should be an explicit difference and a clear separation of concerns between what can be handled at the persistence level and what can be handled in-memory.
That was never the intention of the `Source` interface. For example, you could also use the `Source` interface to generate quads based on an algorithm and process them in a streaming way, so you don't have to keep all of the quads in memory at the same time.
In my mind, the goal of the Source spec is to correctly convey what can be done with a given implementation of such interface, including which operators are supported so that I can write my queries accordingly (or structure my indexes accordingly, which quadstore will soon allow you to do).
There is already the `.match` method, which means my proposal is actually closer to the current definition of the `Source` interface. With the `isAcceleratedMatchFilter` method it should also fulfill your requirements.
I really would like to have more advanced filtering in the `Source` interface, but your proposal doesn't cover the full scope of the `Source` interface and I'm strongly against limiting the interface to a reduced set of use cases. If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.
@bergos
Can you please explain where there are limitations in my proposal that make it unusable for your use case?
I agree with @jacoscaz. I don't see any limitations with your proposal. I consider both proposals having a similar level of expressivity.
your proposal doesn't cover the full scope of the Source interface and I'm strongly against limiting the interface to a reduced set of use cases.
How so? AFAICS they are both as expressive. Unless I'm missing something?
```js
store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))
```
I definitely agree that this (proposal 1) is easier to write than something like this (proposal 2):
```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
});
```
However, I think we're comparing different levels of abstractions here.
We could easily introduce a similar helper tool that produces the expression from proposal 2, e.g. `filter.lte(variable('o'), variable('s'))`. So I would not focus too much on these helper methods, but on the format they should produce and what should be passed to `matchExpression`.
As I see it, the main (remaining) difference between proposals 1 and 2 is in how they handle expressions that are not supported by the `Source`. While proposal 1 incorporates a filter argument that can be used as a fallback by the `Source`, proposal 2 requires the `Source` to do nothing and the consumer to do the filtering itself.
Am I correct in my analysis of these differences? If so, then it seems to me like proposal 1 could easily be built on top of proposal 2, right? My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange not to do the same.)
An important advantage of using proposal 2 as a basis I just thought of is the fact that it can easily cross programming language borders, due to it being purely declarative and not using any JS functions. This would be extremely important when running within Web browsers and you want to delegate certain filtering to a WASM source, or when running in Node.js and you need to communicate with C/C++ over NAN.
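To illustrate this point, a small sketch with assumed object shapes (not spec text): the declarative expression survives serialization as plain data, while a filter carrying a JS function does not:

```js
const declarative = {
  operator: '<=',
  args: [
    { termType: 'Variable', value: 'o' },
    { termType: 'Literal', value: '2' },
  ],
};
// Can be handed as plain data to WASM, C/C++ over NAN, or another process.
JSON.stringify(declarative);

const withFunction = {
  type: 'lte',
  test: term => term.value <= '2',
};
// JSON.stringify silently drops the `test` function: the receiving side
// would have no way to reconstruct the filter's logic.
JSON.stringify(withFunction);
```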
If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.
Don't worry, we'll get there 🙂
@bergos
Do you mean comparing two terms of the same quad? It's even much easier than in your proposal:
```js
store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))
```
I personally do not find what you suggest any easier to grasp than having the expression represented separately from the terms. This is probably a simple matter of taste and/or habit, though. In terms of expressive power, I cannot see any difference between the two.
If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.
That is not a direction I would like to take. I appreciate what RDF/JS is trying to achieve and I am trying to contribute to the same goal. I might argue against or towards specific solutions, even passionately at times, but that's because I honestly believe they represent superior compromises. I hope I have not come across as exceedingly or needlessly confrontational; if so, that was not my intention and I apologize.
I fundamentally agree with the following points @rubensworks has just made:
While proposal 1 incorporates a filter argument that can be used as a fallback by the Source, proposal 2 requires the Source to do nothing, and the consumer to do the filtering itself.
My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange to not do the same)
I hope I have not come across as exceedingly or needlessly confrontational; if so, that was not my intention and I apologize.
Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives, while nobody takes the time to have a look. It seems to me that some people are only interested in making the interface as close as possible to their existing code and ignore other cases.
First, one important fact: everyone has to implement the match method, no matter whether it runs accelerated or not. That's already part of the current spec. Adding support for TermFilter would be very easy, and I showed that already in my PoC.
My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange to not do the same)
The TermFilter proposal contains all the properties of the other proposal; the only difference is an additional method. That means you have your declarative structure.
And again you are focusing on query engines. There are many use cases where a simple match is enough, and still the TermFilter proposal would contain everything required for your query engine use case.
An important advantage of using proposal 2 as a basis I just thought of is the fact that it can easily cross programming language borders, due to it being purely declarative, and is not using any JS functions. This would be extremely important when running within Web browsers and you want to delegate certain filtering to a WASM source, or when running in Node.js and you need to communicate with C/C++ over NAN.
You have all the properties for code that runs accelerated. Anyway, you have to translate the Quads again in JS to get full-featured RDF/JS Quads. The RDF/JS spec has been written for JavaScript and it was clearly stated that the spec focuses on JS only. I agree that we may align it a little bit with WASM use cases if possible, but it doesn't make any sense to give up a consistent API for a filter when the Quads themselves don't have a plain object structure. If there is any benefit to a plain object, it doesn't matter compared to the Quads. I haven't done much WASM stuff yet, but I guess it's a common pattern to convert plain objects to full-featured classes in both directions.
Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives and nobody takes the time to have a look. It seems to me that some people are only interested to make the interface as close as possible to their existing code and ignore other cases.
In the interest of clarity and transparency: I did look at your example, and I am not trying to make the interface as close as possible to my code, as I don't even have a well-established codebase to refer to.
Everyone has to implement the match method, no matter if it runs accelerated or not.
Agreed and I think we are all in agreement about this.
Adding support for TermFilter would be very easy and I showed that already in my PoC.
I agree that this is easy, although it still doesn't seem practical or ergonomic to me. I have already explained why and I don't think it would help if I were to repeat myself. I understand we do not agree on this and I am happy to come to your side if this means pushing this matter forward. The expressive power of the two solutions is still the same AFAIU.
In the spirit of building consensus towards a solution we can all agree with, a few questions:

- would you agree with the fact that, if we were to proceed with `TermFilter`, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?
- would you agree with the fact that a shared implementation of the in-memory `test()` fallback for the most common filters would be a nice addition to the ecosystem?

I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches. Based on that, we may have better ground to evaluate the pros and cons of those alternative approaches. I think any of you three could propose the first (simplest) example and then the next person would add another, slightly more complex one, and so on.
@elf-pavlik that's a worthy exercise, I think. Here's a very simple example:
```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '>=',
  args: [
    variable('o'),
    literal('2020-01-01', 'xsd:dateTime'),
  ],
});
```

```js
source.matchFilter(
  null,
  namedNode('lastSeen'),
  filters.gt(literal('2020-01-01', 'xsd:dateTime')),
  null,
);
```
At this level of complexity, I think the two alternatives are almost 100% equivalent in terms of ergonomics and practicality. Actually, at this level of complexity I think @bergos' form is better, as it is more consistent with the rest of the spec, although I would still maintain that the `gt` filter would have to be an implementation of a filter defined within the spec itself.
@bergos
Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives and nobody takes the time to have a look
I'm sorry you feel that way. I did look at your implementation, but I do not agree with certain parts of it, which is why we are having this discussion.
Adding support for TermFilter would be very easy and I showed that already in my PoC.
Easy from whose perspective? I have already shown that it's not easy at all in some cases, such as the NAN/WASM cases and handling unwanted filter support.
The TermFilter proposal contains all properties of the other proposal
Almost, there's still the difference in how the expressions are represented. Your proposal incorporates this into terms, which is less expressive than the other proposal, as @jacoscaz and I had discussed earlier.
the only difference is an additional method. That means you have your declarative structure.
Indeed, and that's the point we disagree on. This additional method voids the declarative structure IMO, and complicates things significantly.
The `equals()` method on quads and terms is related to this, and IMHO it does more harm than good. Personally, I ran into more problems related to it (WASM, NAN, ...) than cases where I actually was able to use it. As such, I strongly disagree with adding such a method here as well. I'm of the opinion that this test method can easily be moved to a higher level, but it should IMO not become part of the low-level spec.
Another major problem with this test method I just realized is that it requires consumers to implement these test methods themselves, even if the consumer does not want to support certain types of filters. By remaining fully declarative, the `Source` would fail upon encountering an unsupported filter, and the consumer could easily detect this failure and handle the error accordingly.
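For example, a sketch of how that detection could look from the consumer's side, assuming (this is an assumption, not spec text) that the failure surfaces as a regular stream `error` event:

```js
const stream = source.matchExpression(subject, predicate, object, graph, expression);
stream.on('error', () => {
  // Unsupported expression: fall back to a plain match() and
  // filter the resulting quads in-memory.
  const all = source.match(subject, predicate, object, graph);
  // ... in-memory filtering of `all` goes here ...
});
```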
it doesn't make any sense to give up a consistent API for a filter when all the Quads don't have a plain object structure.
The difference with the `equals()` method of quads and terms is that it can be hooked in via prototypes, as there is only a small number of possible `equals()` implementations. For filters, on the other hand, the number of possible implementations is infinite, which makes deserialization from plain objects much trickier. A fully declarative approach would not have this problem.
@jacoscaz
- would you agree with the fact that, if we were to proceed with TermFilter, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?
The list of operators seems like a must to me indeed.
- would you agree with the fact that a shared implementation of the in-memory test() fallback for most common filters would be a nice addition to the ecosystem?
Yes, I agree that this would be nice to have. However, it should remain optional, and should not be part of the low-level spec IMO.
@elf-pavlik
I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches.
That may indeed be a helpful exercise :-)
Let me build upon @jacoscaz's snippet:
```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '&&',
  args: [
    {
      type: '>=',
      args: [
        variable('o'),
        literal('2020-01-01', 'xsd:dateTime'),
      ],
    },
    {
      type: '<=',
      args: [
        variable('o'),
        literal('2021-01-01', 'xsd:dateTime'),
      ],
    }
  ],
});
```

```js
source.matchFilter(
  null,
  namedNode('lastSeen'),
  filters.and(
    filters.gte(literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(literal('2021-01-01', 'xsd:dateTime'))
  ),
  null,
);
```
This is still doable in both approaches. Note that I don't consider the first option to be more complex than the second. One could easily add helper functions such as @bergos's `filters` to do something like this (modulo test functions):
```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null,
  filters.and(
    filters.gte(variable('o'), literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(variable('o'), literal('2021-01-01', 'xsd:dateTime'))
  )
);
```
Perhaps we can try an expression with two variables as next use case?
would you agree with the fact that, if we were to proceed with TermFilter, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?
From my readme:
Besides the TermFilter interface, a set of filters should be included in the specification. The filters.js file contains a set of filters that may be good candidates to guarantee interoperability for accelerated execution.
would you agree with the fact that a shared implementation of the in-memory test() fallback for most common filters would be a nice addition to the ecosystem?
Maybe.
I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches.
That doesn't matter much, because helper functions could be used so they look the same in the end. It's more important what happens after the method call. That should be compared with cases where only the `match` method is used and with cases where the new filtering is used. My repository already contains some examples, including one with a `match` method call. I think somebody supporting the other proposal should translate the examples to that proposal.
I'm sorry you feel that way.
Let me quote myself:
I'm a bit tired of repeating stuff again and again
and
Everyone has to implement the match method, no matter if it runs accelerated or not. That's already part of the current spec. Adding support for TermFilter would be very easy and I showed that already in my PoC.
and
The RDF/JS spec has been written for JavaScript and it was clearly stated that the spec focuses on JS only.
The difference with the equals() method of quads and terms is that this can be hooked in via prototypes, as there is only a small number of possible equals() implementations. For filters on the other hand, the number of possible implementations is infinite, which makes deserialization from plain objects much trickier. A fully declarative approach would not have this problem.
That's a comparison of apples with pears. With the TermFilter proposal you would only forward the known filters to the WASM code, which means there is no need to know the logic of the `test` method. Unknown filters could only be handled in JS, unless you transpile them. With your proposal you would forward only known filters. I don't see any difference.
Perhaps we can try an expression with two variables as next use case?
In one of the earlier comments I see one example; I've changed the operator to make comparing with the IRI of the subject more realistic:
```js
source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('s')),
  null
)
```

```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '==',
  args: [
    variable('o'),
    variable('s')
  ]
});
```
Looks like something that finds statements where a node has some relation with itself. Could someone provide a more interesting example with two variables? @jacoscaz, in this example you provided some time ago, I don't see how `>=` is supposed to work on the subject, which needs to be an IRI:
```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})
```
@elf-pavlik
in this example you provided some time ago I don't see how >= supposed to work on subject which needs to be IRI
WRT the IRI issue, I am under the impression that RDF also supports "reverse" predicates, something like:

```ttl
"jacopo"@it ex:firstNameOf <http://jacoscaz> .
```
If this is the case, then subjects can be IRIs, blank nodes, literals, ... I am 100% sure I have used at least one store supporting reverse predicates, but that might have been an exception to the rule.
WRT the `>=` applied to IRIs issue, I guess the question boils down to whether we intend to provide support (i.e. have an operator) for lexicographical order comparison. I've seen it used a few times, but only to optimize queries for which no better operators were supported. For example, a query based on the regexp `/^hello[0-9]{2}/` could be reimplemented as a query targeting terms `>= "hello" AND <= "hello<boundary_char>"`, filtering out the rest in-memory.
@bergos
Besides the TermFilter interface, a set of filters should be included in the specification.
I had missed this as I was trying to quickly locate the bits of code I wanted to think about. Looks like we all agree on this point - good!
That doesn't matter much, because helper functions could be used so they look the same in the end. It's more important what happens after the method call.

I disagree. This matters to me, although I respect the fact that it does not to you. I will build upon @elf-pavlik's example:
```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), {
  operator: '&&',
  args: [
    { operator: '==', args: [ variable('o'), variable('s') ] },
    { operator: '==', args: [ variable('g'), namedNode('http://some/graph') ] },
  ]
});
```

```js
source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('s')),
  filter.eq(namedNode('http://some/graph'))
)
```
With the first, more declarative style, the entirety of the expression is contained within the fifth parameter. This makes it a lot easier for me to read (and this is subjective) but also to parse into an expression tree. With the second style, I find it harder to understand what the combined effect of all the filters is. Parsing them also seems a bit harder.
Generally speaking, this is an example of what I refer to when I use the term ergonomics. However, the two could be combined in the following:
```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));
```
I find this rather appealing, and this form would also lend itself better when building the code that would concatenate all the `.test()` fallbacks.
With your proposal you would forward only known filters. I don't see any difference.
I think the main point of contention is whether the `match()` method itself (or a `matchFilter` method) would be allowed to return an error in the case of an unsupported operator, or whether it would have to use the `.test()` method fallback because that would be mandated by the spec.
Given an implementation of `Source`, I (as a developer) need to carefully write my queries around what I know to be the supported operators. If we agree on adding operators to the spec and leaving `Source` implementors free to specify which ones they support, then I believe that throwing errors would be a vastly more ergonomic strategy.
EDITED: fixed a leftover/redundant/bad use of `variable` in one of the examples.
If this is the case, then subjects can be IRIs, blank nodes, literals, ... I am 100% sure I have used at least one store supporting reverse predicates, but that might have been an exception to the rule.
https://www.w3.org/TR/rdf11-concepts/#section-triples clearly states:
- the subject, which is an IRI or a blank node
- the predicate, which is an IRI
- the object, which is an IRI, a literal or a blank node
One could use tricks like

```ttl
ex:jacosName
  a rdfs:Literal ;
  rdf:value "jacopo"@it .

ex:jacosName ex:firstNameOf <http://jacoscaz> .
```

and sort of get a literal in the subject position, but I really don't see it used there directly.
@jacoscaz could you please use language tags in your snippets for syntax highlighting? I took the liberty of editing some of your recent comments to add js language tags to the code fences.
https://www.w3.org/TR/rdf11-concepts/#section-triples clearly states:
I'm really happy to stand corrected! This wonderfully simplifies a number of things on my end. And yes, I will be more careful with using language tags. Thank you for fixing my posts.
I'm for closing this PR, because most comments discuss an approach that goes against the core idea of the RDF/JS specs: defining JavaScript-idiomatic APIs. If people would like to discuss the basic concepts of the RDF/JS spec, this PR is not the best place.

Other PRs to add more advanced filtering are very welcome, but use cases covering the full scope of the `Source` interface should be collected first in an issue.
use cases covering the full scope of the Source interface should be collected first in an issue.
Good point, I've created #17 for this. Shall we also move the snippets started by @elf-pavlik to this issue, or shall we create a separate one for it?
I'm for closing this PR
I would actually keep this PR open, and update it with the parts that we all agree on already. Because I think we're actually more aligned than it seems. We can discuss the remaining details after that.
@bergos and all
Before closing this PR, I would kindly ask you to consider what follows.
In my previous comment I have tried to reconcile the two approaches by proposing the following:
```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));
```
In my mind, this gives us:

- support for a fallback method `Filter#test()`, as per your concern;
- a declarative structure, e.g. via something like `Filter#toDeclarativeStructure()`, as per @rubensworks' concern, which I share.

I would also add:

- `matchExpression` should throw an error if an unsupported instance of `Filter` is passed to it;
- there could, and should in my opinion, be a case to be made for a shared implementation of the code that would be needed to concatenate the various `.test()` fallbacks in the case of expressions with a mix of supported and unsupported `Filter` instances.

Could this make for a good, if perhaps incomplete, compromise?
@bergos @rubensworks if you have time to do so I'd appreciate your feedback on the above. Then, if everyone agrees, I'll probably close this PR and open a new one starting from what we find ourselves in agreement about and the remaining open questions.
@jacoscaz I agree with the above, except for the following:
support for a fallback method Filter#test() as per your concern
I'm still of the opinion that this should become part of a different spec, or at least should become optional.
there could, and should in my opinion, be a case to be made for a shared implementation of the code that would be needed to concatenate the various .test() fallbacks in the case of expressions with a mix of supported and unsupported Filter instances.
Same as above.
(Apologies for the delayed answer)
As I pointed out in the issue about removing `.equals`, the topic of plain objects vs. objects with methods is not as easy as it may look at first glance and should therefore not be discussed in this PR. That's why I'm for the `.test` method. I also understand the use cases where only optimized filters are handled, but it's not consistent with the existing `Source` interface and the `.match` method. I would propose to define a new interface just for that use case. Maybe it can be called `OptimizedSource`, so the name covers the use case. The extension of the `Source` interface, the new `OptimizedSource`, and the interface of the filters should be discussed together, and in the end one PR should be made for all three interfaces to avoid incompatibilities.
Hi all. Coming back to this after a few busy weeks. The more I read and think about it, the more I agree with @bergos' proposal of having different interfaces.
I'm going to quote a few paragraphs from the other discussion:
Using an index to solve a triple pattern is not a query engine.
I do not agree with this limitation to the definition of a query engine. Selecting which index is better suited for a given query definitely falls within the scope of a query engine.
It should be possible to define custom filters.
I agree with the fact that custom filters should be supported but, due to their very nature, they cannot be optimized for at the persistence level.
Our use case would be on the optimization of query engines by pushing down filters within the query plan to the storage level.
This necessarily requires a descriptive approach to representing filters, in such a way that they can be easily translated across languages. Not doing so would greatly limit the number of viable persistence solutions.
Based on all this, it looks to me like custom filters and persistence-level optimization cannot be reconciled into a single interface. However, it also looks to me like they do fit within a hierarchy: a base filtering source, limited to the declarative filters defined in the spec, and an advanced filtering source that additionally accepts custom filters.

Both the base filtering source and the advanced filtering source can be designed to support idiomatic filters and the serialization of these filters into declarative data structures. `Source` implementors dealing with persistence-level optimization would only be able to optimize up to the base filtering source level.

Filter implementations would also have to be split across two levels: declarative filters, which can be serialized and optimized for at the persistence level, and idiomatic filters, which additionally carry an in-memory `.test()` method.
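For illustration, a sketch of what those two levels could look like in practice (object shapes assumed, not spec text):

```js
// Level 1: a purely declarative filter that a persistence layer can
// recognize and optimize for.
const declarativeFilter = {
  operator: '>=',
  args: [literal('2020-01-01', 'xsd:dateTime')],
};

// Level 2: an idiomatic filter carrying the same declarative data plus an
// in-memory fallback, along the lines of TermFilter#test().
const idiomaticFilter = {
  ...declarativeFilter,
  test: term => term.value >= '2020-01-01',
};
```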
method)What do you think @bergos and @rubensworks ? Also @elf-pavlik ? Could this be a viable approach?
This is a very early draft aimed at adding support for range-based matching to the `Source` interface.

The rationale for this change is that the `Source` interface is often implemented by persistence libraries (such as `quadstore`) that are often used in contexts where being forced to perform in-memory filtering of quads can lead to unacceptable query latencies and CPU utilization, particularly when compared to the alternative of delegating such filtering to the persistence layer of choice (such as `leveldb` in the case of `quadstore`).

I am unfamiliar with the spec process, `ReSpec` and `WebIDL`, and I am uncertain as to how to represent the fact that `RangeSource` extends `Source`, overriding `match()` to add support for `Range` arguments. This PR is meant as a starting point to push the conversation forward on these matters. All feedback is welcome.

Thank you @bergos for pitching in on Gitter!