rdfjs / stream-spec

RDF/JS: Stream interfaces – A specification of a low level interface definition representing RDF data independent of a serialized format in a JavaScript environment.
https://rdf.js.org/stream-spec/

WIP: support for range-based matching #16

Closed jacoscaz closed 4 years ago

jacoscaz commented 4 years ago

This is a very early draft aimed at adding support for range-based matching to the Source interface.

The rationale for this change is that the Source interface is often implemented by persistence libraries (such as quadstore) used in contexts where being forced to perform in-memory filtering of quads can lead to unacceptable query latencies and CPU utilization, particularly when compared to the alternative of delegating such filtering to the persistence layer of choice (such as leveldb in the case of quadstore).

I am unfamiliar with the spec process, ReSpec, and WebIDL, and I am uncertain how to represent the fact that RangeSource extends Source, overriding match() to add support for Range arguments. This PR is meant as a starting point to push the conversation forward on these matters. All feedback is welcome.

Thank you @bergos for pitching in on Gitter!

bergos commented 4 years ago

I would propose a slightly different approach. You can have a look at this branch, where I quickly hacked together some interfaces. It lacks some details but should cover the core concept: a TermFilter instance has a test method that can be used to check if a Term fulfills the criteria of the filter. There are two reasons why I would favor that approach:

rubensworks commented 4 years ago

@jacoscaz I very much like this idea! But I feel like the current interface is a bit too restrictive and may not be future-proof enough for other types of filters.

As I see it, this PR is a first step towards pushing filters (in this case range-based filters) into the Source itself. As such, I would propose to have a look at the SPARQL algebra, as that gives us a very good example of what kinds of filters are possible, and of how you can declaratively define them. The SPARQL algebra uses declarative expressions to achieve filtering (see here the algebra typings that Comunica uses). For example, a range-based filter could be represented as an OperationExpression.

To make feature-detection of filter-based matching easier, I would propose to add an optional method to the Source interface. This could for example look like this:

interface Source {
  matchExpression?(subject: Term, predicate: Term, object: Term, graph: Term, expression: Expression): Stream
}

I'm wondering if now would be a good time to also start thinking about (not necessarily defining) how ordering could be requested as well? Because that may influence this PR as well. Perhaps we can also add a parameter for this to matchExpression?

@bergos I would actually vote against such a TermFilter because it complicates query optimization. IMO, users (e.g. query engines) of Sources should be able to detect whether or not a Source has a dedicated/efficient implementation of a given filter. If such an implementation is available, a user could make use of it. If no such implementation is present, then the user could decide to call the regular .match function instead, and perform any kind of filtering manually (which may potentially have certain kinds of custom optimizations). Furthermore, the availability of such a dedicated filter implementation could even influence the query plan within query engines. So it is really critical that this information is explicitly available.
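For illustration, feature detection and fallback could then look like this (a rough sketch, assuming the optional matchExpression method proposed above; filterQuads is a hypothetical in-memory helper):

let stream;
if (source.matchExpression) {
  // the Source can evaluate the expression itself, possibly at the persistence layer
  stream = source.matchExpression(subject, predicate, object, graph, expression);
} else {
  // fall back to plain pattern matching plus manual, in-memory filtering
  stream = filterQuads(source.match(subject, predicate, object, graph), expression);
}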

jacoscaz commented 4 years ago

@bergos thank you for your example. I now realize I had misunderstood what you had written on Gitter - apologies. Nonetheless, here's my take on what you suggest.

Changing the Source interface to accept TermFilters instead of extending it into a different interface would, IMHO, naturally lead to the reasonable expectation that any spec-compliant implementation of Source would come with baked-in support for TermFilters.

On one side, the viability of this would depend on having a shared reference implementation of a filtering iterator that developers of non-optimized Source implementations would be able to use to quickly add support for in-memory filtering.

On the other side, it would still hide whether support for filtering at the persistence layer (optimized) is present within any given Source implementation, making it impossible to implement informed query planning at a higher level. Incidentally, this is also why something like a getApproximateCount() method is needed.
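As a sketch of what I mean (getApproximateCount() is hypothetical and not part of this PR), a query planner could use approximate counts to pick the more selective pattern first:

// estimate cardinalities before deciding the join order
const knowsCount = await source.getApproximateCount(null, namedNode('ex:knows'), null, null);
const worksForCount = await source.getApproximateCount(null, namedNode('ex:worksFor'), null, null);
// resolve the pattern with the lower estimated cardinality first, then join against it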

All in all, I don't think in-memory filtering should be treated as an equal alternative to persistence-level filtering.

jacoscaz commented 4 years ago

@rubensworks

I like your idea of adding a different method to Source. From the perspective of letting consumers know whether support for persistence-level filtering is present within a given Source implementation, I think both this approach and extending Source into a new interface work equally well, and I would be ok with either.

With regards to what you propose, I like the general direction in which you're going and I definitely agree that sorting does belong in this discussion. However, I would be wary of adding too much complexity to an interface that is otherwise wonderfully simple. In the case of the expression parameter, how would I go about representing which expressions are supported and which are not? I agree that explicit support for Range and Term params alone might be restrictive but, IMHO, it would allow the spec to maintain the current level of simplicity - which I think is a feature, not a bug.

All this is strictly my opinion. I am happy to push the conversation forward. Thank you both for pitching in!

EDIT: grammar fixes

rubensworks commented 4 years ago

I think both this approach and extending Source into a new interface work equally well and I would be ok with either.

Sure, that's true. I would also be fine with either option.

I agree that explicit support for Range and Term params alone might be restrictive but, IMHO, it would allow the spec to maintain the current level of simplicity.

True, for this specific feature, just doing this would be the simplest thing. The problem is that there are many of these simple things that are possible to add. The complexity here does not lie in the separate features, but how these features should be handled and combined. So if we were just to add this range filter to the spec, and nothing more in the future, then I'd fully agree with your current solution. However, I don't think this approach will scale to many features, which is why I think we should introduce a robust framework in which other features could be added later on, and I think SPARQL algebra could be a good basis/inspiration for this. We should not reinvent the wheel :-)

Let me give a simple example:

const source = new MySource();
source.matchExpression(namedNode('...'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});

Spec-wise, this would require defining a list of the allowed operators, and what kind of args they allow. Just like in SPARQL, this would easily allow us to create more complex nested expressions, where each arg could become another operator expression.
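For example, a conjunction of two range conditions could be expressed by nesting expressions (a sketch; the operator names would have to be defined by the spec):

source.matchExpression(namedNode('...'), null, variable('o'), null, {
  operator: '&&',
  args: [
    { operator: '>=', args: [ variable('o'), literal('1', namedNode('http://www.w3.org/2001/XMLSchema#decimal')) ] },
    { operator: '<=', args: [ variable('o'), literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal')) ] }
  ]
});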

jacoscaz commented 4 years ago

@rubensworks I intuitively agree and also defer to your expertise. I would suggest keeping the list of operators as small as possible in order to be able to make support for all operators mandatory, at least in the beginning. If an operator is there, all implementors of either the additional interface or the additional method should support it.

jacoscaz commented 4 years ago

Let's wait a day or so for others to have a chance to pitch in and then I'll update the PR.

bergos commented 4 years ago

I think it would be best if people express their expectations for this API. Then we can define requirements and derive proposals for an API. Some thoughts after the first feedback:

About the feedback on my proposal: as I said, it was very quickly hacked together, but with some slight changes, while keeping the core concept, it can handle all of the mentioned requirements.

@rubensworks

users (e.g. query engines) of Sources should be able to detect whether or not a Source has a dedicated/efficient implementation of a given filter. If such an implementation is available, a user could make use of it. If no such implementation is present, then the user could decide to call the regular .match function instead, and perform any kind of filtering manually (which may potentially have certain kinds of custom optimizations). Furthermore, the availability of such a dedicated filter implementation could even influence the query plan within query engines. So it is really critical that this information is explicitly available.

That would be possible by adding a method to check if the given arguments will use dedicated filter code. My bigger fear is that your proposal will require implementing every kind of filter, whereas an implementation may only want to handle time series and therefore only filter ranges of datetimes.

@jacoscaz

Changing the Source interface to accept TermFilters instead of extending it into a different interface would, IMHO, naturally lead to the reasonable expectation that any spec-compliant implementation of Source would come with baked-in support for TermFilters.

I was just lazy. I'm also for a new interface, and maybe even a different method.

On one side, the viability of this would depend on having a shared reference implementation of a filtering iterator that developers of non-optimized Source implementations would be able to use to quickly add support for in-memory filtering.

It's actually the other way around. With the filter logic in the .test method, the code is almost the same as with the current interface for the Term objects. It should be simple enough that no shared code is required. Of course, you are free to do so.

On the other side, it would still hide whether support for filtering at the persistence layer (optimized) is present within any given Source implementation, making it impossible to implement informed query planning at a higher level. Incidentally, this is also why something like a getApproximateCount() method is needed.

Could be added with an additional method.

All in all, I don't think in-memory filtering should be treated as an equal alternative to persistence-level filtering.

But that's already the case with the current definition of the .match method.

elf-pavlik commented 4 years ago

Would it help to bring back the idea of low-level and high-level interfaces? I have the impression that what @rubensworks proposes looks like a lower-level interface, while @jacoscaz's looks a little more high-level.

source.matchExpression(null, null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});

seems to map roughly to

store.match(null, null, { 
    lte: factory.literal("2", "http://www.w3.org/2001/XMLSchema#decimal"),
}, null);

If it's allowed to support only some kinds of filters, how would you like feature detection to work? This can be very tricky; e.g. one implementation may focus only on time-series data and support range filters only on specific properties.

I think we need an extensible and discoverable interface; some stores may want to implement very specific features, e.g. geospatial queries.

jacoscaz commented 4 years ago

@elf-pavlik

Would it help to bring back the idea of low-level and high-level interfaces? I have the impression that what @rubensworks proposes looks like a lower-level interface, while @jacoscaz's looks a little more high-level.

I agree and, considering this PR is for a low-level spec, I am inclined to disregard my own proposal in favor of something more foundational like what @rubensworks is proposing.

I think we need extensible and discoverable interface, possibly some stores would want to implement very specific feature, eg. geospatial queries.

I agree, and I think we could reconcile this with the above by adding a dedicated method to detect support for a given operator, as per @bergos's suggestion.

I would also simplify @rubensworks 's matchExpression example by removing support for variables, which as far as I understand might be better dealt with in a higher-level spec, and simply passing the filter definition as if using a Term instance:

source.matchExpression(null, null, {
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
}, null);

Then, I would add something like:

source.supportsMatchExpressionOperator({
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});

which would return true or false.

EDITED: dropped variable() from the examples.

jacoscaz commented 4 years ago

@bergos @rubensworks @elf-pavlik would the following make everyone happy?

A matchExpression method, similar to match but adding support for Filter-type arguments.

source.matchExpression(null, null, {
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
}, null);

A supportsMatchExpression method that returns true or false according to whether the implementing Source instance can optimize a query at the persistence layer given Filter-type arguments.

source.supportsMatchExpression({
  operator: '<=',
  args: [
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});

The way I am looking at it, the two of these combined get us the following:

What do you all think?

EDITED: I forgot - to accurately represent what is going on, I would also foresee the development of a shared Transform stream, possibly using @RubenVerborgh's asynciterator, implementing all documented operators in-memory. This would allow developers to easily but explicitly add in-memory support for matchExpression queries should their Source implementations of choice not support an operator they need.
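As a minimal sketch of such a Transform (using a plain Node.js stream here rather than asynciterator; evaluateExpression is a hypothetical helper that applies the documented operators to a quad):

const { Transform } = require('stream');

const createExpressionFilter = (expression) => new Transform({
  objectMode: true,
  transform(quad, encoding, callback) {
    // forward only the quads for which the expression holds
    if (evaluateExpression(expression, quad)) {
      this.push(quad);
    }
    callback();
  }
});

// assuming the Source's stream is a Node.js readable, usage could be:
// source.match(null, null, null, null).pipe(createExpressionFilter(expression))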

rubensworks commented 4 years ago

supportsMatchExpression

@jacoscaz I very much like the supportsMatchExpression! It indeed allows implementors to only support certain operators. It feels very similar to Comunica's internal architecture for signaling operator support, so this would fit in perfectly on my end.

The only concern I have here is if we don't need the S/P/O/G parameters in here anymore. This is related to my second comment.

matchExpression

On the one hand, I like your more compact proposal. On the other hand, I think this may be too restrictive to handle certain use cases. For example, the following would be possible to represent in my previous proposal, but not in this new compact proposal:

source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
});

So I don't really agree with the following comment:

supports further extensions and complex terms (concern of @rubensworks) without going all the way towards SPARQL algebra (i.e. variable support), which could then be comfortably implemented at a higher-level.

As the example above shows, the compact proposal is not as expressive as the previous proposal, so the variable-based proposal cannot be implemented on top of the compact proposal.

As I see it, the more compact representation would be a very valuable more high-level representation that could be implemented internally using the previous (low-level) proposal.

Is my concern clear to you?

Transform

I would also foresee the development of a shared Transform stream

Perhaps we can discuss this in a separate issue/PR that would be a dependency of this PR?

jacoscaz commented 4 years ago

@rubensworks

supportsMatchExpression

It feels very similar to Comunica's internal architecture for signaling operator support, so this would fit in perfectly on my end.

Happy to hear this! Another place where we're starting to have consensus!

matchExpression

Is my concern clear to you?

Yes, and thank you for taking the time to exemplify your argument - I honestly appreciate it. I think I understand the extent of the limitation that my "compact proposal" brings. To sum it up, it does not allow multi-term filtering expressions within the same query.

That was intended on my part in order to try and keep the matchExpression method as simple to grasp as possible. I don't think I have ever seen support for multi-term expressions at the persistence/optimized level in JS-land, not even in https://github.com/levelgraph/levelgraph if I remember correctly, whereas I have seen it implemented in-memory multiple times (Comunica included). That's why I am looking at that feature as something that could be implemented in-memory by higher-level query engines and therefore something I'd be willing to sacrifice in favor of simplifying the API.

Complexity, I think, is one of the major pain points in RDF-land and I might very well be exceedingly biased against it.

All this said, I'd be happy to align with the less-concise version of matchExpression if you believe that multi-term filter expressions are likely to be easily implemented. Again, I have come to this world much later than you have.

Transform

Perhaps we can discuss this in a separate issue/PR that would be a dependency of this PR?

Yes, although I think that is a discussion that we should enter only if we generally agree on the idea and on its complementary nature to what we're discussing in here.

jacoscaz commented 4 years ago

Upon reflection, the proposed Transform stream should probably be a bit smarter. It should be able to delegate supported operations to the Source it is applied to while executing unsupported operations in-memory.

rubensworks commented 4 years ago

matchExpression

To sum it up, it does not allow multi-term filtering expressions within the same query.

Exactly.

I don't think I have ever seen support for multi-term expressions at the persistence/optimized level in JS-land

Indeed, I have not seen that yet either in JS.

All this said, I'd be happy to align with the less-concise version of matchExpression if you believe that multi-term filter expressions are likely to be easily implemented.

I agree with your assumption that multi-term filter expressions are not going to be supported often by sources. However, that does not mean that none of them will support them. IMO, the purpose of the RDFJS low-level APIs should be to enable implementations with a high level of (potential) expressivity, without making too many assumptions.

In the world of SPARQL query optimization, pushing down filters (in their most general form) to the storage layer is a very typical form of optimization. So I think we should remain as open as possible there.

I'm not sure if Apache Jena supports pushing down of filters in ARQ (I would assume they do). If so, perhaps we can have a look at their API to get some inspiration whether or not we are on the right track.

jacoscaz commented 4 years ago

@rubensworks

In the world of SPARQL query optimization, pushing down filters (in their most general form) to the storage layer is a very typical form of optimization. So I think we should remain as open as possible there.

I'm still afraid of what could be perceived as an overly-complex API (wrongfully, but that's often the case) but I appreciate your point and I'm happy to go along with your proposal.

Two open questions for me:

1) Shouldn't the supportsMatchExpression method accept the same arguments as the matchExpression method?

2) In case of multiple expressions, only one of which is not supported by an implementation of Source, how should we model the error so that a query-engine could rework the query to do the unsupported part in-memory?

rubensworks commented 4 years ago

I'm still afraid of what could be perceived as an overly-complex API

Let's let this rest a bit to hear what other people think about this.

Shouldn't the supportsMatchExpression method accept the same arguments as the matchExpression method?

I would say yes.

In case of multiple expressions, only one of which is not supported by an implementation of Source, how should we model the error so that a query-engine could rework the query to do the unsupported part in-memory?

Instead of throwing an error, supportsMatchExpression could also just return a boolean. In that case, a query engine could iteratively lower the complexity of expressions as follows (for example):

if (source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
})) {
  return source.matchExpression(variable('s'), null, variable('o'), null, {
    operator: '<=',
    args: [
      variable('o'),
      variable('s')
    ]
  });
} else if (source.supportsMatchExpression(variable('s'), null, null, null, {
  operator: '<=',
  args: [
    literal('0'), // Just a dummy literal
    variable('s')
  ]
})) {
  const objects = source.match(null, null, variable('o'), null);
  return objects.map(object => source.matchExpression(variable('s'), null, object, null, {
    operator: '<=',
    args: [
      object,
      variable('s')
    ]
  }));
} else {
  const all = source.match(null, null, null, null);
  // Use full in-memory filtering
}

jacoscaz commented 4 years ago

Instead of throwing an error, supportsMatchExpression could also just return a boolean. In that case, a query engine could iteratively lower the complexity of expressions as follows [...]

I guess this brings in another question. Let's say we have a composite query such as what follows:

source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})

This could fail in 3 ways:

  1. the 'AND' operator is not supported
  2. the 'REGEXP' operator is not supported
  3. the '>=' operator is supported but only for comparisons between a single term and a literal argument

I think supportsMatchExpression should give some sort of indication as to what part of the query is not supported. How else would a query engine (or anything else) be able to figure out what to do in-memory and what to delegate to the Source implementation?

elf-pavlik commented 4 years ago

Do we consider that expressions can have other expressions as arguments? In that case, I think support for each of those expressions may need to be checked separately first. As I understand @rubensworks's example, a query engine could simply have all the alternative querying paths ready and then just check support for them in a specific order. Reconstructing queries based on errors from support detection seems a bit more complex. @jacoscaz, do you have a code snippet that could showcase how you would imagine using those error messages?

jacoscaz commented 4 years ago

@elf-pavlik

Reconstructing queries based on errors from support detection seems a bit more complex.

Yes, although I believe it would still be less complex than working out exactly what is supported by the Source implementation based on a boolean result that only points to whether the implementation supports a query in its entirety.

I'll go back to my last example:

source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})

For this query, I've hypothesized three failure modes:

  1. the 'AND' operator is not supported
  2. the 'REGEXP' operator is not supported
  3. the '>=' operator is supported but only for comparisons between a single term and a literal argument

If I were constrained to using the supportsMatchExpression method and I were working with an implementation of Source that does not support multi-term expression but does support everything else, I would appreciate something like the following:

> console.log(source.supportsMatchExpression(/* ... */))
{
  supported: false,
  reasons: [
    { feature: 'MULTI_TERM_EXPRESSION', path: 'args[0].args[1]' }
  ]
}

The reasons array, which could perhaps be optional, would save me a ton of effort in figuring out how to work around what is missing.

elf-pavlik commented 4 years ago

The reasons array, which could perhaps be optional, would save me a ton of effort in figuring out how to work around what is missing.

Do you mean figuring it out as a developer trying queries manually, or would you want to write code that, based on those reasons, would try different queries?

jacoscaz commented 4 years ago

@elf-pavlik both, although a developer would presumably look at the documentation for the given Source implementation and easily figure it out. My concern is anything that might want to optimize queries across multiple implementations of Source, such as query engines.

elf-pavlik commented 4 years ago

I don't have experience with writing query engines. The simple example provided by @rubensworks, trying specific queries in order, looks quite approachable. I'd like to see a comparable snippet where the code tries to create a query based on the reason from a failing support check.

On the other hand, if the reason could stay optional, I don't see a big difference between returning a boolean vs. an object with a supported property. I think the difference may come from having the reason required or optional. In the latter case, a query engine that can adjust based on those reasons would still need to handle cases where it doesn't get a reason in the result of a support check.

jacoscaz commented 4 years ago

@elf-pavlik I was trying to write such a snippet and, in the process of doing so, figured I am probably wrong.

I was under the impression that @rubensworks's query might have been too easy to progressively simplify, as it uses a common operator (>=) and single-level expressions (no logical operators such as AND and OR). Nested and more complex expressions, I thought, would be much harder to deal with as one would not know where to start.

However, assuming simplification can always be done from the outside to the inside, peeling back each layer of the query, perhaps even the most complex of queries could simply be dissected into supported sub-queries joined by in-memory implementations dealing with the unsupported outer expressions.

Incidentally, I think this lays out a really elegant architecture for a recursive function that walks the expression tree and stops the recursion whenever it finds a supported expression.
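Very roughly, and reusing the supportsMatchExpression method and expression shape from above (a sketch, not a proposal for the spec text):

const splitExpression = (source, expression) => {
  // stop recursing as soon as the source supports the (sub)expression as a whole
  if (source.supportsMatchExpression(null, null, null, null, expression)) {
    return { delegated: expression };
  }
  // otherwise, keep the outer operator for in-memory evaluation and recurse into its args
  return {
    inMemory: expression.operator,
    args: expression.args.map(arg => arg.operator ? splitExpression(source, arg) : arg),
  };
};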

@rubensworks - seems like you already had this proposal in your head!

bergos commented 4 years ago

I still favor having a TermFilter interface with a test method.

You are mainly talking about the use case of a query engine with only known types of operations. The plain object structure would support custom operations, but it would require changing the code of the match methods. If that is not possible a fallback must be implemented. That might turn every usage of the interface into some kind of query engine. It will get very complex for use cases where different Store implementations with different supported operations are used. The mentioned Transform can also not solve the problem because it will only support a predefined set of operations and no custom filters.

The TermFilter interface would not only reduce the complexity for these use cases, but it would also fit better to the existing Term interface which has an equals method.

I thought it might be good to try the concept, so I wrote a quick proof of concept, which you can find here. Besides the names, the properties of the TermFilter objects are the same as in the discussion in the last comments, so there should be no drawback for the query engine use case. A factory is used to create the TermFilter instances, which makes the code even more readable than the plain objects. See the readme for more details.
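For those who don't want to click through, the core shape is roughly this (paraphrasing the PoC; see the readme for the details):

interface TermFilter {
  test(term: Term): boolean
}

// filters are created via a factory and used in term positions, e.g.:
store.matchFilter(null, null, filter.gt(rdf.literal('2')), null)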

jacoscaz commented 4 years ago

@bergos

First of all, thank you for your example.

You are mainly talking about the use case of a query engine with only known types of operations. The plain object structure would support custom operations, but it would require changing the code of the match methods.

True, although I think this would happen anyway in your _acceleratedMatchFilter function. At one point or another, optimizing support for a given filter at the persistence level depends on having custom code for that filter. At least as far as I understand.

If that is not possible a fallback must be implemented. That might turn every usage of the interface into some kind of query engine.

This is very true and part of the reason why I think we need different methods (i.e. we should not pollute match) and we might also need a shared implementation of one such query engine, unless we were to add a list of mandatory operators that implementors must support (such as what happens when a database aims to comply with a given standard). I prefer the former alternative as it allows implementors to only optimize for a subset of operators while also being able to quickly add non-optimized support for all of them.

The mentioned Transform can also not solve the problem because it will only support a predefined set of operations and no custom filters.

I am not sure I follow. As a developer who is thinking about how to write a query, I need a priori knowledge of which operators are supported by the database I am working with. As a developer of the database itself, I need to choose which operators to optimize for from a list of known operators.

I guess what I am trying to say is that custom filters are bound to be handled in-memory as databases cannot optimize for filters they are not aware of. If we are to support optimization, we need to produce a list of available operators / expressions for which implementors might choose to optimize.
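To make this concrete, I imagine the spec eventually shipping something like the following (purely illustrative, not a proposal for the actual list):

// operators implementors may choose to optimize for; everything else is a custom filter
const SPEC_OPERATORS = ['==', '!=', '<', '<=', '>', '>=', '&&', '||', '!', 'REGEXP'];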

All this said, I am strictly focusing on the stream spec and I think that what you have exemplified fits very well within the dataset spec. The latter, although still defined as a low-level spec, seems to me to be a little further from the metal in terms of abstraction. dataset seems to focus on working with quads that have already been retrieved from whatever source one is dealing with.

rubensworks commented 4 years ago

@jacoscaz

assuming simplification can always be done from the outside to the inside, peeling back each layer of the query. I think this lays out a really elegant architecture for a recursive function that walks the expression tree and stops the recursion whenever it finds a supported expression.

Yep, that's what I had in mind, you just explained it much better than I did 🙂.

@bergos @jacoscaz

I think we need different methods (i.e. we should not pollute match)

👍

I think that what you have exemplified fits very well within the dataset spec.

I agree with this. The TermFilter is a useful abstraction, but IMO fits better on a higher level, such as the Dataset spec. By letting sources execute optimized filters, and requiring consumers to perform non-optimized in-memory filters themselves, this would lead to a cleaner separation of concerns IMO.

bergos commented 4 years ago

@jacoscaz @rubensworks Can you please explain where there are limitations in my proposal that make it unusable for your use case?

fits better on a higher level, such as the Dataset spec.

Even if you would like to implement only accelerated filters in Source, I don't think it's a good idea to define two different interfaces for it. We should keep in mind the need to keep the APIs consistent.

jacoscaz commented 4 years ago

@bergos I don't see them as limitations but more as bad ergonomics.

  1. As filters are created per-term, one issue I would struggle with is how to easily represent multi-term expressions, which I think, and I've come to agree with @rubensworks on this, must be supported at the lowest possible level of the RDF/JS spec (i.e. the one closest to the metal).

  2. Your definition of TermFilter and the way I would find myself using it almost makes it look as if, no matter what filter you use, there will never be any difference in terms of performance. Correct me if I am wrong but this, to me, seems intentional given what you have written in your example:

matchFilter (subject, predicate, object, graph) {
    if (this.isAcceleratedMatchFilter(subject, predicate, object, graph)) {
      return this._acceleratedMatchFilter(subject, predicate, object, graph)
    }
    // ... (otherwise falls back to non-accelerated, .test-based filtering)
}

Considering the low-level, close-to-the-metal nature of the Source spec, I do not think this is the right approach.

There should be an explicit difference and a clear separation of concerns between what can be handled at the persistence level and what can be handled in-memory. In my mind, the goal of the Source spec is to correctly convey what can be done with a given implementation of such an interface, including which operators are supported so that I can write my queries accordingly (or structure my indexes accordingly, which quadstore will soon allow you to do).

I look at Source as something equivalent to the SPARQL standard or the SQL standard, where I am always able to find a clear list of supported expressions / operators, with implementors of those standards often warning that they individually support only a subset of those operators.

That said, I find TermFilter to be very useful at the Dataset level, where I expect developers not to care as much about the performance aspects of what they are writing.

bergos commented 4 years ago

I would struggle with is how to easily represent multi-term expressions

Do you mean comparing two terms of the same quad? It's even much easier than in your proposal:

store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))

Your definition of TermFilter and the way I would find myself using it almost makes it look as if, no matter what filter you use, there will never be any difference in terms of performance.

No, you can ask beforehand with isAcceleratedMatchFilter whether it would run accelerated. If it doesn't run accelerated, it's the same as .match.
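Sketched usage (names as in the PoC):

const args = [rdf.variable('s'), null, filter.gt(rdf.variable('s')), null];
// the call is the same either way; the check only tells you beforehand whether it
// will run accelerated or fall back to .test-based filtering, just like .match
if (store.isAcceleratedMatchFilter(...args)) {
  // e.g. a query planner could prefer this access path
}
const quads = store.matchFilter(...args);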

There should be an explicit difference and a clear separation of concerns between what can be handled at the persistence level and what can be handled in-memory.

That was never the intention of the Source interface. For example, you could also use the Source interface to generate quads based on an algorithm and process them in a streaming way, so you don't have to keep all of the quads in memory at the same time.

In my mind, the goal of the Source spec is to correctly convey what can be done with a given implementation of such an interface, including which operators are supported so that I can write my queries accordingly (or structure my indexes accordingly, which quadstore will soon allow you to do).

There is already the .match method, which means my proposal is actually closer to the current definition of the Source interface. With the isAcceleratedMatchFilter method, it should also fulfill your requirements.

I really would like to have more advanced filtering in the Source interface, but your proposal doesn't cover the full scope of the Source interface and I'm strongly against limiting the interface to a reduced set of use cases. If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.

rubensworks commented 4 years ago

@bergos

Can you please explain where there are limitations in my proposal that make it unusable for your use case?

I agree with @jacoscaz. I don't see any limitations with your proposal. I consider both proposals to have a similar level of expressivity.

your proposal doesn't cover the full scope of the Source interface and I'm strongly against limiting the interface to a reduced set of use cases.

How so? AFAICS they are both as expressive. Unless I'm missing something?

store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))

I definitely agree that this (proposal 1) is easier to write than something like this (proposal 2):

source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
});

However, I think we're comparing different levels of abstraction here. We could easily introduce a similar helper tool that produces the expression from proposal 2, e.g.: filter.lte(variable('o'), variable('s')). So I would not focus too much on these helper methods, but on the format they should produce, and on what should be passed to matchExpression.

As I see it, the main (remaining) difference between proposal 1 and 2 is on how they handle expressions that are not supported by the Source. While proposal 1 incorporates a filter argument that can be used as a fallback by the Source, proposal 2 requires the Source to do nothing, and the consumer to do the filtering itself.

Am I correct in my analysis of these differences? If so, then it seems to me like proposal 1 could easily be built on top of proposal 2, right? My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange to not do the same)

An important advantage of using proposal 2 as a basis I just thought of is the fact that it can easily cross programming language borders, due to it being purely declarative, and is not using any JS functions. This would be extremely important when running within Web browsers and you want to delegate certain filtering to a WASM source, or when running in Node.js and you need to communicate with C/C++ over NAN.
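To illustrate: since both the terms and the expression are plain data in this proposal, an expression can cross such a boundary with nothing more than JSON (a sketch, using the RDF/JS term representation):

// the expression is fully declarative, so it serializes as-is
const serialized = JSON.stringify({
  operator: '<=',
  args: [
    { termType: 'Variable', value: 'o' },
    { termType: 'Literal', value: '2',
      datatype: { termType: 'NamedNode', value: 'http://www.w3.org/2001/XMLSchema#decimal' } }
  ]
});
// a WASM or native module only needs to parse this JSON; no JS functions are involved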

If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.

Don't worry, we'll get there 🙂

jacoscaz commented 4 years ago

@bergos

Do you mean comparing two terms of the same quad? It's even much easier than in your proposal: store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))

I personally do not find what you suggest any easier to grasp than having the expression represented separately from the terms. This is probably a simple matter of taste and/or habit, though. In terms of expressive power, I cannot see any difference between the two.

If you think you can't accept a compromise and insist on your approach, I would propose to do it outside the stream-spec.

That is not a direction I would like to take. I appreciate what RDF/JS is trying to achieve and I am trying to contribute to the same goal. I might argue against or towards specific solutions, even passionately at times, but that's because I honestly believe they represent superior compromises. I hope I have not come across as exceedingly or needlessly confrontational; if so, that was not my intention and I apologize.

I fundamentally agree with the following points @rubensworks has just made:

While proposal 1 incorporates a filter argument that can be used as a fallback by the Source, proposal 2 requires the Source to do nothing, and the consumer to do the filtering itself.

My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange to not do the same)

bergos commented 4 years ago

I hope I have not come across as exceedingly or needlessly confrontational; if so, that was not my intention and I apologize.

Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives and nobody takes the time to have a look. It seems to me that some people are only interested to make the interface as close as possible to their existing code and ignore other cases.

First, one important fact: Everyone has to implement the match method, no matter if it runs accelerated or not. That's already part of the current spec. Adding support for TermFilter would be very easy and I showed that already in my PoC.

My gut feeling says that the declarative (and lower-level) nature of proposal 2 would be a good basis to build upon. (Btw, this declarative structure is also followed by query engines in other languages using e.g. SPARQL algebra, so it would feel strange to not do the same)

The TermFilter proposal contains all properties of the other proposal; the only difference is an additional method. That means you have your declarative structure.

And again you are focusing on query engines. There are many use cases where a simple match is enough; still, the TermFilter proposal would contain everything required for your query engine use case.

An important advantage of using proposal 2 as a basis I just thought of is the fact that it can easily cross programming language borders, due to it being purely declarative, and is not using any JS functions. This would be extremely important when running within Web browsers and you want to delegate certain filtering to a WASM source, or when running in Node.js and you need to communicate with C/C++ over NAN.

You have all the properties for code that runs accelerated. Anyway, you have to translate the Quads again in JS to have full-featured RDF/JS Quads. The RDF/JS spec has been written for JavaScript and it was clearly stated that the spec focuses on JS only. I agree that we may align it a little bit with WASM use cases if possible, but it doesn't make any sense to give up a consistent API for a filter when all the Quads don't have a plain object structure. If there is any benefit to a plain object, it doesn't matter compared to the Quads. I haven't done much WASM stuff yet, but anyway I guess it's a common pattern to convert plain objects to full-featured classes in both ways.

jacoscaz commented 4 years ago

Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives and nobody takes the time to have a look. It seems to me that some people are only interested to make the interface as close as possible to their existing code and ignore other cases.

In the interest of clarity and transparency, I did look at your example and I am not trying to make the interface as close as possible to my code as I don't even have a well-established codebase to refer to.

Everyone has to implement the match method, no matter if it runs accelerated or not.

Agreed and I think we are all in agreement about this.

Adding support for TermFilter would be very easy and I showed that already in my PoC.

I agree that this is easy, although it still doesn't seem practical or ergonomic to me. I have already explained why and I don't think it would help if I were to repeat myself. I understand we do not agree on this and I am happy to come to your side if this means pushing this matter forward. The expressive power of the two solutions is still the same AFAIU.

In the spirit of building consensus towards a solution we can all agree with, a few questions:

  1. would you agree with the fact that, if we were to proceed with TermFilter, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?
  2. would you agree with the fact that a shared implementation of the in-memory test() fallback for most common filters would be a nice addition to the ecosystem?

elf-pavlik commented 4 years ago

I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches. Based on that, we may have better ground to evaluate the pros and cons of those alternative approaches. I think any of you three could propose the first (simplest) example, and then the next person would add another, slightly more complex one, and so on.

jacoscaz commented 4 years ago

@elf-pavlik that's a worthy exercise, I think. Here's a very simple example:

source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '>=',
  args: [
    variable('o'), 
    literal('2020-01-01', 'xsd:dateTime'),
  ],
});
source.matchFilter(
  null, 
  namedNode('lastSeen'), 
  filters.gt(literal('2020-01-01', 'xsd:dateTime')), 
  null,
);

At this level of complexity, I think the two alternatives are almost 100% equivalent in terms of ergonomics and practicality. Actually, at this level of complexity I think @bergos's form is better as it is more consistent with the rest of the spec, although I would still maintain that the gt filter would have to be an implementation of a filter defined within the spec itself.

rubensworks commented 4 years ago

@bergos

Sorry, but I'm a bit tired of repeating stuff again and again, spending time to show alternatives and nobody takes the time to have a look

I'm sorry you feel that way. I did look at your implementation, but I do not agree with certain parts of it, which is why we are having this discussion.

Adding support for TermFilter would be very easy and I showed that already in my PoC.

Easy from whose perspective? I have already shown that it's not easy at all in some cases, such as the NAN/WASM cases, and handling unwanted filter support.

The TermFilter proposal contains all properties of the other proposal

Almost, there's still the difference in how the expressions are represented. Your proposal incorporates this into terms, which is less expressive than the other proposal, as @jacoscaz and I had discussed earlier.

the only difference is an additional method. That means you have your declarative structure.

Indeed, and that's the point we disagree on. This additional method voids the declarative structure IMO, and complicates things significantly.

The equals() method on quads and terms is related to this, which IMHO does more harm than good. Personally, I ran into more problems related to it (WASM, NAN, ...), than cases where I actually was able to use it. As such, I strongly disagree on adding such a method here as well.

I'm of the opinion that this test method can easily be moved to a higher level, but it should IMO not become part of the low-level spec.

Another major problem with this test method I just realized is that it requires consumers to implement these test methods themselves, even if the consumer does not want to support certain types of filters. By remaining fully declarative, the Source would fail upon encountering an unsupported filter, and the consumer could easily detect this failure, and handle the error accordingly.

it doesn't make any sense to give up a consistent API for a filter when all the Quads don't have a plain object structure.

The difference with the equals() method of quads and terms is that this can be hooked in via prototypes, as there is only a small number of possible equals() implementations. For filters on the other hand, the number of possible implementations is infinite, which makes deserialization from plain objects much trickier. A fully declarative approach would not have this problem.

@jacoscaz

  1. would you agree with the fact that, if we were to proceed with TermFilter, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?

The list of operators seems like a must to me indeed.

  2. would you agree with the fact that a shared implementation of the in-memory test() fallback for most common filters would be a nice addition to the ecosystem?

Yes, I agree that this would be nice to have. However, it should remain optional, and should not be part of the low-level spec IMO.

@elf-pavlik

I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches.

That may indeed be a helpful exercise :-)

Let me continue upon @jacoscaz's snippet:

source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '&&',
  args: [
    {
      type: '>=',
      args: [
        variable('o'), 
        literal('2020-01-01', 'xsd:dateTime'),
      ],
    },
    {
      type: '<=',
      args: [
        variable('o'), 
        literal('2021-01-01', 'xsd:dateTime'),
      ],
    }
  ],
});
source.matchFilter(
  null, 
  namedNode('lastSeen'), 
  filters.and(
    filters.gte(literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(literal('2021-01-01', 'xsd:dateTime'))
  ), 
  null,
);

This is still doable in both approaches.

Note that I don't consider the first option to be more complex than the second. One could easily add helper functions such as @bergos's filters to do something like this (modulo test functions):

source.matchExpression(null, namedNode('lastSeen'), variable('o'), null,
  filters.and(
    filters.gte(variable('o'), literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(variable('o'), literal('2021-01-01', 'xsd:dateTime'))
  )
);

Perhaps we can try an expression with two variables as next use case?

bergos commented 4 years ago

would you agree with the fact that, if we were to proceed with TermFilter, we would still have to expand the spec with the set of supported operators, the set from which implementors might choose which operators to optimize for?

From my readme:

Besides the TermFilter interface, a set of filters should be included in the specification. The filters.js file contains a set of filters that may be good candidates to guarantee interoperability for accelerated execution.

would you agree with the fact that a shared implementation of the in-memory test() fallback for most common filters would be a nice addition to the ecosystem?

Maybe.

I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches.

That doesn't matter much, because helper functions could be used so they look the same in the end. It's more important what happens after the method call. That should be compared with cases where only the match method is used and with cases where the new filtering is used. My repository already contains some examples, including one with a match method call. I think somebody supporting the other proposal should translate the examples to that proposal.

I'm sorry you feel that way.

Let me quote myself:

I'm a bit tired of repeating stuff again and again

and

Everyone has to implement the match method, no matter if it runs accelerated or not. That's already part of the current spec. Adding support for TermFilter would be very easy and I showed that already in my PoC.

and

The RDF/JS spec has been written for JavaScript and it was clearly stated that the spec focuses on JS only.

The difference with the equals() method of quads and terms is that this can be hooked in via prototypes, as there is only a small number of possible equals() implementations. For filters on the other hand, the number of possible implementations is infinite, which makes deserialization from plain objects much trickier. A fully declarative approach would not have this problem.

That's comparing apples with pears. With the TermFilter proposal you would only forward the known filters to the WASM code, and that means there is no need to know the logic of the test method. Unknown filters could only be handled in JS unless you transpile them. With your proposal you would forward only known filters. I don't see any difference.

elf-pavlik commented 4 years ago

Perhaps we can try an expression with two variables as next use case?

In one of the earlier comments I see one example; I've changed the operator to make comparing with the IRI of the subject more realistic:

source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('s')),
  null
)
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '==',
  args: [
    variable('o'),
    variable('s')
  ]
});

Looks like something that finds statements where a node has some relation with itself. Could someone provide a more interesting example with two variables? @jacoscaz, in this example you provided some time ago, I don't see how >= is supposed to work on the subject, which needs to be an IRI:

source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})

jacoscaz commented 4 years ago

@elf-pavlik

in this example you provided some time ago, I don't see how >= is supposed to work on the subject, which needs to be an IRI

WRT the IRI issue, I am under the impression that RDF also supports "reverse" predicates, something like:

"jacopo"^^it ex:firstNameOf http://jacoscaz

If this is the case, then subjects can be IRIs, blank nodes, literals, ... I am 100% sure I have used at least one store supporting reverse predicates but that might have been an exception to the rule.

WRT the >= applied to IRIs issue, I guess the question boils down to whether we intend to provide support (i.e. have an operator) for lexicographical order comparison. I've seen it used a few times, but only to optimize queries for which no better operators were supported. For example, a query based on the regexp /^hello[0-9]{2}/ could be reimplemented as a query targeting terms >= "hello" AND <= "hello<boundary_char>", filtering out the rest in-memory.
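In the declarative form we have been discussing, that rewrite might look like this (a sketch; the boundary character is illustrative):

// persistence layer: a cheap lexicographical range scan
source.matchExpression(null, null, variable('o'), null, {
  operator: '&&',
  args: [
    { operator: '>=', args: [ variable('o'), literal('hello') ] },
    { operator: '<=', args: [ variable('o'), literal('hello\uffff') ] },
  ],
});
// then apply /^hello[0-9]{2}/ in-memory to the (much smaller) resulting stream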

@bergos

Besides the TermFilter interface, a set of filters should be included in the specification.

I had missed this as I was trying to quickly locate the bits of code I wanted to think about. Looks like we all agree on this point - good!

That doesn't matter much, because helper functions could be used so they look the same in the end. It's more important what happens after the method call.

I disagree. This matters to me, although I respect the fact that it does not matter to you. I will build upon @elf-pavlik's example:

source.matchExpression(variable('s'), null, variable('o'), variable('g'), {
  operator: '&&',
  args: [
    { operator: '==', args: [ variable('o'), variable('s') ] },
    { operator: '==', args: [ variable('g'), namedNode('http://some/graph') ] },
  ]
});
source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('s')),
  filter.eq(namedNode('http://some/graph'))
)

With the first, more declarative style, the entirety of the expression is contained within the fifth parameter. This makes it a lot easier for me to read (and this is subjective) but also to parse into an expression tree.

With the second style, I find it harder to understand what the combined effect of all the filters is. Parsing them also seems a bit harder.

Generally speaking, this is an example of what I refer to when I use the term ergonomics. However, the two could be combined in the following:

source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));

I find this rather appealing, and this form would also lend itself better to building the code that would concatenate the various .test() fallbacks.
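For instance, a combined filter's fallback could be derived mechanically from its children's (a sketch):

const and = (...filters) => ({
  operator: '&&',
  args: filters,
  // the fallback of the conjunction is simply the conjunction of the fallbacks
  test: term => filters.every(filter => filter.test(term)),
});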

With your proposal you would forward only known filters. I don't see any difference.

I think the main point of contention is whether the match() method itself (or a matchFilter method) would be allowed to return an error in the case of an unsupported operator or whether it would have to use the .test() method fallback because that would be mandated by the spec.

Given an implementation of Source, I (as a developer) need to carefully write my queries around what I know to be the supported operators. If we agree on adding operators to the spec and leaving Source implementors free to specify which ones they support, then I believe that throwing errors would be a vastly more ergonomic strategy.

EDITED: fixed a leftover/redundant/bad use of variable in one of the examples.

elf-pavlik commented 4 years ago

If this is the case, then subjects can be IRIs, blank nodes, literal, ... I am 100% sure I have used at least one store supporting reverse predicates but that might have been an exception to the rule.

https://www.w3.org/TR/rdf11-concepts/#section-triples clearly states:

  • the subject, which is an IRI or a blank node
  • the predicate, which is an IRI
  • the object, which is an IRI, a literal or a blank node

One could use tricks like

ex:jacosName
  a rdfs:Literal;
  rdf:value "jacopo"@it .

ex:jacosName ex:firstNameOf <http://jacoscaz> .

And sort of get a literal in the subject position, but I really don't see it used there directly.

elf-pavlik commented 4 years ago

@jacoscaz could you please use language tags in your snippets for syntax highlighting? I took the liberty of editing some of your recent comments: s/```/```js/g

jacoscaz commented 4 years ago

https://www.w3.org/TR/rdf11-concepts/#section-triples clearly states:

I'm really happy to stand corrected! This wonderfully simplifies a number of things on my end. And yes, I will be more careful with using language tags. Thank you for fixing my posts.

bergos commented 4 years ago

I'm for closing this PR, because most comments discuss an approach that goes against the core idea of the RDF/JS specs: defining JavaScript-idiomatic APIs. If people would like to discuss the basic concepts of the RDF/JS spec, this PR is not the best place.

Other PRs to add more advanced filtering are very welcome, but use cases covering the full scope of the Source interface should be collected first in an issue.

rubensworks commented 4 years ago

use cases covering the full scope of the Source interface should be collected first in an issue.

Good point, I've created #17 for this. Shall we also move the snippets started by @elf-pavlik to this issue, or shall we create a separate one for it?

I'm for closing this PR

I would actually keep this PR open, and update it with the parts that we all agree on already. Because I think we're actually more aligned than it seems. We can discuss the remaining details after that.

jacoscaz commented 4 years ago

@bergos and all

Before closing this PR, I would kindly ask you to consider what follows.

In my previous comment I have tried to reconcile the two approaches by proposing the following:

source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));

In my mind, this gives us:

I would also add:

Could this make for a good, if perhaps incomplete, compromise?

jacoscaz commented 4 years ago

@bergos @rubensworks if you have time to do so I'd appreciate your feedback on the above. Then, if everyone agrees, I'll probably close this PR and open a new one starting from what we find ourselves in agreement about and the remaining open questions.

rubensworks commented 4 years ago

@jacoscaz I agree with the above, except for the following:

support for a fallback method Filter#test() as per your concern

I'm still of the opinion that this should become part of a different spec, or at least should become optional.

there could, and should in my opinion, be a case to be made for a shared implementation of the code that would be needed to concatenate the various .test() fallbacks in the case of expression with a mix of supported and unsupported Filter instances.

Same as above.

(Apologies for the delayed answer)

bergos commented 4 years ago

As I stated in the issue about removing .equals, the topic of plain objects vs. objects with methods is not as easy as it may look at first glance and should therefore not be discussed in this PR. That's why I'm for the .test method. I also understand the use cases where only optimized filters are handled, but it's not consistent with the existing Source interface and the .match method. I would propose to define a new interface just for that use case. Maybe it can be called OptimizedSource, so that the name covers the use case. The extension of the Source interface, the new OptimizedSource, and the interface of the filters should be discussed together and, in the end, one PR should be made for all three interfaces to avoid incompatibilities.

jacoscaz commented 4 years ago

Hi all. Coming back to this after a few busy weeks. The more I read and think about it, the more I agree with @bergos's proposal of having different interfaces.

I'm going to quote a few paragraphs from the other discussion:

Using an index to solve a triple pattern is not a query engine.

I do not agree with this limitation to the definition of a query engine. Selecting which index is better suited for a given query definitely falls within the scope of a query engine.

It should be possible to define custom filters.

I agree with the fact that custom filters should be supported but, due to their very nature, they cannot be optimized for at the persistence level.

Our use case would be on the optimization of query engines by pushing down filters within the query plan to the storage level.

This necessarily requires a descriptive approach to representing filters in such a way as to be easily translated across languages. Not doing so would greatly limit the number of viable persistence solutions.

Based on all this, it looks to me like custom filters and persistence-level optimization cannot be reconciled into a single interface. However, it also looks to me like they do fit within a hierarchy:

Both the base filtering source and the advanced filtering source can be designed to support idiomatic filters and the serialization of these filters into declarative data structures. Source implementors dealing with persistence-level optimization would only be able to optimize up to the base filtering source level.
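In interface terms, I am picturing something along these lines (naming entirely up for debate):

interface BaseFilteringSource extends Source {
  // declarative, single-term filters only; candidates for persistence-level optimization
}

interface AdvancedFilteringSource extends BaseFilteringSource {
  // multi-term expressions and custom filters, potentially evaluated in-memory
}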

Filter implementations would also have to be split across two levels:

What do you think @bergos and @rubensworks ? Also @elf-pavlik ? Could this be a viable approach?