General query engine interface

rubensworks commented 3 years ago

After some internal discussions with @gsvarovsky and @jacoscaz, we identified the need to come up with a base query engine interface for RDF/JS (for declarative queries).

In essence, it should expose an interface that allows you to do operations such as const resultStream = await engine.query('some query');

The goal of this issue is to collect input on what already exists, so we can identify what the requirements are for such an interface.

Projects I contribute to that would benefit from this interface:

The Comunica query engine. Relevant types are defined here: https://github.com/comunica/comunica/blob/master/packages/types/index.ts
rdf-test-suite.js, which can execute specification tests on query engines. Relevant interface is here: https://github.com/rubensworks/rdf-test-suite.js/blob/master/lib/testcase/sparql/IQueryEngine.ts
SPARQLAlgebra.js, which defines typings for SPARQL algebra
fetch-sparql-endpoint.js, exposes access to a SPARQL endpoint

Big open question for me is how close the relationship to SPARQL should be. (We could for example start off with defining it in terms of SPARQL, but leave room for other query languages)

tpluscode commented 3 years ago

https://github.com/zazuko/sparql-http-client (types) https://github.com/Callidon/sparql-engine

RubenVerborgh commented 3 years ago

(We could for example start off with defining it in terms of SPARQL, but leave room for other query languages)

Optional second argument, defaulting to { language: 'SPARQL', version: '1.1', extensions: [] }?
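
A minimal sketch of how such a default could be applied. The `QueryFormat` type and the `resolveFormat` helper are hypothetical names used only for illustration; they do not come from an existing library.

```typescript
// Hypothetical option type for the optional second argument.
interface QueryFormat {
  language: string;
  version: string;
  extensions: string[];
}

// Defaults taken from the suggestion above.
const DEFAULT_FORMAT: QueryFormat = { language: 'SPARQL', version: '1.1', extensions: [] };

// Fill in any fields the caller left out.
function resolveFormat(format: Partial<QueryFormat> = {}): QueryFormat {
  return { ...DEFAULT_FORMAT, ...format };
}

const implicit = resolveFormat();                                       // all defaults
const explicit = resolveFormat({ extensions: ['example-extension'] }); // language and version fall back
```

Callers that only ever use SPARQL 1.1 would never need to pass the second argument in this setup.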

gsvarovsky commented 3 years ago

I would want to implement this interface with m-ld's Javascript engine, so that it can operate in environments that use SPARQL queries directly.

I have implemented my own interface, which looks like this: https://github.com/m-ld/m-ld-js/blob/edge/src/rdfjs-support.ts

Note the dependency on the sparqlalgebrajs types – I would prefer not to have to pass strings, so there may be a need to have an interface package extracted from sparqlalgebrajs.

I would also prefer that the interface allowed a store to be quite explicit about which queries it supports (e.g. Construct but not Describe).

jacoscaz commented 3 years ago

I maintain quadstore, a persistent RDF store with SPARQL capabilities via Comunica, and I would happily implement the proposed base query engine interface.

tpluscode commented 3 years ago

Does it have to be that complicated? I've been working with @bergos' sparql-http-client and I think it's just about right. It comes in two forms, both of which have methods select/construct/ask/update but differ in the return types of select and construct:

declare module 'sparql-http-client/StreamClient' {
  class StreamClient {
    query: {
      select(query: string): Promise<EventEmitter>
      construct(query: string): Promise<import('rdf-js').Stream>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>
    }
  }
}

declare module 'sparql-http-client/ParsingClient' {
  class ParsingClient {
    query: {
      select(query: string): Promise<Array<Record<string, import('rdf-js').Term>>>
      construct(query: string): Promise<import('rdf-js').Dataset>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>
    }
  }
}

The only change that I would make above is for the StreamClient not to return promises

class StreamClient {
  query: {
-    select(query: string): Promise<EventEmitter>
-    construct(query: string): Promise<import('rdf-js').Stream>
+    select(query: string): EventEmitter
+    construct(query: string): import('rdf-js').Stream
  }
}

jacoscaz commented 3 years ago

both of which have methods select/construct/ask/update

I think I see the point you're making but doing this would imply leaking aspects of the query language into the query interface. It would be equivalent to having separate methods for SELECT, INSERT, DELETE and UPDATE queries in SQL-oriented database drivers ~~and ORMs~~. IMHO, this would make it much harder to deal with use cases in which the nature of a query is not known ahead of time and also introduce too strong a coupling between SPARQL and the interface itself.

tpluscode commented 3 years ago

Database drivers (SDK?) do very much have a similar distinction and there's nothing wrong with that. Think query/execute/executeScalar. I will not even comment on ORMs because that is the worst comparison ever. You might have a look at micro ORMs like dapper, which are much "closer to the metal" to mitigate the impedance mismatch issues plaguing ORMs.

use cases in which the nature of a query is not known ahead of time

I'm curious about this statement. In what scenarios is the desired kind of result (tabular, graph or boolean) not known?

and also introduce too strong a coupling between SPARQL and the interface itself.

What is the nature of this coupling? Arguably, the whole RDF stack is built on uniformity and standards. The RDF graph is the same graph in every software component. SPARQL, being a similarly important core standard, can act the same. Otherwise I read your comment as an invitation to build (IMO unnecessary) abstractions.

rubensworks commented 3 years ago

I've created a new issue to follow up on the discussion of query methods: https://github.com/rdfjs/query-spec/issues/6 (So that we can keep this issue here focussed on collecting existing approaches)

tpluscode commented 3 years ago

I might actually also mention my lib @tpluscode/sparql-builder. Right now it relies on sparql-http-client for execution, but a standard interface would be nice, especially if we could get an in-memory query engine.

rubensworks commented 3 years ago

Ah yes indeed, libs that depend on engines are also relevant to include here.

In that respect, the following libs may also be relevant:

https://github.com/LDflex/LDflex (currently hardcoded on Comunica's interface)
https://github.com/rubensworks/graphql-ld.js (currently hardcoded on Comunica's interface)

ericprud commented 3 years ago

I'd suggest starting with the SPARQL Algebra (but be willing to depart from it as use cases indicate). You can do a lot with it (like all of SPARQL) but it is simpler than SPARQL. For instance, the idiosyncrasies of a SolutionModifier's GROUP BY and HAVING reuse aggregation and filter. You may also opt to lop off large parts of it, but having a set of composable operations should be familiar to programmers.

gsvarovsky commented 3 years ago

@tpluscode https://github.com/rdfjs/query-spec/issues/5#issuecomment-931370498

not to return promises

👍

rubensworks commented 3 years ago

not to return promises

That would actually depend on the outcome of #6. Because if we only expose a single method there, then the return type would vary based on the query. Since query type and return type may be determined async, we may require promises.
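
To make the point concrete, here is a rough sketch of a single `query` method whose result is a discriminated union. The union members, the `detectForm` helper and the keyword-based detection are all assumptions made for this example, not part of any existing engine.

```typescript
// Hypothetical result union; field names are illustrative, not from a published spec.
type QueryResult =
  | { type: 'bindings'; rows: Array<Record<string, string>> }
  | { type: 'quads'; quads: string[] }
  | { type: 'boolean'; value: boolean }
  | { type: 'void' };

// Toy form detection by leading keyword. A real engine would parse the query
// (and handle PREFIX/BASE prologues), which may itself be asynchronous.
function detectForm(query: string): QueryResult['type'] {
  const keyword = (query.trim().split(/\s+/)[0] || '').toUpperCase();
  switch (keyword) {
    case 'SELECT': return 'bindings';
    case 'CONSTRUCT':
    case 'DESCRIBE': return 'quads';
    case 'ASK': return 'boolean';
    default: return 'void'; // treated as an update in this sketch
  }
}

// Single entry point: callers narrow on `type` after awaiting the result.
// The result payloads are empty placeholders.
async function query(q: string): Promise<QueryResult> {
  const form = detectForm(q);
  switch (form) {
    case 'bindings': return { type: 'bindings', rows: [] };
    case 'quads': return { type: 'quads', quads: [] };
    case 'boolean': return { type: 'boolean', value: false };
    default: return { type: 'void' };
  }
}
```

Because the caller cannot know which member it gets until the query has been inspected, the promise wraps the whole union rather than each individual stream.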

rubensworks commented 3 years ago

I've had a look at all the suggested libraries, and I've tried to create an overview of the aspects that I feel may require some standardization. If I missed any, please let me know!

Once we agree upon a list of aspects, we can branch off into separate issues to see how we want to tackle the specifics of each one.

1. Query method interface

How to pass a query to a library, and obtain results.

Discussion in #6.

Single method

All query forms are handled via a single method, possibly via method overloading or union types.

Example:

      query(query: string): Promise<SomeUnionType>

Implemented by:

M-ld
Quadstore
Comunica
rdf-test-suite.js

Form-based methods

Each query form has its own dedicated method.

Example:

      select(query: string): Promise<Array<Record<string, import('rdf-js').Term>>>
      construct(query: string): Promise<import('rdf-js').Dataset>
      ask(query: string): Promise<boolean>
      update(query: string): Promise<void>

Implemented by:

sparql-http-client
fetch-sparql-endpoint.js

Other

The following libraries follow other query interfaces, which seem to be use-case-specific and may not benefit that much from standardization:

sparql-engine
LDflex
GraphQL-LD
@tpluscode/sparql-builder

2. Representing bindings

How to represent the results of tabular queries such as SELECT.

JSON-based

Example of a single bindings object:

{
  '?varA': namedNode('ex:a'),
  '?varB': namedNode('ex:b'),
}

Implemented by:

M-ld
rdf-test-suite.js
sparql-http-client
sparql-engine
@tpluscode/sparql-builder
fetch-sparql-endpoint.js

Object-based

A custom data structure that exposes methods and allows bindings to be stored internally in a different manner.

Example of a single bindings object:

const bindings = ...
const term: RDF.Term = bindings.get('?a');

Implemented by:

Quadstore
Comunica

3. Exposing metadata

Both at query level and at source level, it may be beneficial to expose metadata such as cardinality (estimates). Such information may be useful for query optimization.

Dedicated method for obtaining metadata

interface CountableSource {
  countQuads(): Promise<number> | number
}

Implemented by:

Comunica
M-ld

Generic object that provides metadata

const results = engine.query(...);
const metadata = await results.metadata();
console.log(metadata.cardinality);

Implemented by:

Comunica

4. Serializing results

A method to serialize query results to a standard format, such as SPARQL JSON results. Related to this, methods may be added that expose the available formats.

  resultToString: (queryResult: ..., format?: string) => Promise<Stream<string>>;

Implemented by:

Comunica

5. Defining sources

Some engines allow query sources to vary per query execution, and therefore enable passing it as an additional argument.

  query: (query: string, context: { sources: IDataSource[] }) => Promise<IQueryResult>;

export type IDataSource = string | RDF.Source | {
  type?: string;
  value: string | RDF.Source;
  context?: ActionContext;
};

Implemented by:

Comunica
rdf-test-suite.js

6. Passing query as algebra

Instead of passing a query string to an engine, a (pre-optimized?) algebra object may be passed. Related to this, methods for parsing a query string to algebra may also be valuable to standardize.

Example:

export interface QueryableRdf<Q extends BaseQuad = Quad> {
  query(query: Algebra.Construct): Stream<Q>;
  query(query: Algebra.Describe): Stream<Q>;
  query(query: Algebra.Project): BaseStream<Binding>;
}

Implemented by:

M-ld
Quadstore
Comunica

7. Defining query syntax format

If engines support different query syntaxes, they typically allow this to be customized via an optional argument.

Example:

  query: (query: string, context: { queryFormat: string }) => Promise<IQueryResult>;

Implemented by:

Comunica

jacoscaz commented 3 years ago

@rubensworks thank you for this list, very useful. I think as long as we keep our comments short, we might be able to discuss all of these points in this thread without branching into separate issues, which makes it a lot harder to keep track of the general picture IMHO. Of course, we will need to branch out for any point that sparks significant discussion.

My preferences..

1. Query method interface

Discussion in #6. My preference goes for single method + return type metadata.

2. Representing bindings

My preference goes for JSON-based representation (simple objects).

3. Exposing metadata

My preference goes for a generic object that provides metadata. I find that this approach leads to easier and better optimization in terms of sharing computation between metadata and query results. Worth mentioning that this is starting to have significant overlap with the current FilterableSource spec contained in this repo.

4. Serializing results

I would prefer not to standardize serialization in this spec.

5. Defining sources

Definitely in favor of this.

6. Passing query as algebra

Definitely in favor of this.

7. Defining query syntax format

Also discussed in #6, I have no need for anything else than SPARQL but I defer to people working with multiple query languages on this one.

jacoscaz commented 3 years ago

Some of my notes from today's call with @gsvarovsky and @rubensworks...

1. Query method interface

@gsvarovsky pointed out that the single method approach leads to code which is not as immediate and easy to grasp. Nonetheless, we ultimately settled on this approach, mainly due to its flexibility, potential for optimization and the fact that it's the only method capable of covering all of our current use-cases. We evaluated using base classes for convenience methods but ultimately elected not to include convenience methods to keep the spec as small as possible.

2. Representing bindings

@rubensworks explained the inherent risk of conflicts with native object properties when using bindings representations based on simple JavaScript objects. We discussed using an object-based representation with instance-level methods strictly related to reading bindings (.get('?var')) and class-level static methods for more general manipulation of bindings. We also believe that following in the footsteps of the RDF/JS data model by using factory functions would be a good idea (const bindings = DataFactory.bindings()).

Open question: should we keep the ? in variable names?
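
The property conflict mentioned above can be illustrated with a short sketch. It assumes variable names are stored without the leading `?` (which the open question leaves undecided), and `MapBindings` is a hypothetical class, not an existing implementation.

```typescript
// Minimal stand-in for an RDF/JS term (assumption: only these two fields matter here).
interface Term { termType: string; value: string }

// Plain-object bindings: inherited properties look like bound variables.
const plain: Record<string, Term> = {};
const ghost = 'constructor' in plain; // true, inherited from Object.prototype

// Hypothetical object-based bindings exposing read-only instance methods.
class MapBindings {
  private readonly entries: Map<string, Term>;
  constructor(entries: Array<[string, Term]> = []) {
    this.entries = new Map(entries);
  }
  has(variable: string): boolean { return this.entries.has(variable); }
  get(variable: string): Term | undefined { return this.entries.get(variable); }
}

const bindings = new MapBindings([['a', { termType: 'NamedNode', value: 'ex:a' }]]);
const safe = bindings.has('constructor'); // false, only explicit entries count
```

A query that happens to use a variable such as `?constructor` or `?toString` would hit the inherited property in the plain-object case, while the method-based variant only ever sees explicitly stored entries.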

3. Exposing metadata

Related to point 1), we discussed the single method approach, with the main query() method returning an intermediate result object having a metadata(): Promise<Metadata> method. Standardized metadata would include quad/bindings count and ordering. We noticed that the intermediate result object overlaps with the FilterableSource spec.
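
A rough sketch of what such an intermediate result object could look like. The `Metadata` fields, the `IntermediateResult` interface and the `resultOver` helper are assumptions made for illustration only.

```typescript
// Hypothetical metadata shape: a count plus the variables results are ordered by.
interface Metadata {
  cardinality: number;
  order: string[];
}

// Hypothetical intermediate result: metadata can be requested before
// (or instead of) consuming the actual results.
interface IntermediateResult<T> {
  metadata(): Promise<Metadata>;
  execute(): Promise<T[]>;
}

// Toy in-memory implementation over an already materialized array.
function resultOver<T>(rows: T[], order: string[] = []): IntermediateResult<T> {
  return {
    metadata: async () => ({ cardinality: rows.length, order }),
    execute: async () => rows,
  };
}

const result = resultOver([{ a: 'ex:a' }, { a: 'ex:b' }], ['a']);
const pendingMetadata = result.metadata(); // resolves to { cardinality: 2, order: ['a'] }
```

Keeping `metadata()` and `execute()` on the same object is what would allow an engine to share computation between the two, as discussed earlier in the thread.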

4. Serializing results

We all agree that serialization should not fall within the scope of this spec.

5. Defining sources

Defining sources at query time allows query engines to be re-used across sources. This would be best modeled by passing sources as a parameter/option of the main query method: .query('SELECT ...', { sources: [store]}).
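
As a toy illustration of engine re-use across sources: the types below are simplified stand-ins rather than RDF/JS interfaces, and the "query" is reduced to a single predicate lookup in place of a SPARQL string.

```typescript
// Minimal stand-ins; a real implementation would use RDF/JS quads and sources.
interface Quad { subject: string; predicate: string; object: string }
interface Source { match(predicate: string): Quad[] }
interface QueryContext { sources: Source[] }

// Toy source backed by an array.
class ArraySource implements Source {
  private readonly quads: Quad[];
  constructor(quads: Quad[]) { this.quads = quads; }
  match(predicate: string): Quad[] {
    return this.quads.filter((q) => q.predicate === predicate);
  }
}

// Toy engine holding no data of its own. The "query" here is just a predicate
// IRI standing in for a SPARQL string, to keep the sketch short.
class ToyEngine {
  query(predicate: string, context: QueryContext): Quad[] {
    const results: Quad[] = [];
    for (const source of context.sources) {
      results.push(...source.match(predicate));
    }
    return results;
  }
}

const engine = new ToyEngine();
const storeA = new ArraySource([{ subject: 'ex:s1', predicate: 'ex:p', object: 'ex:o1' }]);
const storeB = new ArraySource([{ subject: 'ex:s2', predicate: 'ex:p', object: 'ex:o2' }]);

const fromA = engine.query('ex:p', { sources: [storeA] });            // 1 quad
const fromBoth = engine.query('ex:p', { sources: [storeA, storeB] }); // 2 quads
```

The same engine instance serves both calls; only the context changes.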

6. Passing query as algebra

We considered basing the spec around two different query methods, one taking a SPARQL string and the other taking a SPARQL Algebra object. Implementors would be free to implement either/or.
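
One way to picture the either/or choice, purely as a sketch: the interface and method names below are hypothetical, and `AlgebraOperation` is a stand-in for a parsed algebra object such as those produced by sparqlalgebrajs.

```typescript
// Stand-in for a parsed SPARQL Algebra operation.
interface AlgebraOperation { type: string }

// Hypothetical split into two optional capabilities.
interface StringQueryable<R> { queryString(query: string): R }
interface AlgebraQueryable<R> { queryAlgebra(operation: AlgebraOperation): R }

// Type guards let callers discover what a given engine supports.
function isStringQueryable<R>(engine: object): engine is StringQueryable<R> {
  return typeof (engine as Partial<StringQueryable<R>>).queryString === 'function';
}
function isAlgebraQueryable<R>(engine: object): engine is AlgebraQueryable<R> {
  return typeof (engine as Partial<AlgebraQueryable<R>>).queryAlgebra === 'function';
}

// An engine that only accepts algebra objects.
const algebraOnly: AlgebraQueryable<string> = {
  queryAlgebra: (operation) => `executed ${operation.type}`,
};

const acceptsStrings = isStringQueryable(algebraOnly);  // false
const acceptsAlgebra = isAlgebraQueryable(algebraOnly); // true
```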

7. Defining query syntax format

We all agree on keeping this spec SPARQL-based.

blake-regalia commented 3 years ago

Was there a posting about scheduling a call? Best to keep such things open to the community instead of behind closed doors.

rubensworks commented 3 years ago

@blake-regalia There were no formal RDF/JS calls, no. Just some informal talk between @gsvarovsky, @jacoscaz, and myself about the overlaps between our work, and potential alignments.

Definitely open to having a call about the query spec, but not sure there is a real need for one at this stage? Discussions via GH issues seem to be progressing quite well.

blake-regalia commented 3 years ago

I have done a lot of work with query impls tangential to graphy, so I do feel I want to be part of the conversation but haven't had the bandwidth to type up lengthy responses. I would appreciate being part of the discussion over the phone however, just saying.

rubensworks commented 3 years ago

I would appreciate being part of the discussion over the phone however, just saying.

@blake-regalia Of course! Would you like to initiate scheduling a call?

RubenVerborgh commented 3 years ago

Open question: should we keep the ? in variable names?

Just reacting to this tiny nit: I would suggest to drop the question mark.

SPARQL already has two syntaxes for variables, one with ? and one with $, and they indicate the same variable:

A query variable is marked by the use of either "?" or "$"; the "?" or "$" is not part of the variable name.

—https://www.w3.org/TR/sparql11-query/#QSynVariables
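
A small sketch of the consequence: if the marker is stripped, both spellings resolve to the same name. The `variableName` helper is a hypothetical name for illustration.

```typescript
// Strip one leading "?" or "$" so that both SPARQL spellings map to the same name.
function variableName(token: string): string {
  return token.replace(/^[?$]/, '');
}

const sameVariable = variableName('?name') === variableName('$name'); // true
```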

bergos commented 3 years ago

Please keep the variable interface of the Data Model spec in mind.

It should be possible to use variable term objects as identifiers in bindings:

const bindings = ...
const a = factory.variable('a')
const term = bindings.get(a)

Then there is no need to open the leading ? discussion, because you can point to the Data Model spec, which defines the value of a variable term like this:

value the name of the variable without leading "?" (example: "a").
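
A minimal sketch of bindings keyed by variable terms, using simplified stand-ins instead of the real RDF/JS types. `TermKeyedBindings` and the local `variable` factory are hypothetical and only mimic the shape described above.

```typescript
// Minimal stand-ins for RDF/JS terms (assumption: termType and value suffice here).
interface Term { termType: string; value: string }
interface Variable extends Term { termType: 'Variable' }

// Illustrative factory; per the Data Model spec, `value` has no leading "?".
function variable(name: string): Variable {
  return { termType: 'Variable', value: name };
}

// Hypothetical bindings keyed by variable terms. Entries are indexed by the
// variable's value, because a Map keyed by the term objects themselves would
// compare by identity and miss equal-but-distinct Variable instances.
class TermKeyedBindings {
  private readonly byName = new Map<string, Term>();
  set(key: Variable, term: Term): this {
    this.byName.set(key.value, term);
    return this;
  }
  get(key: Variable): Term | undefined {
    return this.byName.get(key.value);
  }
}

const bindings = new TermKeyedBindings().set(variable('a'), { termType: 'NamedNode', value: 'ex:a' });
const term = bindings.get(variable('a')); // found, although this is a new Variable object
```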

jacoscaz commented 3 years ago

It should be possible to use variable term objects as identifier in bindings

This is a very sensible consideration IMHO, although I do wonder about the effects on performance (and complexity?) in long chains of transformations. That said, I would be 100% in favor of using object-based representation of variables if at all possible.

rubensworks commented 3 years ago

I do wonder about the effects on performance (and complexity?) in long chains of transformations.

I actually think this should be pretty ok performance-wise.

The only downside of this would be that it would be a bit less convenient for interface users to access values of a certain variable. But this is similar to the discussion around #6, as more dev-friendly abstractions can easily be built on top of this.

rubensworks commented 3 years ago

FYI, possibility for a call about this on the mailing list: https://lists.w3.org/Archives/Public/public-rdfjs/2021Oct/0000.html

jacoscaz commented 3 years ago

PR to extend the discussion to everyone interested at https://github.com/rdfjs/query-spec/pull/7

jacoscaz commented 2 years ago

We've recently merged https://github.com/rdfjs/query-spec/pull/7, which includes and elaborates upon what was discussed in this issue. I think we can close this in favor of more focused issues - @rubensworks final word up to you!

rubensworks commented 2 years ago

Sounds good! Let's create new issues where needed based on the experimental interfaces in https://github.com/rdfjs/query-spec/blob/master/queryable-spec.ts

Once we're happy, we can create a proper spec.

For reference, I've started implementing these interfaces in @rdfjs/types in a new branch: https://github.com/rdfjs/types/tree/feature/query Experimenting with them in Comunica as we speak.
