the-control-group / scim-query-filter-parser-js

Parser for SCIM Filter Query Strings
MIT License

Translate parser to database query language #6

Open albehrens opened 4 years ago

albehrens commented 4 years ago

First of all I would like to thank the authors of this repository for their contribution. This package helped us to prototype our own SCIM server quickly. Unfortunately, the way this filter parser works is not very efficient: we have to load all our SCIM users into memory, or at least scan our whole database.

Is there maybe an opportunity to translate your filter parser's output to a database query language like SQL or Elasticsearch's Query DSL? I would like to contribute and do it myself; I would just ask for some advice on where to begin.

mike-marcacci commented 4 years ago

Hi @albehrens! You're certainly welcome for the lib – all the hard work is being done by apg-lib 🙃

The point you're bringing up has been discussed internally here, and our very first implementation actually made some attempts in that direction. However, in our use case the schemas specified by SCIM do not match our own internal schema: not only do field names differ, but the actual structure of entities, relational constructs, and even concepts are quite different. The necessary transformations might be solvable in the database layer using views, but even that assumes everything exists on a single database. For our use case, we only have single-digit thousands of users (as this is only being used on internal systems), so it was far easier to just load and transform the entities in code and then apply the filter.

I will say that while the examples show arrays being filtered, the returned functions can be used for observables or async iterators such that you don't have to load everything into memory at once.
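To illustrate the streaming point, here is a minimal sketch of feeding rows through a predicate with an async generator. The `fakeCursor` source and the inline predicate are stand-ins for a real database cursor and the function this library compiles; neither is part of its API.

```typescript
// A SCIM resource is treated here as a loose key/value record.
type Resource = Record<string, unknown>;

// Lazily apply a predicate to an async stream of rows: only one row
// is held in memory at a time, and non-matching rows are dropped.
async function* applyFilter(
  rows: AsyncIterable<Resource>,
  filter: (r: Resource) => boolean
): AsyncIterable<Resource> {
  for await (const row of rows) {
    if (filter(row)) yield row;
  }
}

// Hypothetical streaming source (e.g. a DB cursor) for demonstration.
async function* fakeCursor(): AsyncIterable<Resource> {
  yield { userName: "mike" };
  yield { userName: "albehrens" };
}

// Collect the matches so the sketch has an observable result.
export async function collectMatches(): Promise<Resource[]> {
  const matches: Resource[] = [];
  const isMike = (u: Resource) => u.userName === "mike";
  for await (const r of applyFilter(fakeCursor(), isMike)) {
    matches.push(r);
  }
  return matches;
}
```

The same shape works with an observable: subscribe to the source and forward only the rows the compiled predicate accepts.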

All of that said, if you're interested in tackling this I can certainly point you in the right direction!

Generating Grammar

Essentially, the grammar is all defined in grammar.abnf. This was hand-written while referencing the formal syntax from the spec, but with a focus on practicality (left recursion, for example, isn't supported by most tools) and with the goal of naming each meaningful syntax feature. Running yarn generate (which I just added while writing this) or apg -i grammar.abnf -o grammar.js uses the apg-lib package to "compile" this into grammar.js. This JS data object is then used by the general-purpose parser at runtime.

When a new filter expression is to be compiled, it is first parsed:

https://github.com/the-control-group/scim-query-filter-parser-js/blob/6cb88fe852fa4bbb75d0369bf36384e9abb854b6/src/index.ts#L46-L50

Then, we use APG's AST translation utilities to translate it, providing an instance of Yard as a shared mutable state. https://github.com/the-control-group/scim-query-filter-parser-js/blob/6cb88fe852fa4bbb75d0369bf36384e9abb854b6/src/index.ts#L52-L54

This works by traversing the parse tree and calling the registered callbacks as the corresponding symbols are encountered. https://github.com/the-control-group/scim-query-filter-parser-js/blob/6cb88fe852fa4bbb75d0369bf36384e9abb854b6/src/index.ts#L27-L43

Each of these "callbacks" manipulates the yard state, and at the end, we return the first and only "filter" from the yard.

I created the Yard shape based on the JS I'm trying to assemble. However, you will likely want to create your own state shape, and callbacks that manipulate it accordingly.
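As a rough illustration of that last point, a SQL-oriented state shape might accumulate WHERE-clause fragments and positional parameters instead of JS predicates. Everything below (`SqlYard`, `pushComparison`, the column names) is hypothetical and not part of this library; the real callbacks would be registered against the grammar symbols exactly as in src/index.ts.

```typescript
// Hypothetical "yard" that collects SQL instead of predicates.
interface SqlYard {
  fragments: string[]; // completed WHERE-clause fragments
  params: unknown[];   // positional parameters, never inlined as text
}

// What a callback for an attrExp like `userName eq "mike"` might do:
// push the value as a parameter and emit a placeholder-based fragment.
function pushComparison(
  yard: SqlYard,
  column: string,
  op: "=" | "<>" | ">" | "<" | ">=" | "<=",
  value: unknown
): void {
  yard.params.push(value);
  yard.fragments.push(`${column} ${op} $${yard.params.length}`);
}

// A logExp callback would instead pop fragments and rejoin them
// with AND/OR; here we just join everything with AND for brevity.
export function demo(): { sql: string; params: unknown[] } {
  const yard: SqlYard = { fragments: [], params: [] };
  pushComparison(yard, "user_name", "=", "mike");
  pushComparison(yard, "age", ">=", 21);
  return { sql: yard.fragments.join(" AND "), params: yard.params };
}
```

The key design choice mirrors the Yard: callbacks only mutate shared state, so swapping the output language means swapping the state shape and callbacks, not the parser.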

General Building & Testing

Obviously this is written in TypeScript. You can compile to JS by running yarn build, or yarn build:development, which watches for changes.

Tests are run from the compiled files by either yarn test or yarn test:development.

I often have two panes open in iTerm2: one building, one testing.

Helpful links:

Hope that helps!

albehrens commented 4 years ago

Hey @mike-marcacci! Thank you very much for your elaborate response. I will definitely have a look at the hints and resources that you gave. We "only" have a few thousand users per client as well. So scanning them before filtering is probably fine. We will probably look into this issue more closely if we actually run into a performance problem.

jaylattice commented 4 years ago

@mike-marcacci Really wish I saw this discussion earlier! Your motivation for a simple-to-use expression-based filterer makes a lot of sense now (N is small for you, and it's internal-use). As for the lazy-filtering, it's nice to be able to "stream" users into the filterer expression to avoid using tons of memory at any given time, but that still doesn't solve the problem of having to do the filtering processing outside of the DB layer (e.g. having to fetch all the results in the first place).

This library should not solve the filterASTToSQLClause problem.

I agree that a correct filterASTToSQLClause implementation is non-trivial and may not even be worth it for the majority of use-cases. Not to mention, given your design of this library, a mature SQL compilation solution seems out of scope.

Maybe nobody should solve the filterASTToSQLClause problem.

Let me pose a question. Is the following statement true?

The vast majority of SCIM client use-cases are supplying filters of the form (other tokens added for context):

```abnf
FILTER    = attrExp / *1"not" "(" FILTER ")"
attrExp   = (attrPath SP "pr") / (attrPath SP compareOp SP compValue)
attrPath  = [URI ":"] ATTRNAME *1subAttr
ATTRNAME  = ALPHA *(nameChar)
subAttr   = "." ATTRNAME
compareOp = "eq" / "ne" / "co" / "sw" / "ew" / "gt" / "lt" / "ge" / "le"
compValue = false / null / true / number / string
```

Note that I've purposefully omitted the recursive logExp and valuePath alternatives (see RFC7644-3.4.2.2)

I think this is probably true. Why would a SCIM client ever be sending arbitrarily complex filter queries with crazy logExps? Almost every filtering use-case can be achieved by filtering based on a single field (e.g. 'nickName co "mike"') to narrow the result set, then just doing further filtering on the client. If we make this assumption and also limit the allowed compareOp and attrPath to whitelists, then augmenting a SQL query isn't that hard. As anecdata, it took me about a day of work to do this using V1's filterAST. Because of this, I have mostly lost interest in having a perfect filterASTToSQLClause function. It seems like it would take days/weeks to implement correctly for little reward.
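The whitelist approach described above can be sketched in a few lines. The AST triple below loosely mirrors V1's filterAST shape, and the column/operator maps are made-up examples; treat all names here as assumptions, not this library's output. Anything outside the whitelists returns null so the caller can respond 501.

```typescript
// Hypothetical whitelists: only known attributes and operators are
// translated; everything else is rejected rather than guessed at.
const COLUMNS: Record<string, string> = {
  userName: "user_name",
  nickName: "nick_name",
};

const OPS: Record<string, string> = {
  eq: "=",
  ne: "<>",
  gt: ">",
  lt: "<",
  ge: ">=",
  le: "<=",
  // "co"/"sw"/"ew" would need a LIKE mapping with pattern escaping,
  // so they are deliberately left out of this sketch.
};

// Translate a single attrExp triple like ["eq", "userName", "mike"]
// into a parameterized WHERE fragment, or null if unsupported.
export function attrExpToSql(
  ast: [string, string, string | number | boolean]
): { where: string; params: unknown[] } | null {
  const [op, attr, value] = ast;
  const column = COLUMNS[attr];
  const sqlOp = OPS[op];
  if (!column || !sqlOp) return null; // caller can return 501 here
  return { where: `${column} ${sqlOp} $1`, params: [value] };
}
```

Passing the value as a bound parameter rather than interpolating it into the string is what keeps even this tiny translator injection-safe.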

@albehrens and my use-case motivates V3 of this library: Pluggable Compilers

I think the solution here is to compromise! Unfortunately, due to the change from V1 to V2, the caller can no longer make the programmatic decision to throw a 501 if a filter is too complex, or to extract the attrPath, compValue, and compareOp from the filter to supply to a higher-performance DB-level query. This is a non-issue with pluggable compilers. Pluggable compilers give us the best of both worlds. Imagine this:

```js
import {
  compile,
  filterToPredicateCompiler,
  filterToASTCompiler,
  pathToASTCompiler,
} from "scim-query-filter-parser";

compile("filter", filterToPredicateCompiler); // returns our simple predicate/expression as in V2
compile("filter", filterToASTCompiler); // returns our AST as in V1
compile("path", pathToASTCompiler); // returns the compiled path as in my PR (#7)
```

Still very concise syntax, and simple to use. Overall, I think this could actually be a relatively simple change to your library (essentially just a ~minor refactor). Again, I'm happy to implement it. I think making this library pluggable could really enable Node.js devs to more quickly implement compliant SCIM servers, without sacrificing performance or customizability, or ease-of-use.
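To show why this is only a minor refactor, here is one way the pluggable `compile` could be structured internally: the parse step stays exactly as it is, and each compiler is just a function from the parsed tree to an output. All names below are hypothetical, sketching the proposal rather than the library's current code.

```typescript
// A compiler is any function from the parsed tree to some output
// (a predicate, an AST, a SQL fragment, ...).
type Compiler<T> = (parsed: unknown) => T;

// Stand-in for the existing apg-lib parse step in src/index.ts.
function parse(rule: "filter" | "path", input: string): unknown {
  return { rule, input };
}

// The pluggable entry point: bind a grammar rule to a compiler,
// and get back a function that compiles input strings.
export function compile<T>(
  rule: "filter" | "path",
  compiler: Compiler<T>
): (input: string) => T {
  return (input) => compiler(parse(rule, input));
}

// Usage sketch: an identity "compiler" just returns the parse result,
// while filterToPredicateCompiler etc. would transform it further.
export const toTree = compile("filter", (parsed) => parsed);
```

Under this shape, V2's predicate behavior and V1's AST behavior become two compilers shipped alongside the parser, and downstream users can add their own (e.g. a SQL compiler) without forking.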