albehrens opened this issue 4 years ago
Hi @albehrens! You're certainly welcome for the lib – all the hard work is being done by apg-lib 🙃
The point you're bringing up has been discussed internally here, and our very first implementation actually made some attempts in that direction. However, in our use case the schemas specified by SCIM do not match our own internal schema: not only do field names differ, but the actual structure of entities, relational constructs, and even concepts are quite different. The necessary transformations might be solvable in the database layer using views, but even that assumes everything lives in a single database. For our use case we only have single-digit thousands of users (as this is only being used on internal systems), so it was far easier to just load and transform the entities in code and then apply the filter.
I will say that while the examples show arrays being filtered, the returned functions can be used with observables or async iterators such that you don't have to load everything into memory at once.
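As a rough sketch (the export name `compileFilter` and the async `fetchUsers` source here are assumptions for illustration), filtering an async iterator could look like this:

```ts
import { compileFilter } from "scim-query-filter-parser"; // export name assumed

// Hypothetical source that yields users one at a time, e.g. from a DB cursor.
async function* fetchUsers() {
  yield { userName: "mike", active: true };
  yield { userName: "alex", active: false };
}

async function filterUsers(filterExpression: string) {
  // The compiled filter is just a (resource) => boolean predicate...
  const matches = compileFilter(filterExpression);
  const results: Record<string, unknown>[] = [];
  // ...so it can be applied to each item as it arrives instead of to a full array.
  for await (const user of fetchUsers()) {
    if (matches(user)) results.push(user);
  }
  return results;
}

filterUsers('userName eq "mike"').then((users) => console.log(users));
```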
All of that said, if you're interested in tackling this I can certainly point you in the right direction!
Essentially, the grammar is all defined in `grammar.abnf`. This was hand-written while referencing the formal syntax from the spec, but with a focus on practicality (left recursion, for example, isn't supported by most tools) and with a goal of naming each meaningful syntax feature. Running `yarn generate` (which I just added while writing this) or `apg -i grammar.abnf -o grammar.js` uses the `apg-lib` package to "compile" this into `grammar.js`. This JS data object is then used by the general-purpose parser at runtime.
When a new filter expression is to be compiled, it is first parsed. Then we use APG's AST translation utilities to translate it, providing an instance of `Yard` as shared mutable state.
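In broad strokes this parse-and-translate step follows the usual apg-lib pattern, sketched below; this is an approximation rather than the library's actual code (the permalink that follows has that), and the `"FILTER"` start-rule name, the input handling, and the `yard` literal are assumptions:

```ts
// apg-lib ships without TypeScript declarations, so require() is used here.
const apgLib = require("apg-lib");
const Grammar = require("./grammar.js"); // generated from grammar.abnf by `yarn generate`

const parser = new apgLib.parser();
parser.ast = new apgLib.ast();
// ...rule callbacks are registered on parser.ast before parsing (see further below)...

// Depending on the apg-lib version, the input string may first need to be
// converted to an array of character codes.
const result = parser.parse(new Grammar(), "FILTER", 'userName eq "mike"');
if (!result.success) throw new Error("Invalid filter expression");

// Walk the resulting AST; each registered callback mutates the shared state.
const yard = { filters: [] as unknown[] };
parser.ast.translate(yard);
```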
https://github.com/the-control-group/scim-query-filter-parser-js/blob/6cb88fe852fa4bbb75d0369bf36384e9abb854b6/src/index.ts#L52-L54
This works by traversing the parsed expression and calling the registered callbacks as the corresponding grammar symbols are encountered:

https://github.com/the-control-group/scim-query-filter-parser-js/blob/6cb88fe852fa4bbb75d0369bf36384e9abb854b6/src/index.ts#L27-L43

Each of these "callbacks" manipulates the `yard` state, and at the end, we return the first and only "filter" from the yard.
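An individual callback has roughly this shape (again a sketch: the `attrPath` rule, the `MyYard` fields, and the registration key are illustrative, not the library's actual code):

```ts
const apgLib = require("apg-lib"); // no bundled type declarations, hence require()
const ids = apgLib.ids;

// Shared mutable state threaded through every callback.
interface MyYard {
  attrPaths: string[];
}

// apg-lib AST callbacks are invoked twice per matched phrase: once on the way
// down (SEM_PRE) and once on the way up (SEM_POST), with the phrase's position
// and length in the original input.
function attrPath(
  state: number,
  chars: number[],
  phraseIndex: number,
  phraseLength: number,
  data: MyYard
): number {
  if (state === ids.SEM_PRE) {
    const phrase = String.fromCharCode(
      ...chars.slice(phraseIndex, phraseIndex + phraseLength)
    );
    data.attrPaths.push(phrase);
  }
  return ids.SEM_OK;
}

// Registered against the corresponding grammar rule before translation, e.g.
// parser.ast.callbacks["attrPath"] = attrPath; (apg-lib may lower-case rule names)
```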
I created the `Yard` shape based on the JS I'm trying to assemble. However, you will likely want to create your own state shape, and callbacks that manipulate it accordingly.
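For instance, if the goal were to emit SQL rather than a predicate, the shared state might look something like this (purely illustrative; none of these names exist in the library):

```ts
// A hypothetical state shape for accumulating a parameterized SQL WHERE clause
// instead of a JS predicate.
interface SqlYard {
  // Completed SQL fragments, e.g. `"user_name" = $1`.
  fragments: string[];
  // Bound parameter values, in the order they are referenced.
  parameters: unknown[];
  // Operators seen on the way down the tree, waiting for their operands.
  pendingOperators: string[];
}

const sqlYard: SqlYard = { fragments: [], parameters: [], pendingOperators: [] };
```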
Obviously this is written in TypeScript. You can compile to JS by running `yarn build`, or `yarn build:development` which watches for changes. Tests are run from the compiled files by either `yarn test` or `yarn test:development`.
I often have two panes open in iTerm2: one building, one testing.
Hope that helps!
Hey @mike-marcacci! Thank you very much for your detailed response. I will definitely have a look at the hints and resources that you gave. We "only" have a few thousand users per client as well, so scanning them before filtering is probably fine. We will probably look into this issue more closely if we actually run into a performance problem.
@mike-marcacci Really wish I'd seen this discussion earlier! Your motivation for a simple-to-use, expression-based filterer makes a lot of sense now (N is small for you, and it's for internal use). As for the lazy filtering, it's nice to be able to "stream" users through the filter expression to avoid holding everything in memory at once, but that still doesn't solve the problem of doing the filtering outside of the DB layer (e.g. having to fetch all the results in the first place).
**The `filterASTToSQLClause` problem.** I agree that a correct `filterASTToSQLClause` implementation is non-trivial and may not even be worth it for the majority of use-cases. Not to mention, based on your design of this library, including a mature SQL-compilation solution seems out of scope.
**Sidestepping the `filterASTToSQLClause` problem.** Let me pose a question. Is the following statement true?
The vast majority of SCIM client use-cases are supplying filters of the form (other tokens added for context):
```
FILTER    = attrExp / *1"not" "(" FILTER ")"
attrExp   = (attrPath SP "pr") / (attrPath SP compareOp SP compValue)
attrPath  = [URI ":"] ATTRNAME *1subAttr
ATTRNAME  = ALPHA *(nameChar)
subAttr   = "." ATTRNAME
compareOp = "eq" / "ne" / "co" / "sw" / "ew" / "gt" / "lt" / "ge" / "le"
compValue = false / null / true / number / string
```
Note that I've purposefully omitted the recursive `logExp` and `valuePath` alternatives (see RFC 7644, §3.4.2.2).
I think this is probably true. Why would a SCIM client ever send arbitrarily complex filter queries with crazy `logExp`s? Almost every filtering use-case can be achieved by filtering on a single field (e.g. `nickName co "mike"`) to narrow the result set, then doing further filtering on the client. If we make this assumption and also limit the allowed `compareOp` and `attrPath` values to whitelists, then augmenting a SQL query isn't that hard; as anecdata, it took me about a day of work to do this using V1's `filterAST`. Because of this, I have mostly lost interest in having a perfect `filterASTToSQLClause` function. It seems like it would take days or weeks to implement correctly for little reward.
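To make that concrete, here is roughly what the single-`attrExp` case looks like; this is a sketch only, assuming a V1-style AST of the shape `[compareOp, attrPath, compValue]` and made-up column/operator whitelists:

```ts
// Hypothetical translation of a simple single-attribute filter AST into a
// parameterized SQL fragment. Anything not covered by the whitelists is
// rejected so the caller can fall back to in-memory filtering (or a 501).
const ALLOWED_COLUMNS: Record<string, string> = {
  userName: '"user_name"',
  nickName: '"nick_name"',
};

const ALLOWED_OPERATORS: Record<string, string> = {
  eq: "=",
  ne: "<>",
  gt: ">",
  ge: ">=",
  lt: "<",
  le: "<=",
};

function simpleAstToSql(
  ast: [string, string, unknown]
): { clause: string; parameters: unknown[] } | null {
  const [op, attrPath, compValue] = ast;
  const column = ALLOWED_COLUMNS[attrPath];
  const operator = ALLOWED_OPERATORS[op];
  if (!column || !operator) return null; // not expressible; let the caller decide
  return { clause: `${column} ${operator} $1`, parameters: [compValue] };
}

// simpleAstToSql(["eq", "userName", "mike"])
// => { clause: '"user_name" = $1', parameters: ["mike"] }
```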
I think the solution here is to compromise! Unfortunately, due to the change from V1 to V2, the caller can no longer make the programmatic decision to throw a 501 if a filter is too complex, or to extract the `attrPath`, `compValue`, and `compareOp` from the filter to supply to a higher-performance DB-level query. This is a non-issue with pluggable compilers, which give us the best of both worlds. Imagine this:
```ts
import { compile, filterToPredicateCompiler, filterToASTCompiler, pathToASTCompiler } from "scim-query-filter-parser";

compile("filter", filterToPredicateCompiler); // returns our simple predicate/expression as in V2
compile("filter", filterToASTCompiler); // returns our AST as in V1
compile("path", pathToASTCompiler); // returns the compiled path as in my PR (#7)
```
Still very concise syntax, and simple to use. Overall, I think this could actually be a relatively simple change to your library (essentially just a minor refactor). Again, I'm happy to implement it. I think making this library pluggable could really enable Node.js devs to more quickly implement compliant SCIM servers without sacrificing performance, customizability, or ease of use.
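For what it's worth, the contract for a pluggable compiler could be quite small, something along these lines (purely a sketch; none of these types exist in the library today):

```ts
// A hypothetical contract: every compiler receives whatever shared parse
// result the library chooses to expose and produces its own output type.
type Compiler<T> = (parsedFilter: unknown) => T;

// The existing V2 behavior expressed as one such compiler...
type Predicate = (resource: Record<string, unknown>) => boolean;
declare const filterToPredicateCompiler: Compiler<Predicate>;

// ...and a user-supplied compiler targeting SQL could slot in the same way.
interface SqlClause {
  clause: string;
  parameters: unknown[];
}
declare const filterToSQLCompiler: Compiler<SqlClause>;
```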
First of all, I would like to thank the authors of this repository for their contribution. This package helped us prototype our own SCIM server quickly. Unfortunately, the way this filter parser works is not very efficient for us: we have to load all our SCIM users into memory, or at least scan our whole database.
Is there maybe an opportunity to translate your filter parser's output to a database query language like SQL or Elasticsearch's Query DSL? I would like to contribute and do it myself; I would only ask for some advice on where I should begin.