ietf-wg-jsonpath / draft-ietf-jsonpath-base

Development of a JSONPath internet draft
https://ietf-wg-jsonpath.github.io/draft-ietf-jsonpath-base/
Other
59 stars 20 forks source link

Regular expressions in filters #70

Closed glyn closed 2 years ago

glyn commented 3 years ago

Issue 17 proposed regular expressions in filters, which is supported by several implementations of JSONPath. This raises two questions:

  1. What is the precise syntax of regular expressions?
  2. How should ReDOS attacks be mitigated?

One approach which addresses both these questions would be to adopt RE2 syntax as discussed in ReDOS attacks, although not all languages may yet have support for RE2.

cabo commented 3 years ago

So you are saying JSONpath should only support regular expression literals, not strings with regular expressions.

danielaparker commented 3 years ago

So you are saying JSONpath should only support regular expression literals, not strings with regular expressions.

Unfortunately, no, I'm saying that at the time I wrote that sentence, I had forgotten that Perl/Python/Ruby et al support variables inside the "/" delimiters. I wish I could interpret it the way you suggested.

cabo commented 3 years ago

Do you mean as in (1) interpolating variables into the RE (#{name}) or (2) filling in variables with results of the RE scan (?P<name>...)?

So far, I haven't seen any proposal for variables in the JSONPath expression language that could go into either place.

cabo commented 3 years ago

RE2 syntax would have the advantage that nobody of us likes it :-) But seriously, any RE syntax that is not already documented will need to be defined in the document. That leaves the ECMAscript syntax of the week or W3C XSD syntax. ECMAscript is a bit anemic but changes all the time. W3C XSD REs are exactly what would be needed here, but are not very popular.

danielaparker commented 3 years ago

@cabo wrote:

Do you mean as in (1) interpolating variables into the RE (#{name}) or (2) filling in variables with results of the RE scan (?P<name>...)?

So far, I haven't seen any proposal for variables in the JSONPath expression language that could go into either place.

Perhaps an expression e.g. @.pattern that evaluated to a string should be allowed within "/" delimiters. I haven't thought this through, I only have experience implementing it with character literals, and the JSONPath comparison only has character literals. I'll submit a pull request with JSONPath comparison with a match against an expression, to see what everybody else does.

glyn commented 3 years ago

I would prefer to restrict the right hand side of a regular expression match (=~) to a literal regular expression. I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression. I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).

danielaparker commented 3 years ago

I would prefer to restrict the right hand side of a regular expression match (=~) to a literal regular expression.

After giving this some thought, I agree.

Introducing "/" delimiters for regular expressions suggests static "compilation" of the regular expression, as it does in perl/ruby/python. If in JSONPath we're applying a regular expression in a search of a million items, we wouldn't want the regular expression to depend on the data, as that would defeat the purpose of introducing the "/" delimiters in the first place. We wouldn't want to force implementations to "recompile" the regular expression every time the current node changed.

Also, I don't think my speculation above about allowing @.child inside "/" delimiters makes sense, as the symbols allowed within "/" delimiters are reserved for regular expression symbols.

These observations are consistent with the Jayway Java implementation of the "~=" operator.

In terms of functional requirements, regular expressions are very useful for search and filtering, but such expressions are generally statically known.

I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression.

I don't fully understand this sentence. JSONPath expressions don't produce character literals, character literals are a static concept. JSONPath expressions can produce strings, but that's not the same thing. The key feature of character literals is that they don't depend on the data, and in particular on the current node @.

I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).

I don't think anybody would propose that!

cabo commented 3 years ago

On 2021-03-15, at 12:51, Daniel Parker @.***> wrote:

I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).

I don't think anybody would propose that!

Well, then let me propose that :-)

Building a RE out of data that has been found in the input is not that outlandish, I think. I’m not sure that we ultimately want to support that, but it is not something that is self-evidently excluded.

Grüße, Carsten

glyn commented 3 years ago

I might be persuaded that we should also allow a JSONPath expression which evaluates to a (single) literal regular expression.

I don't fully understand this sentence. JSONPath expressions don't produce character literals, character literals are a static concept. JSONPath expressions can produce strings, but that's not the same thing. The key feature of character literals is that they don't depend on the data, and in particular on the current node @.

Pardon my confusion, but what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

danielaparker commented 3 years ago

@glyn wrote:

Pardon my confusion, but what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

One minor point, I believe it should say that it evaluates to a string "x.*y", rather than character literals /x.*y/.

Otherwise, I believe that that's technically sound, and permits an efficient implementation, in the sense of "compile" regular expression once and apply many times. And it's somewhat more likely that there is a regular expression to be found at an absolute location in the JSON document, as opposed to in all current nodes :-) I still think it's an unlikely user requirement. If you do become persuaded to support this syntax, it probably won't be because I've tried to persuade you.

cabo commented 3 years ago

On 2021-03-15, at 17:17, Daniel Parker @.***> wrote:

And it's somewhat more likely that there is a regular expression to be found

The expression language could also contain something like RegExp.escape, and it is likely to be helpful to compose the regular expression out of constant and variable (from input item) parts. What’s out there in today’s implementations?

Grüße, Carsten

danielaparker commented 3 years ago

@cabo wrote:

The expression language could also contain something like RegExp.escape, and it is likely to be helpful to compose the regular expression out of constant and variable (from input item) parts. What’s out there in today’s implementations?

For implementations that rely on Javascript/Python/Perl/PHP et al for a filter predicate implementation, the capabilities of the regular expression evaluator are of course the capabilities of the scripting language, but I think the support for providing input for the evaluator are quite limited, perhaps limited to a string with pre-evaluated @.something terms.

As it turns out, no implementation supports @.something terms in regular expressions.

gregsdennis commented 3 years ago

...what I had in mind was to allow the likes of ... =~ $.a where $.a evaluates to a regular expression such as /x.*y/.

I could get behind this.

But I definitely wouldn't want to "construct" a RegEx from input data, as others have also been opposed to.

cabo commented 3 years ago

On 2021-03-15, at 18:01, Daniel Parker @.***> wrote:

Do you have some candidate syntax in mind that you would like to have investigated?

It would easy to come up with something, but I’d rather not invent something — if there is no current practice for this, we should not include it.

Grüße, Carsten

timbray commented 3 years ago

On 2021-03-15, at 18:01, Daniel Parker @.***> wrote: Do you have some candidate syntax in mind that you would like to have investigated? It would easy to come up with something, but I’d rather not invent something — if there is no current practice for this, we should not include it. Grüße, Carsten

Agree. Reminder: Charter says we're supposed to specify something "based on the common semantics and other aspects of existing implementations". So there's a judgment call to be made as to whether enough of them support a common regex that we can reasonably claim it constitutes "common semantics". My impression had been No, but it's months since I looked at the overlap chart.

gregsdennis commented 3 years ago

I can tell you neither of mine support RegEx (one is obsolete now). But I'm open to adding it to the current one.

bettio commented 3 years ago

I would really wish support for regex, and honestly I think they are nearly a must. However I'm really worried about all the implications behind them.

We are working hard for making sure that any future implementation is going to behave consistently with any other which follows our work.

I'm worried that all the complexity and differences behind regex is going to reintroduce back all the discrepancies we are trying hard to remove.

I think that If we wish any regex support we need to go for a regex syntax which is well specified and comonly implemented.

Honestly I'm not happy with RE2 that has been mentioned before, as far as I know it is available to most of languages as a C library through bindings (bindings might be really problematic).

For instance I wasn't happy with RE2 library while working with Erlang and Elixir, among all the reasons, using it would force my JSONPath implementation to depend on native code using a NIF while the whole library is 100% Elixir compiled into BEAM byte code.

IMHO we might investigate a Perl-like or POSIX-like regex support (that are quite popular). But again we should avoid getting stuck into all the regex complexities.

I have also some further points, questions and doubts:

  1. I think that regex should be supported only as a constant, therefore they must not be taken from a JSONPath evaluation result (e.g. disallow $[?(@.foo =~ @.regex)]). We might change idea on a future revision, but right now I'm worried that this might introduce further complexity, discrepancies and performance penalities (we can parse and compile the regex while parsing the JSONPath).

  2. are we going to treat regex as strings when using == comparison (e.g. $[?(@.foo == /[a-z]*/)] is true when foo is "[a-z]*") or are we going to use any special type, therefore == will never be true when comparing with a string?

  3. should we raise a parsing time error if a ill-formed regex is provided (e.g. /[a-z/)?

cabo commented 3 years ago

My 2 ¢:

Any RE syntax we decide to use will need to be thoroughly documented in JSONPath, except if it is one of the ECMAscript versions or W3C XSD syntax. (I do not really consider Posix a candidate, it is a bit anemic.) Of course, we don't need capture groups, global substitute, and can probably live without back references etc. That makes W3C XSD syntax my favorite, but it is also less widely implemented than others. It's 2021, so strings and REs will be Unicode text.

Is it (1) string =~ RE, or (2) RE =~ string, or (3) both or even (4) both but with subtly different semantics (horror experience from the Ruby language)?

  • I think that regex should be supported only as a constant, therefore they must not be taken from a JSONPath evaluation result (e.g. disallow $[?(@.foo =~ @.regex)]). We might change idea on a future revision, but right now I'm worried that this might introduce further complexity, discrepancies and performance penalities (we can parse and compile the regex while parsing the JSONPath).

That is a sensible approach. Unless there is a large installed base that depends on computed REs, I agree. We could make sure that a later extension can be added seamlessly.

  • are we going to treat regex as strings when using == comparison (e.g. $[?(@.foo == /[a-z]*/)] is true when foo is "[a-z]*") or are we going to use any special type, therefore == will never be true when comparing with a string?

We have the luxury of being able to provide strong typing, so a RE literal should be a (syntax?) error except near a =~.

  • should we raise a parsing time error if a ill-formed regex is provided (e.g. /[a-z/)?

Yes. If we can check a query before applying it to input, we should do that with failing fast.

bettio commented 3 years ago

I tried to query this example [1] using a JSONPath with a XSD regex

I tried using $.store.book[?(@.title =~ /[a-z-[s-u]]/)] (which is equivalent to $.store.book[?(@.title =~ /[a-rv-z]/)]):

Jayway: fails to understand it and returns a "bogus" result (but it doesn't fail with the "regular" regex)

Personal opinion about XSD regex: I feel like that no implementation is supporting them (I'm not aware of any), I feel like that every implementation supports whatever the language provides as builtin regex implementation (which is likely far from XSD regex syntax). Please, let me know if any implementation is supporting that syntax. If none supports that syntax, are we really sure to go for a syntax that nobody is using? Furthermore I didn't find any online tool to test regex with XSD style, conversely, PCRE style regex are easy to test (this is not really encouraging).

I also did few tests with some online test tools, following implementations just fail to understand any regex (I tried with "basic" syntax $.store.book[?(@.title =~ /[a-z]/)]):

Personal opinion: previous results make me feel skeptic about the whole regex feature. Do we know which implementations are supporting regex?

[1]:

{
    "store": {
        "book": [
            {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "a",
                "price": 8.95
            },
            {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "s",
                "price": 12.99
            },
            {
                "category": "fiction",
                "author": "Herman Melville",
                "title": "x",
                "isbn": "0-553-21311-3",
                "price": 8.99
            },
            {
                "category": "fiction",
                "author": "J. R. R. Tolkien",
                "title": "The Lord of the Rings",
                "isbn": "0-395-19395-8",
                "price": 22.99
            }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    },
    "expensive": 10
}
bettio commented 3 years ago

But there are online XPATH 3 evaluators. Have you tried testing regex with them?

As far as I know XPath/XQuery regex are different than XML Schema ones:

Note that the regular expression functions available in XQuery and XPath use a different regular expression flavor. This flavor is a superset of the XML Schema flavor described here. It adds some of the features that are available in many modern regex flavors, but not in the XML Schema flavor.

I feel like there is a lack modern tooling for testing XML Schema flavor regex (outside XML schemas), and this might be quite frustrating for end users.

By the way JSON Schema is borrowing syntax from ECMA 262.

bettio commented 3 years ago

I did some further investigation into regex support (using cburgmer's tool and his implementations list).

I usued a quite simple regex [a-rv-z] (JSON input was: [{"a": "a"}, {"b": "s"}] and JSONPath was: $[?(@.a =~ /[a-rv-z]/)]).

They are supported by 11 implementations in 7 different languages [1], interestingly JavaScript implementations were not supporting it, and 3/4 of them were completely misbehaving [2]:

I'm worried that users might expect to use regex to reliably validate/filter out invalid inputs, but they get accepted anyway.

Instead $[?(@.a =~ "[a-rv-z]")] is not supported at all.

Also I tested a XSD-like syntax, $[?(@.a =~ /[a-z-[s-u]]/)] is supported by a further limited subset of implementations (5/11).

[1] List of implementations supporting $[?(@.a =~ /[a-rv-z]/)]: C++ (1/1 implementation):

Go (2/6 implementations):

Java (2/2 implementations):

Objective-C (1/1 implementation):

Ruby (1/1 implementation):

PHP (3/4 implementations):

dotNET (1/4 implementations):

[2] JavaScript misbehaving implementations:

cabo commented 3 years ago

Interesting. While it may be a bit early to actually decide anything about regular expressions, this also seems to show some differences in expression evaluation. Can you do a similar check with == "a" and == "b"?

bettio commented 3 years ago

@cabo: it's fine. Let me know what kind of output are you looking for, so I can summarize it in a helpful way.

danielaparker commented 3 years ago

By the way JSON Schema is borrowing syntax from ECMA 262.

@bettio, what is your opinion about ECMA 262? All other things equal, and when it doesn't matter too much, I am strongly in favour of staying consistent with other standards, whether IETF recommendation or de facto standards such as JSON Schema.

Daniel

goessner commented 3 years ago

... this might be a way to go. Sharing a small, well defined subset of tc39 with JSON Schema.

timbray commented 3 years ago

I think the purpose of the RFC is to give advice to implementers and users as to what works. It's very unlikely that any existing implementation will change its semantics because of our work - but new ones will follow it if we do a good job. So I'd like the RFC to say something like "Some implementations support regular expression filters in JSONPaths; maximum interoperability can be achieved by using subset $X."

In particular, since regexes exhibit an 80/20 rule - you get 80% of the benefit with 20% of the features. Another reason a modest subset seems like a good idea for the JSONPath RFC.

On Thu, Mar 25, 2021 at 11:36 AM Stefan Goessner @.***> wrote:

... this might be a way to go. Sharing a small, well defined subset of tc39 https://tc39.es/ecma262/#sec-regexp-regular-expression-objects with JSON Schema.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ietf-wg-jsonpath/draft-ietf-jsonpath-base/issues/70#issuecomment-807253161, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEJE3AIFKRKWWMCYHT5TLTFN7DTANCNFSM4ZBSRVIQ .

gregsdennis commented 3 years ago

Some historical context: JSON Schema specifies 262 because that was what was available when the feature was added. It maintains 262 because that's what implementations now support.

I don't think this is cause enough for us to adopt it here. We should look at which RegEx specification serves our needs and use that. If it also happens to be 262, that's fine.


On a side note of implementation support, neither of my implementations support RegEx, but I'm happy to add it. I expect other maintainers wouldn't complain about it if it became a requirement.

bettio commented 3 years ago

@timbray:

Another reason a modest subset seems like a good idea for the JSONPath RFC.

The modest subset is what actually really works in a consistent manner: $.key is supported by 100% implementations, and Goessner's article is enough for that (I believe that some "basic" features are not supported by every implementation just because they are incomplete implementations, so no RFC is going to fix an unfinished or unmaintained project).

I don't see any useful reason about documenting just the "modest subset" (and I don't think it is worth it), also I think that is not compatible with our WG charter:

I will quote Barry Leiba, that asked to change the original charter.

The original charter:

The WG will develop a standards-track JSONPath specification, with the primary goal of capturing the common semantics of existing implementations and, where there are differences, choosing semantics with the goal of causing the least disruption among JSONPath users.

Barry Leiba:

I fear that this text appears to say that we primarily want to develop a dumbed-down compromise mush, and I'm sure that's not what we really mean. The primary goal is, surely, to develop a specification for JSONPath that is technically sound, complete, and useful.

And the following has been proposed:

The WG will develop a standards-track JSONPath specification that is technically sound and complete, based on the common semantics and other aspects of existing implementations. Where there are differences, the working group will analyze those differences and make choices that rough consensus considers technically best, with an aim toward minimizing disruption among the different JSONPath implementations.

Honestly what I found hard was taking decisions about quite commonly implemented "extensions" and a number of corner cases, and users love extensions (such as regex) because they find them useful.

@timbray: however if you are willing to work on it, we might document a "JSONPath Core" subset, that can be safely used across implementations for portability reasons, and for transitional purposes, but I don't agree about just limiting to it.

Sorry for the OT, let's continue the regex topic.

Edit: I wrote a quick proposal that tries to match the 2 point of view I see here (and in other discussions): https://github.com/ietf-wg-jsonpath/draft-ietf-jsonpath-base/issues/78#issuecomment-807810986

bettio commented 3 years ago

Talking about regex: I think we should start again from requirements. I'll start with some of them:

glyn commented 3 years ago

I think we all agree:

Fortunately, there's a clean syntactic and semantic separation between JSONPath and regular expression (de facto or de jure) standards.

I suggest we then stipulate RE2 for its security properties and unicode support (with "SHOULD" language) and allow for other regular expression standards (with "MAY" language).

JSONPath implementations that share a regular expression standard will interoperate on JSONPaths involving regular expressions. JSONPath implementations which don't share a regular expression standard may or may not interoperate on particular JSONPaths, depending on the regular expressions used.

In terms of compliance testing, it may be sufficient to include a few simple regular expressions which will work "everywhere".

gregsdennis commented 3 years ago

I suggest we then stipulate RE2 for its security properties and unicode support

I see your intent here, but I have to say that I'm not a fan of pure wrappers (especially as I work in .Net), and it seems that all of the implementations for RE2 are just wrappers for the C++ implementation. There are a lot of benefits to developers when a library is written in the language. I don't think we should ignore that.

Let's make sure that this comparison is considered in this decision.

glyn commented 3 years ago

Go has RE2 built in (https://golang.org/pkg/regexp/), but point taken. I agree that a wrapper is to be avoided, especially if it makes shipping a static binary impossible.

Which other regex (de facto or de jure) standards support unicode?

cabo commented 3 years ago

It's 2021, so all (notable) RE standards support Unicode.

I think, for once, recent drafts from json-schema.org got this right: Let's define a common subset, and reference that to an existing written up standard such as ECMAScript. (json-schema.org got the idea right, not necessarily the specific subset -- there is quite some fluff in that.)

If nobody else wants, I'll do a PR with an ABNF definition of such a subset. We can then debate what to leave in/throw out/add.

gregsdennis commented 3 years ago

Example StackOverflow question that could be solved using RegEx.

gregsdennis commented 2 years ago

Another question on the usage of RegEx in filters.

glyn commented 2 years ago

I find it interesting that JMESPath does not support regular expressions in filters. See, for example, this issue.

cabo commented 2 years ago

The discussion found in this issue pretty much is the reason why iregexp is the right approach. (The alternative, escaping into a platform-specific function, is not available for the present standardization effort.)

glyn commented 2 years ago

Is iregexp the right approach for JMESPath too? If so, I wonder if the JMESPath community would be interested in collaborating on iregexp?

cabo commented 2 years ago

We could ask! We can use all the help we can get.

cabo commented 2 years ago

Well, maybe I can get iregexp-02 out first...

glyn commented 2 years ago

Sounds good!

cabo commented 2 years ago

112 output:

Action: Take this discussion to the list