qri-io / jsonschema

golang implementation of https://json-schema.org drafts 7 & 2019-09
MIT License
459 stars 54 forks source link

JSON Schema Regex issues #47

Open utx0 opened 5 years ago

utx0 commented 5 years ago

I am currently working with a JSON Schema that is shared with a frontend application for building and validating a data structure that then get's passed to my backend go service which also uses the schema for data validation. Now the issue I'm having is the schema in question is using the regex pattern keyword for data validation and this works fine when its in the js world, however when using this schema with this jsonschema package I get the following error:

panic: error unmarshaling properties from json: error unmarshaling items from json: json: cannot unmarshal object into Go value of type []*jsonschema.Schema

goroutine 1 [running]:
github.com/qri-io/jsonschema.Must(0xc000166000, 0x776, 0x776)
        /Users/metta/golang/pkg/mod/github.com/qri-io/jsonschema@v0.0.0-20190413152851-094d15abc20e/schema.go:24 +0xbf

Now from my research go uses a different regex engine under the hood. And I am wondering how other deal with this when sharing a schema between different languages such as go and js?

b5 commented 5 years ago

Thankfully the spec has guidance on regex interop. Specifically:

given the high disparity in regular expression constructs support, schema authors SHOULD limit themselves to the following regular expression tokens:

individual Unicode characters, as defined by the JSON specification [RFC7159];
simple character classes ([abc]), range character classes ([a-z]);
complemented character classes ([^abc], [^a-z]);
simple quantifiers: "+" (one or more), "*" (zero or more), "?" (zero or one), and their lazy versions ("+?", "*?", "??");
range quantifiers: "{x}" (exactly x occurrences), "{x,y}" (at least x, at most y, occurrences), {x,} (x occurrences or more), and their lazy versions;
the beginning-of-input ("^") and end-of-input ("$") anchors;
simple grouping ("(...)") and alternation ("|").

So, basically, when writing a schema, try to keep the regex simple. But we can get a little more specific b/c you're dealing in go-to-js regex conversion. Javascript is supposed to use ECMA-262 regex, and go's regexp library is built to satisfy RE2. Wikipedia has a great comparison between the two:

Looks like the two biggest difference on the "basic features" are backreferences and look-ahead. So for your purposes it would be worth keeping those in mind. JS doesn't support any of the fancy (scary?) features on the second table, so we don't need to worry about those.

I consider it a bug that you didn't get a good, human-readable error when regex parsing fails due to these limitations. If you have any interest in filing a PR that adds a test and a better error message, I'd be deeply appreciative. A great place to start would be returning a better error message here: https://github.com/qri-io/jsonschema/blob/094d15abc20e286f43f992a2f032c05704fa5ab6/keywords_format.go#L258

dolmen commented 4 years ago

The implementation uses package regexp to validate pattern properties. But the regexp package does not implement the full regexp syntax of ECMA-262 used in the JSON Schema spec. This means that jsonschema is not compliant.

This problem is not even documented in the README or in package documentation.

b5 commented 4 years ago

I agree that this package is not 100% compliant with the spec, but not for this reason. The section from the spec you're quoting:

This string SHOULD be a valid regular expression, according to the ECMA 262 regular expression dialect.

I interpret the use of the word SHOULD in the above sentence to mean that both json schema implementers and spec authors SHOULD aim for ECMA-262. If fully implementing ECMA-262 were required to be considered part of the spec, I would expect the above sentence to use MUST instead of SHOULD.

Doing a full ECMA-262 spec implementation would require an extra import that isn't in the go standard library. We haven't had anyone ask for it yet, and I prefer to stick to standard library imports whenever possible, but if this is a major sticking point for people using this package, I'd be happy to talk through it.

This problem is not even documented in the README or in package documentation.

We'd gladly accept a pull request that documents the difference in regex parsing.

handrews commented 4 years ago

Yeah, regexes are a bit of an interoperability nightmare. As an example, you'll note that . (dot) is not included in that list of tokens that people should limit themselves to, yet it's one of the most commonly used regex tokens.

There's no great solution, the language in the spec is intended to indicate that JSON Schema inherits the same interoperability problems, and we pick one recommended standard with the understanding that none of the standards are universally supported.

joshuacc commented 2 years ago

I also ran into this issue when using negative lookbehind in the regex pattern, and it took me a while to understand what the problem was.

Any chance we could get clearer error message when this occurs?

dolmen commented 1 year ago

The Otto project (a JavaScript interpreter) has a function that fixes some incompatibilities between Go regexp and JS regexp.

But that will not add lookbehind.

dolmen commented 1 year ago

Module github.com/dlclark/regexp2 has an ECMAScript mode. But it is important for safety (protect against DoS) to set timeouts, so the jsonschema API must allow to set regexp time limits.