json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Restrict supported regexes to not require dangerous features #663

Closed hauleth closed 5 years ago

hauleth commented 6 years ago

Currently the spec allows all ECMA 262 regexes, while the "Understanding JSON Schema" Regular Expressions page lists features with the note that "it is recommended that you stick to the subset of that syntax". The problem is that one of the listed features, "(?!x), (?=x): Negative and positive lookahead", cannot be supported by many trustworthy engines: lookahead requires a backtracking implementation, and by the Chomsky hierarchy it takes the language out of the regular class. Backtracking engines can blow up even on much simpler regexes, as the StackOverflow outage example shows.

Another problem with the suggested "feature set" for JSON Schema regexes is that it still cannot be trusted, because it can be used to DoS naive regex implementations (any that build a full finite automaton up front) with a "regex of doom" like a{100}{100}{100}, which tries to build an automaton with a million nodes.

Therefore I would strongly suggest defining a very small subset of allowed regex features that would be:

I think the officially supported feature set should contain:

With the removal of backreferences and lookarounds we make the implementation of such an engine much simpler and safer while still covering most use cases.

We should probably also define some common atom classes like \d, \w and \s, for simplicity and for Unicode support.
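As an illustration, a pre-check along these lines could be sketched in JavaScript. The function name and the exact rejected constructs are hypothetical, not part of any spec or implementation:

```javascript
// Hypothetical pre-checker: reject patterns that use lookaround or
// backreferences, the two features whose removal is proposed above.
// A real implementation would parse the pattern instead of scanning it,
// since e.g. an escaped "\(?=" would be a false positive here.
function isSafePattern(pattern) {
  // (?=, (?!, (?<=, (?<! -- positive/negative lookahead and lookbehind
  if (/\(\?<?[=!]/.test(pattern)) return false;
  // \1 .. \9 -- backreferences
  if (/\\[1-9]/.test(pattern)) return false;
  return true;
}
```

Note that non-capturing groups like `(?:ab)+` still pass, since only `=`/`!` after `(?` are rejected.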

handrews commented 6 years ago

@hauleth thanks for filing this issue!

Regular expressions are always a challenge to balance. We currently address this by way of the Security Considerations section (in Validation up through draft-07, moving to Core in the next draft as patternProperties is moving to the Core specification as part of the applicator vocabulary).

This addresses the denial-of-service / regex of doom scenarios you mention, although generally I think the assumption is that JSON Schema implementations should use Regular Expression implementations that can guard against pathological behavior.

It is generally against the philosophy of JSON Schema to restrict Regular Expressions to the degree that you suggest for all uses of Regular Expressions.

However, particularly with the modular vocabulary changes coming in the next draft, you could easily define an alternate vocabulary that does restrict regular expressions, and forbids the keywords from the standard vocabularies that are more permissive. This is what I would recommend for applications that require more guarantees.

Julian commented 6 years ago

There are many, many examples of the danger of accepting untrusted user input as schemas; regexes are nowhere near the only one :)

uniqueItems on its own is enough of a way to DDoS.

There's no real general solution to these sorts of problems other than using the host OS's sandboxing mechanisms if you're going to allow untrusted schemas.

hauleth commented 6 years ago

My biggest gripe is that suggesting support for lookarounds requires a backtracking engine, while there are some popular engines based on DFAs which do not support them, notably:

So using ECMA 262 with lookaround and backreferences makes it hard to implement proper validators in such languages.

@Julian with uniqueItems the problem is somewhat mitigated by the fact that it requires sending a big JSON input, and it can be checked fairly efficiently (O(n log n)) by sorting a copy of the data and scanning linearly for adjacent duplicates. Not exactly memory efficient (we need to copy the whole array), but safe time-wise. On the other hand, a "regex of doom" like a{100}{100}{100} can take down an unprepared machine with a fairly small input.
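The sort-and-scan check described above might look like this in JavaScript. This is only a sketch: JSON.stringify stands in for a canonical serializer, and it is key-order sensitive for objects, so a real validator would normalize keys first:

```javascript
// O(n log n) uniqueItems check: serialize every item, sort the copies,
// then scan adjacent entries for duplicates.
function hasOnlyUniqueItems(items) {
  const keys = items.map((item) => JSON.stringify(item)); // O(n) copy
  keys.sort();                                            // O(n log n)
  for (let i = 1; i < keys.length; i++) {
    if (keys[i] === keys[i - 1]) return false;            // duplicates are now adjacent
  }
  return true;
}
```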

Schema:

{
  "type": "string",
  "pattern": "a{100}{100}{100}"
}

Input:

"a"

This forces the engine to build a finite automaton with a million nodes before it even starts matching.

A little more input is required to trigger exponential matching time:

{
  "type": "string",
  "pattern": "(a*)*c"
}

Input:

"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

That is 40 "a" characters. With the Node.js regex engine, matching takes:

> console.time("regex");"a".repeat(10).match(/^(a*)*c$/);console.timeEnd("regex")
regex: 0.345ms
undefined
> console.time("regex");"a".repeat(20).match(/^(a*)*c$/);console.timeEnd("regex")
regex: 23.186ms
undefined
> console.time("regex");"a".repeat(30).match(/^(a*)*c$/);console.timeEnd("regex")
regex: 19251.854ms
undefined

With the input above I stopped waiting after a few minutes.
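For this particular pattern there happens to be a safe rewrite: (a*)* denotes exactly the same language as a*, so an equivalent pattern without the nested quantifier fails in linear time. A sketch, not a general defence:

```javascript
const input = "a".repeat(40);

// /^(a*)*c$/ backtracks exponentially on this input (see the timings above),
// so we never actually run it here; the nested quantifier is the culprit.
// (a*)* matches the same strings as a*, so this rewrite gives the same
// answer immediately:
const rewritten = /^a*c$/;
const matches = rewritten.test(input); // false, returned in microseconds
```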

johandorland commented 6 years ago

Even though the spec says that implementations SHOULD use ECMA 262, I read this only as a recommendation in the end. In practice implementations will gravitate towards the regex flavour their language provides, instead of going out of their way to find an ECMA 262 regex library.

As @Julian noted, there are many other attack surfaces if you have to worry about untrusted schemas as input. What springs to mind is all the shenanigans you can do with $ref, and I don't think your standard JSON Schema implementation has the appropriate cycle detection to deal with that.
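A minimal form of the cycle detection mentioned here might track the chain of references on the current resolution path. Everything below — resolveRef, the flat schemas map — is illustrative, not the API of any real implementation:

```javascript
// Follow a chain of $ref pointers, refusing to revisit a reference that is
// already on the current path (which would mean an infinite loop).
function resolveRef(ref, schemas, seen = new Set()) {
  if (seen.has(ref)) {
    throw new Error(`$ref cycle detected at ${ref}`);
  }
  seen.add(ref);
  const schema = schemas[ref];
  if (schema && typeof schema.$ref === "string") {
    return resolveRef(schema.$ref, schemas, seen); // keep following the chain
  }
  return schema; // a concrete (non-$ref) schema, or undefined if unknown
}
```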

Having said that, I wouldn't mind changing the recommended regex flavour from ECMA 262 to something like RE2. Although I generally wouldn't recommend running untrusted schemas without taking the proper precautions, regex engines like RE2 at least make it harder to shoot yourself in the foot.

Relequestual commented 6 years ago

It is generally against the philosophy of JSON Schema to restrict Regular Expressions to the degree that you suggest for all uses of Regular Expressions. - @handrews

I agree with this statement. We've seen the mess that is created when a specification takes a subset of another specification and makes a superset of it for its own purposes... looks at OpenAPI/Swagger, and both sides have acknowledged this and tried to resolve it.

If you want to create a JSON Schema regex pre-checker, or any other pre-validation validator, then go right ahead! =] I'm keen to follow other specifications WITHOUT modifications. We shouldn't fail to learn from previous similar issues.

BTW... RFC 2119 on IETF key words... SHOULD:

This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

handrews commented 5 years ago

I think this has been covered pretty thoroughly. "I don't want my implementation vulnerable to some of the worst regex abuses" is a reason to disregard SHOULD.

I'd welcome any PR on improving the security considerations section, but I don't think there's anything specific to do here. If you want to restrict regexes in your implementation you are free to do so.