voxpupuli / json-schema

Ruby JSON Schema Validator
MIT License

Issue with escape characters in pattern fields #421

Open AdrianTP opened 5 years ago

AdrianTP commented 5 years ago

I had issues with escape characters behaving oddly within patterns.

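Basically, both Ruby's JSON parser and Ruby's Regexp class are involved. The JSON parser processes backslash escapes in a string value, much as Ruby does for a double-quoted string, so one level of escapes is consumed before the Regexp parser ever sees the pattern. On top of that, the single-quoted Ruby literals used in these examples collapse each pair of backslashes into one, which is why three and four backslashes give the same result. The upshot is that escape sequences had to be triple- (or quadruple-) escaped: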

irb > Regexp.new(JSON.parse('{"a":"\A\d\.\d\z"}')['a'])
 => /Ad.dz/
irb > Regexp.new(JSON.parse('{"a":"\\A\\d\\.\\d\\z"}')['a'])
 => /Ad.dz/
irb > Regexp.new(JSON.parse('{"a":"\\\A\\\d\\\.\\\d\\\z"}')['a'])
 => /\A\d\.\d\z/
irb > Regexp.new(JSON.parse('{"a":"\\\\A\\\\d\\\\.\\\\d\\\\z"}')['a'])
 => /\A\d\.\d\z/
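To make the layers visible, here is a minimal sketch that prints what each stage receives for the four-backslash form. The variable names are only for illustration. It deliberately avoids the one- and two-backslash forms: \A and \d are not escapes that JSON itself defines, so the /Ad.dz/ results above depend on the parser tolerating them, and other parsers or versions may behave differently.

```ruby
require 'json'

# A single-quoted Ruby literal only collapses \\ (and \'), so the four
# backslashes typed before each letter become two in the JSON text.
json_text = '{"a":"\\\\A\\\\d\\\\.\\\\d\\\\z"}'
puts json_text          # the JSON text the parser actually sees

# JSON unescaping turns each \\ into a single backslash.
pattern = JSON.parse(json_text)['a']
puts pattern            # the string handed to Regexp.new

# The Regexp parser finally interprets \A, \d, \. and \z.
regexp = Regexp.new(pattern)
puts regexp.inspect
puts regexp.match?('1.5')     # true
puts regexp.match?('Ad.dz')   # false
```

Printing the intermediate strings with puts (rather than inspect) shows the real backslash count at each step, which is usually where the confusion starts.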

It is especially annoying when I need to match a literal backslash within a regex, which actually requires seven or eight backslashes (not six!):

irb > Regexp.new(JSON.parse('{"a":"\\\\\\\"}')['a'])
 => /\\/
irb > Regexp.new(JSON.parse('{"a":"\\\\\\\\"}')['a'])
 => /\\/
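The counts above can be checked with a short sketch that tallies the backslashes surviving each layer. The variable names are hypothetical. It also shows why six does not work here: the third backslash left in the JSON text escapes the closing quote, so the JSON string never terminates.

```ruby
require 'json'

eight = '{"a":"\\\\\\\\"}'   # eight backslashes typed
seven = '{"a":"\\\\\\\"}'    # seven typed; the trailing \" is kept as-is
puts seven == eight          # true: both literals hold four backslashes

json_backslashes = eight.chars.count('\\')     # left in the JSON text
parsed = JSON.parse(eight)['a']
parsed_backslashes = parsed.chars.count('\\')  # left after JSON unescaping
puts json_backslashes        # 4
puts parsed_backslashes      # 2

regexp = Regexp.new(parsed)
puts regexp.match?('\\')     # true: matches one literal backslash

# Six typed backslashes leave \\ followed by \" in the JSON text.
six_failed = begin
  JSON.parse('{"a":"\\\\\\"}')
  false
rescue JSON::ParserError
  true
end
puts six_failed              # true
```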

I found that using the Unicode escape sequence for a backslash (\u005C) kept things somewhat clearer (and less prone to error):

irb > Regexp.new(JSON.parse('{"a":"\u005CA\u005Cd\u005C.\u005Cd\u005Cz"}')['a'])
 => /\A\d\.\d\z/
irb > Regexp.new(JSON.parse('{"a":"\u005C\u005C"}')['a'])
 => /\\/

Obviously, using POSIX bracket expressions can help a bit, too, where applicable:

irb > Regexp.new(JSON.parse('{"a":"\u005CA[[:digit:]]\u005C.[[:digit:]]\u005Cz"}')['a'])
 => /\A[[:digit:]]\.[[:digit:]]\z/
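One way to double-check that the Unicode-escaped forms are equivalent to the backslash forms is to let the JSON library produce the escaping and compare the parsed results. This is only a sketch, not something json-schema does itself: it assumes the pattern can be written as a Ruby Regexp literal first, and the variable names are made up for the example.

```ruby
require 'json'

# Start from the regex you actually want and let JSON.generate escape it.
source = /\A\d\.\d\z/.source
json_text = JSON.generate('a' => source)
puts json_text     # {"a":"\\A\\d\\.\\d\\z"}

# The hand-written Unicode form from above should parse to the same value.
unicode_text = '{"a":"\u005CA\u005Cd\u005C.\u005Cd\u005Cz"}'
same = JSON.parse(json_text) == JSON.parse(unicode_text)
puts same          # true

# The POSIX bracket variant matches the same sample input.
posix_text = '{"a":"\u005CA[[:digit:]]\u005C.[[:digit:]]\u005Cz"}'
posix = Regexp.new(JSON.parse(posix_text)['a'])
puts posix.match?('1.5')   # true
```

Generating the JSON text this way sidesteps counting backslashes by hand, at the cost of keeping the pattern in Ruby rather than directly in the schema file.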

I thought I would share this to give people who haven't run into it yet a little push in the "right" direction. It seems super obvious in retrospect, but it caused a lot of headaches on my team, especially among those who are new to regex or who did not understand how the pieces relate: the "pattern" value in the schema file, the Ruby JSON parser, the way Ruby (like other languages) handles strings (double-quoted vs single-quoted, etc.), and the Ruby Regexp parser. Other JSON parsers (such as Oj) also have weird behaviour with escape sequences, and it differs from that of Ruby's built-in JSON parser. The Unicode "trick" should work with those parsers as well.

If anyone has any better suggestions, I would love to hear them. As useful as the Unicode workaround is, it is still a bit verbose, and it is troublesome for the (many) programmers who never think about (or simply don't know much about) charsets, encoding, or regex.