jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Regular expression alternation (|) used with quantifier (* or +) returns inconsistent results when first alternative is able to match an empty string #3117

Closed elmaimbo closed 1 month ago

elmaimbo commented 1 month ago

Describe the bug
When alternation is used in combination with a * or + quantifier - i.e. (XXX|YYY)* or (XXX|YYY)+ - you get unexpected results if the first alternative (XXX) can match an empty string.

To Reproduce
The following baseline test produces the output ab:

$ jq -nr '"ab" | match("(?:a|b)*") | .string'
ab

To observe the issue, make the first alternative optional:

$ jq -nr '"ab" | match("(?:a?|b)*") | .string'
a

Interestingly, swapping the two alternatives, so that the one that can match an empty string is no longer first, makes the problem go away:

$ jq -nr '"ab" | match("(?:b|a?)*") | .string'
ab

Expected behavior
All three tests above should produce the same output.
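
The three cases can also be run in one pass. This is only a sketch: check_alternation is a made-up helper name, and it assumes a jq binary on the PATH. It prints each pattern next to the string that the local build matches, so any disagreement between the rows is visible at a glance.

```shell
#!/bin/sh
# Hypothetical helper (not part of jq): run the three patterns from this
# report against the input "ab" and print pattern<TAB>matched string.
check_alternation() {
    for re in '(?:a|b)*' '(?:a?|b)*' '(?:b|a?)*'; do
        # --arg passes the pattern to jq as a string variable ($re),
        # which avoids nesting quotes inside the jq program.
        got=$(jq -nr --arg re "$re" '"ab" | match($re) | .string')
        printf '%s\t%s\n' "$re" "$got"
    done
}

check_alternation
```

If the rows differ, the build shows the behavior described in this report.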

Environment (please complete the following information):

pkoppstein commented 1 month ago

gojq and jaq both produce the correct result:

$ jq -nR '"ab" | match("(a?|b)*").string'
"a"
$ gojq -nR '"ab" | match("(a?|b)*").string'
"ab"
$ jaq -nR '"ab" | match("(a?|b)*").string'
"ab"

emanuele6 commented 1 month ago

> gojq and jaq both produce the correct result:

jq and gojq don't use the same kind of regular expressions though. There are some jq regular expressions that won't work in gojq.


I don't think jq has any control over this issue anyway; match just calls into Oniguruma and constructs an object based on the result.
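
For reference, the object in question can be inspected by dropping the .string projection. A minimal sketch, assuming jq is on the PATH; show_match is a hypothetical helper name. Per the jq manual, the object carries the fields offset, length, string and captures.

```shell
#!/bin/sh
# Hypothetical helper (not part of jq): print the full match object for
# input string $1 and regular expression $2, one compact JSON object per line.
show_match() {
    jq -nc --arg s "$1" --arg re "$2" '$s | match($re)'
}

# Baseline pattern from the report; the whole input "ab" is matched.
show_match ab '(?:a|b)*'
```

Running the same helper with '(?:a?|b)*' shows where the shorter match ends on a given build.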

pkoppstein commented 1 month ago

Apparently the observed behavior is expected of PCRE. (*)

Since jq deliberately uses the PCRE flavor of Oniguruma, it would therefore appear that the only question to emerge from this issue is whether in future jq should choose a different flavor.


(*) See https://regex101.com/