jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Regular expression ^ never matches after newline #2562

Open annettejanewilson opened 1 year ago

annettejanewilson commented 1 year ago

Describe the bug

jq regular expressions are always handled as if "single line" mode is enabled. The "single line" flag has no effect.

To Reproduce

jq -n '"\nx"|[match("^x")]'

Expected output:

[
  {
    "offset": 1,
    "length": 1,
    "string": "x",
    "captures": []
  }
]

Actual output:

[]

Expected behavior

If the "s" flag and the "p" flag are not passed, then ^ should match at the start of all lines, not just the first, and $ should match at the end of all lines, not just the last.

Environment (please complete the following information):

Additional context

This appears to be because Oniguruma in PERL_NG syntax defaults to single line mode and must be passed a flag to negate it.

I will provide a PR.

pkoppstein commented 1 year ago

It might be helpful to begin by pointing out that all three of the major implementations of jq exhibit the same behavior, even though they use different RE engines. Some other regex processors behave in the same way too, e.g. awk:

$ jq -nM '"\nx"|[match("^x")]'
[]

$ gojq -nM '"\nx"|[match("^x")]'
[]

$ jaq -n '"\nx"|[match("^x" )]'
[]

$ awk 'BEGIN { if ("\nx" ~ /^x/) {print "match"}}'
$

Furthermore, the three implementations of jq support the (?m) convention for achieving the alternative interpretation of "^":

$ jq -nM '"\nx"|test("(?m)^x")'
true

$ gojq -nM '"\nx"|test("(?m)^x")'
true

$ jaq -n '"\nx"|test("(?m)^x")'
true

To understand what's going on here in the case of the C implementation of jq, the Oniguruma manual can be consulted. Since jq uses the Perl_NG flavor of regex, the ONIG_SYNTAX_PERL section of Appendix A-1 applies (see https://github.com/kkos/oniguruma/blob/master/doc/RE):

     (?s): dot (.) also matches newline
     (?m): ^ matches after newline, $ matches before newline

That is, these modifiers are required to achieve the alternative behavior.
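
As a minimal sketch of the two inline modifiers quoted above, the commands below check each one against the default behaviour. The variable names are illustrative only, and the results noted in the comments assume the C implementation of jq built with Oniguruma, as discussed in this thread.

```shell
# Default: ^ anchors only at the start of the string, so this yields false.
anchor_default=$(jq -n '"\nx" | test("^x")')
# (?m): ^ also matches immediately after a newline, so this yields true.
anchor_multiline=$(jq -n '"\nx" | test("(?m)^x")')
# Default: dot does not match a newline, so this yields false.
dot_default=$(jq -n '"a\nb" | test("a.b")')
# (?s): dot also matches a newline, so this yields true.
dot_dotall=$(jq -n '"a\nb" | test("(?s)a.b")')

echo "^x=$anchor_default (?m)^x=$anchor_multiline a.b=$dot_default (?s)a.b=$dot_dotall"
```

The two modifiers are independent, so a pattern that needs both behaviours can combine them as (?ms).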


Postscript: All 8 of the regex engines available at https://regex101.com/ require the "m" option in accordance with the above.


@itchyny - I think it's safe both to remove the bug label, and to close the issue.

itchyny commented 1 year ago

Thanks for the investigation.

annettejanewilson commented 1 year ago

At the very least, this is a documentation issue. The "s" flag doesn't do anything! It's misleading to document it without noting that it's useless. It's also quite a tripping hazard that the "s" and "m" flags (as passed as the second argument to regex functions) have swapped meanings from the options passed inside the regex:

Default behaviour: ^ matches only at start of string, . does not match newlines
s flag: No effect
m flag: Causes . to match newlines
(?s) option: Causes . to match newlines
(?m) option: Causes ^ to match after newlines

annettejanewilson commented 1 year ago

I think the source of the confusion is that Oniguruma uses the terms differently from Perl and the other regex engines I'm familiar with. We're telling Oniguruma to interpret the regex (including options in the regex) using Perl's meanings (single-line-mode means dot-matches-all, multi-line-mode means anchors-match-at-newlines), but we're using Oniguruma's interpretations (multi-line-mode is dot-matches-all and single-line-mode is anchors-DON'T-match-at-newlines) in the flags and documentation. I think it's a bit of a compatibility/consistency nightmare. Ideally only one interpretation would be exposed to the user, but if you want to preserve compatibility it's a bit late for that.
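
One way to see the mismatch described here is to pass "s" as the FLAGS argument and compare the result with passing no flags at all. This is a sketch rather than a definitive test: the variable names are made up for illustration, and it assumes the C implementation of jq as it behaved at the time of this issue, where the "s" flag is reported to change nothing.

```shell
# Anchor check: with or without the "s" flag, ^ does not match after the newline.
anchor_noflag=$(jq -n '"\nx" | test("^x")')
anchor_sflag=$(jq -n '"\nx" | test("^x"; "s")')
# Dot check: the "s" flag is not Perl's dot-matches-all, so . still stops at the newline.
dot_noflag=$(jq -n '"a\nb" | test("a.b")')
dot_sflag=$(jq -n '"a\nb" | test("a.b"; "s")')

echo "anchor: no flag=$anchor_noflag, s flag=$anchor_sflag"
echo "dot:    no flag=$dot_noflag, s flag=$dot_sflag"
```

If each pair prints the same value, the "s" flag has had no observable effect on either ^ or dot for these inputs.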

pkoppstein commented 1 year ago

@annettejanewilson - I think the problem is just that the jq documentation is wrong, apparently because of a failure to distinguish properly between the single-letter options allowed in "extended groups" and the single letters allowed in FLAGS. Anyway, I'm working on it. Thanks for your help.