Closed silverneko closed 2 years ago
I've uploaded a patch that teaches the lexer how to parse expressions like /[/]/.
I think this test case should be added to the regression tests as well, say t.reg?
This is a duplicate of #81 which was not merged because it appears to violate POSIX. I have a more involved patch that checks whether POSIXLY_CORRECT is set and also handles escaped square brackets and other edge cases in my "reslash" branch. That change was committed to OpenBSD some time ago.
I think that's debatable. The POSIX text says:
Using a <slash> character within an ERE requires the escaping shown in the following table. (the table shows / escaped as \/)
but it also said:
CONSEQUENCES OF ERRORS
If any file operand is specified and the named file cannot be accessed, awk shall write a diagnostic message to standard error and terminate without any further action. If the program specified by either the program operand or a progfile operand is not a valid awk program (as specified in the EXTENDED DESCRIPTION section), the behavior is undefined.
This might not be the intended interpretation, but I read this text as:
/[\/]/ is an ERE and should be recognized by awk per the POSIX standard.
/[/]/ is not an ERE as defined by the POSIX standard, but how this ERE-ish thing is interpreted is undefined behavior per POSIX, so awk implementations can interpret this string however they want.
OTOH, what's one-true-awk's policy when there is a discrepancy between the book and POSIX? After all, the first sentence of the README goes,
This is the version of awk described in The AWK Programming Language
so I think following POSIX isn't a hard requirement, and the book spec should be preferred as a tie breaker?
I think your change (different behavior depending on whether the POSIXLY_CORRECT environment variable is set) is also a good tie breaker.
thanks for the discussion. i'm less concerned with the strict readings of POSIX or the book. for me the main issue is that the regex "/" lexer is broken and inconsistent with RE in strings. i have not seen todd's fixes, will take a look.
hmm I think I may leave this alone and live with the inconsistency of having [/] work inside strings.
this has been resolved in favour of consistency with other awk implementations. code courtesy of arnold robbins.
Hi, I've found some inconsistencies introduced by the previous change.
'/[][]/' parses but '/[]a[]/' does not:
$ ./a.out '/[]a[]/'
./a.out: non-terminated regular expression []a[]/... at source line 1
context is
>>> /[]a[]/ <<<
/[[a]/ parses, but /[a[]/ does not:
$ ./a.out '/[a[]/'
./a.out: non-terminated regular expression [a[]/... at source line 1
context is
>>> /[a[]/ <<<
/[/[:alpha:]]/ and /[/][[:alpha:]]/ parse, but /[[:alpha:]/]/ and /[[:alpha:]][/]/ don't:
$ ./a.out '/[[:alpha:]/]/'
./a.out: nonterminated character class [[:alpha:]
source line number 1
context is
>>> /[[:alpha:]/ <<<
$ ./a.out '/[[:alpha:]][/]/'
./a.out: nonterminated character class [[:alpha:]][
source line number 1
context is
>>> /[[:alpha:]][/ <<<
These also fail:
$ ./a.out '/][[]/'
./a.out: non-terminated regular expression ][[]/... at source line 1
context is
>>> /][[]/ <<<
$ ./a.out '/][/]/'
./a.out: nonterminated character class ][
source line number 1
context is
>>> /][/ <<<
The latest change, which parses combinations of ']' and '[', is error-prone, as shown by the corner cases above. I think we can use a simpler approach here: instead of doing complex bracket matching in the lexer, the lexer should just match the outermost '[' and ']' of a bracket expression. The more complex matching of "[=", "[." and "[:" can be ignored in the lexer, since the regex parser will check it later.
Ah yikes, my previous approach had bugs. It turns out I still need to keep track of "[=", "[." and "[:" in the lexer, or else /[.]/ would fail to parse.
Re-uploaded a new patch, and verified it by running the test suite.
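For readers following along, here is a rough Python sketch of the scanning rule discussed above: find the '/' that ends a regex literal while treating a leading ']' as a literal and skipping over "[:", "[." and "[=" groups inside a bracket expression. This is not the uploaded patch and not one-true-awk's code; the function names find_regex_end and _skip_bracket are made up for illustration, and the error strings only imitate the diagnostics quoted in this thread.

```python
def _skip_bracket(src: str, i: int) -> int:
    """Given src[i] == '[', return the index just past the closing ']'."""
    n = len(src)
    i += 1
    if i < n and src[i] == "^":      # optional negation
        i += 1
    if i < n and src[i] == "]":      # a leading ']' is an ordinary member
        i += 1
    while i < n and src[i] != "\n":
        c = src[i]
        if c == "\\":                # backslash escapes the next character
            i += 2
        elif c == "[" and i + 1 < n and src[i + 1] in ":.=":
            # [:class:], [.coll.] or [=equiv=]: jump past its closer
            closer = src[i + 1] + "]"
            end = src.find(closer, i + 2)
            if end == -1:
                raise ValueError("nonterminated character class")
            i = end + 2
        elif c == "]":               # end of the bracket expression
            return i + 1
        else:                        # any other char, including '/' and '['
            i += 1
    raise ValueError("nonterminated character class")


def find_regex_end(src: str, start: int = 1) -> int:
    """Return the index of the '/' that terminates the regex literal.

    `start` is the index of the first character after the opening '/'.
    """
    i, n = start, len(src)
    while i < n and src[i] != "\n":
        c = src[i]
        if c == "\\":                # escaped character, e.g. \/
            i += 2
        elif c == "[":
            i = _skip_bracket(src, i)
        elif c == "/":
            return i
        else:
            i += 1
    raise ValueError("non-terminated regular expression")


if __name__ == "__main__":
    for lit in ["/[/]/", "/[]a[]/", "/[a[]/", "/[[:alpha:]/]/", "/][/]/"]:
        print(lit, find_regex_end(lit) == len(lit) - 1)
```

With this rule, every literal reported as failing in the comments above is scanned to its final '/', while something like /[/ is still rejected. Whether the regex itself is valid is left to the regex parser.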
oh what a mess we have gotten ourselves into.
According to the book, page 29, section 2.1:
Thus /[/]/ should be grammatically equivalent to /\//, both of which match any occurrence of a slash character.
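As a quick sanity check of that equivalence, the two patterns can be compared in another regex engine. The sketch below uses Python's re module as a stand-in, which is an assumption on my part: it is not awk's ERE implementation, but it treats a slash inside a bracket expression and a backslash-escaped slash the same way for these two patterns. The sample string is arbitrary.

```python
import re

text = "usr/local/bin"

# Bracket-expression form and backslash-escaped form of the same pattern.
bracket_form = re.findall(r"[/]", text)
escaped_form = re.findall(r"\/", text)

# Both forms should find exactly the same slash characters.
print(bracket_form == escaped_form)
```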