awk cannot parse constant regex expression of slash character within a character class (/[/]/)

silverneko commented 2 years ago

$ make
$ ./a.out '/[/]/' 
./a.out: non-terminated regular expression [/... at source line 1
 context is
         >>> /[/ <<<
./a.out: nonterminated character class [/
 source line number 1

According to the book, page 29, section 2.1:

Inside a character class, all characters have their literal meaning, except for the quoting character \, ^ at the beginning, and - between two characters.

Thus /[/]/ should be grammatically equivalent to /\//; both match any occurrence of a slash character.
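The claimed equivalence can be illustrated outside awk. The following is only a sketch using Python's `re` module as a stand-in: Python regexes are not POSIX EREs, but for these two simple patterns they behave the same way. The sample strings and the helper name `match_spans` are made up for illustration.

```python
import re

# Two spellings of "match one slash": inside a bracket expression,
# and as a backslash-escaped literal.
bracket_form = re.compile(r"[/]")
escaped_form = re.compile(r"\/")

# Hypothetical sample inputs, chosen only for illustration.
samples = ["usr/local/bin", "no slash here", "/", "a//b"]


def match_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


for s in samples:
    # Both forms should report exactly the same match positions.
    print(repr(s), match_spans(bracket_form, s) == match_spans(escaped_form, s))
```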

silverneko commented 2 years ago

I've uploaded a patch that teaches the lexer how to parse expressions like /[/]/. I think this test case should be added to the regression tests as well, say as t.reg?

millert commented 2 years ago

This is a duplicate of #81 which was not merged because it appears to violate POSIX. I have a more involved patch that checks whether POSIXLY_CORRECT is set and also handles escaped square brackets and other edge cases in my "reslash" branch. That change was committed to OpenBSD some time ago.

silverneko commented 2 years ago

I think that's debatable. The POSIX text says:

Using a <slash> character within an ERE requires the escaping shown in the following table. (the table shows / is escaped as \/)

but it also says:

CONSEQUENCES OF ERRORS If any file operand is specified and the named file cannot be accessed, awk shall write a diagnostic message to standard error and terminate without any further action. If the program specified by either the program operand or a progfile operand is not a valid awk program (as specified in the EXTENDED DESCRIPTION section), the behavior is undefined.

This might not be the intended interpretation, but I read these passages as:

/[\/]/ is an ERE and should be recognized by awk per the POSIX standard.
/[/]/ is not an ERE as defined by the POSIX standard, but how this ERE-ish thing is interpreted is undefined behavior (UB) under POSIX, so awk implementations can interpret this string however they want.

OTOH, what's one-true-awk's policy when there is a discrepancy between the book and POSIX? After all, the first sentence of the README reads:

This is the version of awk described in The AWK Programming Language

so I think following POSIX isn't a hard requirement, and the book spec should be preferred as a tie breaker?

I think your change (different behavior depending on the POSIXLY_CORRECT environment variable) is also a good tie breaker.
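To make the two readings concrete, here is a rough sketch of a scanner that looks for the end of a regex literal and switches behavior on POSIXLY_CORRECT. It is not the code from the "reslash" branch. It assumes that in strict mode an unescaped slash always ends the literal, even inside brackets. The function name `regex_end` and its `environ` parameter are hypothetical, and the bracket handling is deliberately simplified (no leading `]`, no `[:class:]` tracking).

```python
import os


def regex_end(body, environ=None):
    """Return the index of the '/' that ends a regex literal, or -1.

    body is the source text after the opening '/'.
    Assumption: when POSIXLY_CORRECT is set, an unescaped '/' ends the
    literal even inside a bracket expression; otherwise a '/' inside
    [...] is taken literally.
    """
    env = os.environ if environ is None else environ
    strict = "POSIXLY_CORRECT" in env
    in_bracket = False
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\":
            i += 2              # skip the escaped character
            continue
        if c == "/" and (strict or not in_bracket):
            return i
        if c == "[" and not in_bracket:
            in_bracket = True
        elif c == "]" and in_bracket:
            in_bracket = False
        i += 1
    return -1


# Book-style reading: the slash inside [...] is literal.
print(regex_end("[/]/", environ={}))
# Strict reading: the first unescaped slash ends the literal.
print(regex_end("[/]/", environ={"POSIXLY_CORRECT": "1"}))
```

Under the strict reading the lexer stops inside the brackets, which matches the "nonterminated character class" diagnostic in the original report. The escaped form /[\/]/ scans the same way in both modes.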

plan9 commented 2 years ago

thanks for the discussion. i'm less concerned with the strict readings of POSIX or the book. for me the main issue is that the regex "/" lexer is broken and inconsistent with RE in strings. i have not seen todd's fixes, will take a look.

plan9 commented 2 years ago

hmm I think I may leave this alone and live with the inconsistency of having [/] work inside strings.

plan9 commented 2 years ago

this has been resolved in favour of consistency with other awk implementations. code courtesy of arnold robbins.

silverneko commented 2 years ago

Hi, I've found some inconsistencies introduced by the previous change:

'/[][]/' parses but '/[]a[]/' does not

$ ./a.out '/[]a[]/'
./a.out: non-terminated regular expression []a[]/... at source line 1
context is
     >>> /[]a[]/ <<<

/[[a]/ parses, but /[a[]/ does not

$ ./a.out '/[a[]/'
./a.out: non-terminated regular expression [a[]/... at source line 1
context is
     >>> /[a[]/ <<<

/[/[:alpha:]]/ and /[/][[:alpha:]]/ parse, but /[[:alpha:]/]/ and /[[:alpha:]][/]/ don't

$ ./a.out '/[[:alpha:]/]/'
./a.out: nonterminated character class [[:alpha:]
source line number 1
context is
     >>> /[[:alpha:]/ <<<
$ ./a.out '/[[:alpha:]][/]/'
./a.out: nonterminated character class [[:alpha:]][
source line number 1
context is
     >>> /[[:alpha:]][/ <<<

These fail

$ ./a.out '/][[]/'
./a.out: non-terminated regular expression ][[]/... at source line 1
context is
     >>> /][[]/ <<<
$ ./a.out '/][/]/'
./a.out: nonterminated character class ][
source line number 1
context is
     >>> /][/ <<<

The latest change that parses combinations of ']' and '[' is error-prone, as shown by the corner cases above. I think we can use a simpler approach here: instead of doing complex bracket matching in the lexer, the lexer should just match the outermost '[' and ']' of a bracket expression, and the complex bracket matching of "[=", "[." and "[:" can be ignored, since the regex parser will check it later.

silverneko commented 2 years ago

Ah yikes, my previous approach had bugs. It turns out I still need to keep track of "[=", "[." and "[:" in the lexer, otherwise /[.]/ would fail to parse.

Re-uploaded a new patch, verified by running the test suite.
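For readers following along, the general idea (find the outermost brackets while still stepping over "[:", "[." and "[=" constructs) can be sketched as below. This is a Python illustration, not the uploaded C patch. The names `find_regex_end` and `skip_bracket` are hypothetical, and treating an unclosed "[:" as a literal '[' is an assumption made for this sketch.

```python
def skip_bracket(body, i):
    """body[i] is '['; return the index just past the matching ']'."""
    n = len(body)
    i += 1
    if i < n and body[i] == "^":      # optional negation
        i += 1
    if i < n and body[i] == "]":      # a leading ']' is a literal
        i += 1
    while i < n:
        c = body[i]
        if c == "]":
            return i + 1
        if c == "\\":
            i += 2                    # escaped character
        elif c == "[" and i + 1 < n and body[i + 1] in ":.=":
            # Step over [:class:], [.coll.] or [=equiv=] as one unit.
            close = body.find(body[i + 1] + "]", i + 2)
            # Assumption: with no closer, treat '[' as a literal.
            i = close + 2 if close != -1 else i + 1
        else:
            i += 1                    # literal, including '/' and '['
    raise ValueError("nonterminated character class")


def find_regex_end(body):
    """Return the index of the '/' ending the literal (body follows the opening '/')."""
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        if c == "\\":
            i += 2
        elif c == "/":
            return i
        elif c == "[":
            i = skip_bracket(body, i)
        else:
            i += 1                    # includes a stray ']' outside brackets
    raise ValueError("non-terminated regular expression")


# The corner cases reported in this thread, without the opening slash.
for body in ["[/]/", "[][]/", "[]a[]/", "[[a]/", "[a[]/", "[/[:alpha:]]/",
             "[[:alpha:]/]/", "[[:alpha:]][/]/", "][[]/", "][/]/", "[.]/"]:
    print(body, find_regex_end(body) == len(body) - 1)
```

In this sketch every case listed above ends at its final slash, so the whole literal would be handed to the regex parser, which remains responsible for validating the class names themselves.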

plan9 commented 2 years ago

oh what a mess we have gotten ourselves into.

onetrueawk / awk

awk cannot parse constant regex expression of slash character within a character class (/[/]/) #135