onetrueawk / awk

One true awk

Ancient awk regexp compatibility bug #161

Closed arnoldrobbins closed 1 year ago

arnoldrobbins commented 2 years ago

Using the code from master as of today, I found the following bug. Given:

BEGIN {
    print match("abc-def", /[qrs---tuv]/)
}

The One True Awk prints a result of 0, whereas gawk and mawk print 4. Ancient awks (and I think it's even documented in the awk book) allowed a "range" of minus through minus to mean a real actual minus sign. The current code doesn't support this anymore.
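To make the expected result concrete, here is a minimal C sketch of the reading described above, in which a run of three hyphens inside a bracket expression stands for the range minus through minus. The functions `in_class_legacy` and `match_class` are hypothetical names invented for this illustration. They are not taken from gawk, mawk, or the One True Awk, and they ignore escapes and named classes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: is ch matched by the bracket-expression body `spec`
 * (the text between '[' and ']'), reading "---" as the one-character
 * range '-' through '-'? Escapes and named classes are not handled.
 */
int in_class_legacy(const char *spec, int ch)
{
    const unsigned char *p = (const unsigned char *)spec;

    while (*p != '\0') {
        int lo, hi;

        if (strncmp((const char *)p, "---", 3) == 0) {
            lo = hi = '-';              /* minus through minus */
            p += 3;
        } else if (p[1] == '-' && p[2] != '\0' && p[2] != '-') {
            lo = p[0];                  /* ordinary range such as a-z */
            hi = p[2];
            p += 3;
        } else {
            lo = hi = *p++;             /* single character */
        }
        if (lo <= ch && ch <= hi)
            return 1;
    }
    return 0;
}

/* 1-based index of the first character of subject in the class, or 0. */
size_t match_class(const char *subject, const char *spec)
{
    assert(subject != NULL && spec != NULL);
    for (size_t i = 0; subject[i] != '\0'; i++)
        if (in_class_legacy(spec, (unsigned char)subject[i]))
            return i + 1;
    return 0;
}
```

Under this reading, `match_class("abc-def", "qrs---tuv")` returns 4, the position of the minus sign, which is the value gawk and mawk are reported to print.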

plan9 commented 2 years ago

interesting find. i now think this must be a historic implementation wart. thanks

arnoldrobbins commented 1 year ago

Harumph. It looks like plain matching of --- works:

$ echo xxx-y | ./a.out '/[a---q]/'
xxx-y

So maybe it's just an issue with the match() function?

mpinjr commented 1 year ago

Hi arnold, plan9:

(I completely changed the content of this post before anyone responded but quite a few hours after initially posting it. Hopefully I didn't cause any confusion or inconvenience.)

After looking at the code, I think I understand what's happening. cclenter in b.c does not understand the triple-minus idiom. When it detects an invalid range, where the end point precedes its starting point, it backs up and drops the range.

In the initial report, [qrs---tuv] becomes [qr-tuv] (invalid range s-- dropped). In the other example, [a---q] becomes [-q] (invalid range a-- dropped).
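The behavior described above can be sketched in C as follows. This is a simplified model written for this discussion, not the actual `cclenter` code from b.c: the names `expand_class` and `class_matches` are hypothetical, escapes are ignored, and the only assumption built in is the one stated above, that a range whose end point precedes its start is dropped together with its start character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the real cclenter(): expand the body of a
 * bracket expression into the list of characters it matches.
 * Returns the number of characters written to out, excluding the NUL.
 */
size_t expand_class(const char *spec, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)spec;
    size_t n = 0;
    int c;

    assert(spec != NULL && out != NULL && outsz > 0);
    while ((c = *p++) != '\0') {
        if (c == '-' && n > 0 && *p != '\0') {
            int lo = (unsigned char)out[n - 1];   /* start already stored */
            int hi = *p++;                        /* end point of the range */

            if (lo > hi) {
                n--;            /* reversed range: back up, drop its start */
                continue;
            }
            while (lo < hi && n + 1 < outsz)
                out[n++] = (char)++lo;            /* fill in lo+1 .. hi */
            continue;
        }
        if (n + 1 < outsz)
            out[n++] = (char)c;                   /* literal character */
    }
    out[n] = '\0';
    return n;
}

/* Returns 1 if ch is in the expanded class for spec, else 0. */
int class_matches(const char *spec, int ch)
{
    char buf[512];

    expand_class(spec, buf, sizeof buf);
    return ch != '\0' && strchr(buf, ch) != NULL;
}
```

With this model, `qrs---tuv` expands to `qrstuv`: the `s--` range is dropped, and the remaining hyphen then forms the range `r-t`, so the minus sign is never added. `a---q` expands to `-q`, because after `a--` is dropped the remaining hyphen is the first character of the class and is taken literally. That is consistent with the second example matching while the first does not.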

Take care, Miguel

plan9 commented 1 year ago

hi miguel, apologies for the late response. this is correct; I tested and found the same thing weeks ago. I'm not clear on a clean fix at the moment.

plan9 commented 1 year ago

triple minus should not be called an idiom. i think on its own, it's a twisted but legit construct that works with most [all?] regular expression engines [including one I built decades ago], but in combination with other characters in the range, it may or may not work, depending on the engine. not many implementors will go through the kind of contortions that, e.g., the mawk regex engine goes through to handle this, nor should they.

arnoldrobbins commented 1 year ago

Let's close this issue since it's not going to change.