onetrueawk / awk

One true awk

syntax error in regular expression $^ #145

Closed kevinoid closed 2 months ago

kevinoid commented 2 years ago

Using the regular expression $^ as a pattern, for example by running awk /$^/, causes the following error:

awk: syntax error in regular expression $^ at 
 source line number 1
 context is
     >>> /$^/ <<< 

I encountered this error when running the ident script as part of jscal-save. The regular expression is a bit unconventional, but it satisfies the description in The AWK Programming Language book (and POSIX ERE) as far as I can tell, so I suspect it may be a bug. Curious what you think.

I can confirm that the same error is produced when running commit 87b9493, so it's not a regression (at least within the git history). Support in other awk implementations is mixed: busybox and GNU awk support it, FreeBSD awk does not.

Thanks, Kevin

kevinoid commented 2 years ago

One more data point: mawk 1.3.4.20200120 accepts $^.

kevinoid commented 2 years ago

I also opened https://bugs.freebsd.org/263478 to get the FreeBSD developers' thoughts.

plan9 commented 2 years ago

thanks for the note. the difference in behaviour relates to the difference in regular expression engines. the one in onetrueawk is quite old and dates back to the 80s. in any case, as per the opengroup/ieee standard, it is valid, not a syntax error. more on this anon.

arnoldrobbins commented 2 years ago

I note that the regexp, while syntactically valid, is semantically nonsense. $ and ^ are always special in extended regular expressions, and this regexp has no meaning. Maybe the script trying to use it should be adjusted?

kevinoid commented 2 years ago

> I note that the regexp, while syntactically valid, is semantically nonsense.

Could you explain what you mean? Although it's not a style I would prefer, it asserts that a position in the string matches both the end and start of the string ($^), rather than asserting that it matches both the start and end of the string (^$). The order of zero-width assertions is not semantically significant as far as I am aware.

However, I do agree that it should be changed, if for no other reason than compatibility. I had opened Debian Bug 1010041, since I don't have permission to open issues in the upstream project.

arnoldrobbins commented 2 years ago

There is no formal way to say "matches a zero width string". The traditional regexp ^$ is "the beginning of the string immediately followed by the end of the string", which in practice means an empty string. But $^ means "the end of the string immediately followed by the beginning of the string", which doesn't make sense, as the end of the string is always after the beginning of the string. That is why I said that semantically the regexp is nonsense. Your statement that it matches both the end and the start of the string isn't the right way to think about it. (At least, as I understand regular expressions.)

Actually, there is a way to say "matches a zero width string": .{0}. But then you get into possible portability problems with versions of awk that don't accept interval expressions. ^$ is the simplest and most correct way to express what you're aiming at.
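For readers comparing the two spellings, here is a small sketch of how an unanchored .{0} differs from ^$. It uses Python's re module purely as a convenient stand-in. That is an assumption of this example: Python's dialect is not POSIX ERE, and awk implementations may behave differently. The sample strings are made up.

```python
import re

# Python's re is only a stand-in here; it is not POSIX ERE or any awk engine.
# ".{0}" matches a zero-width string at the first position of any input,
# while "^$" requires the whole input to be empty.
zero_width_in_abc = re.search(r".{0}", "abc")  # zero-width match inside "abc"
empty_only_in_abc = re.search(r"^$", "abc")    # no match: "abc" is not empty

zero_width_in_empty = re.search(r".{0}", "")   # matches the empty string
empty_only_in_empty = re.search(r"^$", "")     # also matches the empty string

print(zero_width_in_abc is not None, empty_only_in_abc is not None)
print(zero_width_in_empty is not None, empty_only_in_empty is not None)
```

In this dialect, the unanchored .{0} succeeds on every input, so it would select every record if used as a pattern. Only ^$ restricts the match to empty input.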

I hope this explanation makes sense and helps.

kevinoid commented 2 years ago

> But $^ means "the end of the string immediately followed by the beginning of the string"

Thanks for explaining. Would you contend that the Busybox, GNU, and mawk behavior is incorrect, because /$^/ should not match an empty string?

I would suggest considering ^ and $ as special cases of lookaround assertions (sometimes called zero-width assertions or zero-width patterns) in other regular expression dialects (e.g. Perl, ECMAScript) where ^ means (?<!.) (there is no preceding character at this position) and $ is (?!.) (there is no following character at this position). That is to say, they match the absence of a character, rather than the presence of a beginning or end marker. Whether lookbehind assertions appear before or after lookahead assertions doesn't affect the set of matched strings.
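The assertion reading can be checked in a dialect that has lookarounds. The sketch below uses Python's re module, which is an assumption of this example and not one of the awk engines under discussion. It compares both anchor orders with both lookaround orders over a few made-up inputs without newlines.

```python
import re

# Four spellings of "empty string" under the assertion reading of anchors.
# Python's re is used only as a stand-in; it is not POSIX ERE.
PATTERNS = {
    "anchors, usual order": r"^$",
    "anchors, reversed": r"$^",
    "lookarounds, behind first": r"(?<!.)(?!.)",
    "lookarounds, ahead first": r"(?!.)(?<!.)",
}

SAMPLES = ["", "a", "ab"]  # hypothetical inputs, chosen to contain no newlines


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# One row of booleans per pattern, one column per sample string.
results = {
    name: [matches(pattern, sample) for sample in SAMPLES]
    for name, pattern in PATTERNS.items()
}

for name, row in results.items():
    print(f"{name:28} {row}")
```

All four patterns accept only the empty string in this dialect, which is consistent with the claim that the order of zero-width assertions does not change the set of matched strings. It says nothing about what POSIX requires of awk.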

arnoldrobbins commented 2 years ago

You have just gone down a very dark, deep, and twisty rabbit hole. As in "Alice In Wonderland".

I am not going to follow you down there. :-)

Wearing my GNU AWK maintainer's hat, I will venture to say that, unless and until declared otherwise by POSIX, $^ may be syntactically valid but its behavior is undefined. Thus, the fact that various awks, including gawk, accept it and do something more or less reasonable is, in my not-so-humble opinion, due to luck; that's what they implement, but I doubt very much that it's by purposeful design.

I think I've reached the point of " 'nuff said".

kevinoid commented 2 years ago

Fair enough. Thanks again for your thoughts on it!

Another quick note: The distinction also has effects on whether repeated anchors (e.g. ^^ and $$) can match anything. I wouldn't expect those to be common (or intended) but might appear in machine-generated expressions or ones combined naively (e.g. ^(exp1|^exp2|exp3)$).
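A minimal sketch of the repeated-anchor cases, again using Python's re module as a stand-in (an assumption of this example, not a statement about any awk). Under the "followed by" reading, none of these patterns could match. Under the assertion reading, the extra anchors are redundant.

```python
import re

# (pattern, input) pairs with a redundant anchor; the inputs are made up.
CASES = [
    (r"^^a", "a"),                     # doubled start anchor
    (r"a$$", "a"),                     # doubled end anchor
    (r"^(exp1|^exp2|exp3)$", "exp2"),  # naively combined alternatives
]

# Map each pattern to whether Python's re finds a match in its input.
repeated = {
    pattern: re.search(pattern, text) is not None
    for pattern, text in CASES
}

for pattern, matched in repeated.items():
    print(f"{pattern!r:28} matched={matched}")
```

Python's re treats each extra anchor as one more constraint at the same position, so all three patterns match their inputs.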

plan9 commented 2 years ago

i'm keeping this issue open because while I may leave this unfixed for the moment due to other issues, the correct behaviour appears to be what is implemented in most regex libraries [and was implemented by henry spencer decades ago]. anchors are constraints: "there is a beginning of line here" and "there is an end of line here" and $^ is really legitimate when both overlap, as a synonym for ^$ as unusual as it looks.

arnoldrobbins commented 2 years ago

This is really fodder for the POSIX committee. Maybe they even have something to say about it already. But that is whose lead I would follow. My two cents.