Closed by kevinoid 2 months ago
I also opened https://bugs.freebsd.org/263478 to get the FreeBSD developers' thoughts.
thanks for the note. the difference in behaviour relates to the difference in regular expression engines. the one in onetrueawk is quite old, dating back to the 80s. in any case, as per the opengroup/ieee standard, it is valid, not a syntax error. more on this anon.
I note that the regexp, while syntactically valid, is semantically nonsense. `$` and `^` are always special in extended regular expressions, and this regexp has no meaning. Maybe the script trying to use it should be adjusted?
> I note that the regexp, while syntactically valid, is semantically nonsense.
Could you explain what you mean? Although it's not a style I would prefer, it asserts that a position in the string matches both the end and start of the string (`$^`), rather than asserting that it matches both the start and end of the string (`^$`). The order of zero-width assertions is not semantically significant as far as I am aware.
However, I do agree that it should be changed, if for no other reason than compatibility. I had opened Debian Bug 1010041, since I don't have permission to open issues in the upstream project.
There is no formal way to say "matches a zero width string". The traditional regexp `^$` is "the beginning of the string immediately followed by the end of the string", which in practice means an empty string. But `$^` means "the end of the string immediately followed by the beginning of the string", which doesn't make sense, as the end of the string is always after the beginning of the string. That is why I said that semantically the regexp is nonsense. Your statement that it matches both the end and the start of the string isn't the right way to think about it. (At least, as I understand regular expressions.)
Actually, there is a way to say "matches a zero width string": `.{0}`. But then you get into possible portability problems with versions of awk that don't accept interval expressions. `^$` is the simplest and most correct way to express what you're aiming at.
I hope this explanation makes sense and helps.
> But `$^` means "the end of the string immediately followed by the beginning of the string"
Thanks for explaining. Would you contend that the Busybox, GNU, and mawk behavior is incorrect, because `/$^/` should not match an empty string?
I would suggest considering `^` and `$` as special cases of lookaround assertions (sometimes called zero-width assertions or zero-width patterns) in other regular expression dialects (e.g. Perl, ECMAScript) where `^` means `(?<!.)` (there is no preceding character at this position) and `$` is `(?!.)` (there is no following character at this position). That is to say, they match the absence of a character, rather than the presence of a beginning or end marker. Whether lookbehind assertions appear before or after lookahead assertions doesn't affect the set of matched strings.
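A minimal sketch of that lookaround reading, using Python's `re` module as a stand-in for the Perl-style dialects mentioned above. This is an assumption for illustration only: POSIX ERE has no lookaround, so it models the idea rather than any awk's behaviour. `re.DOTALL` is set so that `.` also covers newlines, and the names `START`, `END`, `matches`, and the sample strings are hypothetical.

```python
import re

# Lookaround spellings of the two anchors, as described above.
START = r"(?<!.)"  # no character precedes this position
END = r"(?!.)"     # no character follows this position


def matches(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text, re.DOTALL) is not None


samples = ["", "a", "ab"]  # hypothetical inputs

for text in samples:
    start_then_end = matches(START + END, text)  # analogue of ^$
    end_then_start = matches(END + START, text)  # analogue of $^
    print(repr(text), start_then_end, end_then_start)
```

Under this model both orders accept only the empty string, since each assertion inspects the same position without consuming anything.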
You have just gone down a very dark, deep, and twisty rabbit hole. As in "Alice In Wonderland".
I am not going to follow you down there. :-)
Wearing my GNU AWK maintainer's hat, I will venture to say that, unless and until declared otherwise by POSIX, `$^` may be syntactically valid but its behavior is undefined. Thus, the fact that various awks, including gawk, accept it and do something more or less reasonable is, in my not-so-humble opinion, due to luck; that's what they implement, but I doubt very much that it's by purposeful design.
I think I've reached the point of " 'nuff said".
Fair enough. Thanks again for your thoughts on it!
Another quick note: The distinction also has effects on whether repeated anchors (e.g. `^^` and `$$`) can match anything. I wouldn't expect those to be common (or intended), but they might appear in machine-generated expressions or ones combined naively (e.g. `^(exp1|^exp2|exp3)$`).
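For illustration, here is how one engine that treats anchors as constraints handles these cases. The sketch uses Python's `re` module, which is an assumption standing in for such engines and says nothing about how any particular awk behaves. `exp1`, `exp2`, and `exp3` are the placeholder names from the example above, used here as literal strings; `repeated` and `combined` are hypothetical names.

```python
import re

# Repeated anchors: each ^ or $ re-checks the same position.
repeated = re.compile(r"^^a$$")

# Naively combined alternatives, one of which carries its own anchor.
combined = re.compile(r"^(exp1|^exp2|exp3)$")

print(bool(repeated.search("a")))      # repeated anchors still match "a"
print(bool(combined.search("exp2")))   # the inner ^ is satisfied at position 0
print(bool(combined.search("xexp2")))  # no match: "exp2" is not at the start
```

An engine that instead reads `^^` as "a start marker followed by another start marker" would reject the first two inputs, which is the practical consequence of the distinction.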
i'm keeping this issue open because while I may leave this unfixed for the moment due to other issues, the correct behaviour appears to be what is implemented in most regex libraries [and was implemented by henry spencer decades ago]. anchors are constraints: "there is a beginning of line here" and "there is an end of line here", and `$^` is really legitimate when both overlap, as a synonym for `^$`, as unusual as it looks.
This is really fodder for the POSIX committee. Maybe they even have something to say about it already. But that is whose lead I would follow. My two cents.
Using the regular expression `$^` as a pattern, for example by running `awk /$^/`, causes the following error:
I encountered this error when running the `ident` script as part of `jscal-save`. The regular expression is a bit unconventional, but it satisfies the description in The AWK Programming Language book (and POSIX ERE) as far as I can tell, so I suspect it may be a bug. Curious what you think.
I can confirm that the same error is produced when running commit 87b9493, so it's not a regression (at least within the git history). Support in other awk implementations is mixed: busybox and GNU awk support it, FreeBSD awk does not.
Thanks, Kevin