Open zufuliu opened 11 months ago
I am going to work on this! ( I'm hoping that no one is working on it yet. In case yes, please lmk. )
Hi, I was looking to fix this and need a small clarification.
I went through the above examples, existing tests and the cpp reference, it says the following:
match_prev_avail | --first is a valid iterator position. When set, it causes match_not_bol and match_not_bow to be ignored.
I wanted to understand why we expect 0
from the below examples and not 1
. ?
string s = "ab";
string::iterator start = s.begin() + 1;
regex re0("^");
// Q: There is no match_not_bol
and match_not_bow
to ignore. --first is a valid iteration position. So why it should be 0?
cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail) << endl;
// Q: There is match_not_bol
to ignore. Ignoring it makes sure that we can match ^ with [first,first). --first is a valid iteration position. So why it should be 0?
cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail | regex_constants::match_not_bol) << endl;
I think I'm misunderstanding this part from reference. `--first is a valid iterator position`. Please let me know what I'm missing. Thanks!
Updated test to show matched position and length, https://godbolt.org/z/M6K3qraGh
include <string>
#include <regex>
#include <iostream>
using namespace std;
void show_match_result(const smatch &match) {
if (match.empty()) {
cout << false << endl;
} else {
cout << true << ": " << match.position(0) << ", " << match.length(0) << endl;
}
}
// https://github.com/llvm/llvm-project/issues/74838
int main() {
string s = "ab";
string::const_iterator start = s.cbegin() + 1;
regex re0("^");
smatch match;
cout << "^" << endl;
regex_search(start, s.cend(), match, re0, regex_constants::match_default);
show_match_result(match);
regex_search(start, s.cend(), match, re0, regex_constants::match_not_bol);
show_match_result(match);
regex_search(start, s.cend(), match, re0, regex_constants::match_prev_avail);
show_match_result(match);
regex_search(start, s.cend(), match, re0, regex_constants::match_prev_avail | regex_constants::match_not_bol);
show_match_result(match);
// https://github.com/llvm/llvm-project/issues/41544
regex re1("^ab");
cout << "\n\n" << "^ab" << endl;
regex_search(s, match, re1, regex_constants::match_default);
show_match_result(match);
regex_search(s, match, re1, regex_constants::match_not_bol);
show_match_result(match);
regex_search(s, match, re1, regex_constants::match_prev_avail);
show_match_result(match);
regex_search(s, match, re1, regex_constants::match_prev_avail | regex_constants::match_not_bol);
show_match_result(match);
regex re2("^b");
cout << "\n\n" << "^b" << endl;
regex_search(start, s.cend(), match, re2, regex_constants::match_default);
show_match_result(match);
regex_search(start, s.cend(), match, re2, regex_constants::match_not_bol);
show_match_result(match);
regex_search(start, s.cend(), match, re2, regex_constants::match_prev_avail);
show_match_result(match);
regex_search(start, s.cend(), match, re2, regex_constants::match_prev_avail | regex_constants::match_not_bol);
show_match_result(match);
return 0;
}
Can someone help me understand what I am missing here or point me to any reference. Thanks!
@SanjayMarreddi
Your first question:
Q: There is no
match_not_bol
andmatch_not_bow
to ignore.--first
is a valid iteration position. So why it should be0
?std::string s = "ab"; std::regex_search(s.begin() + 1, s.end(), std::regex("^"), std::regex_constants::match_prev_avail)
Reproducer: https://godbolt.org/z/Msec8Mcjz
My understanding of libstdc++'s behavior is that they never match "^"
when match_prev_avail
is used, because then they "know" that first
is not the true beginning of the sequence (in the sense that a previous iterator position is available). So they make ^
never match the beginning of a sequence in that case. I'm not certain that makes sense, and IMO the libc++ behavior is the one that seems correct to me, but I can't back this with any reference. TLDR, we have the same understanding at the moment and I'd be curious to hear what @jwakely thinks about this.
Your second question:
Q: There is
match_not_bol
to ignore. Ignoring it makes sure that we can match^
with[first, first)
.--first
is a valid iteration position. So why it should be0
?std::string s = "ab"; std::regex_search(start, s.end(), std::regex("^"), std::regex_constants::match_prev_avail | std::regex_constants::match_not_bol)
Reproducer: https://godbolt.org/z/vrTo3zzT4
Same reasoning here, this seems valid to me.
@zufuliu Can you explain why you think libc++ has the wrong behavior here? Here's a slightly more verbose version of your test cases after playing around with it: https://godbolt.org/z/hTfGra1s8
Can you explain why you think libc++ has the wrong behavior here?
Just my understanding for match_prev_avail
, i think it's an efficient (only dereference --first
as needed) replacement for following snippet:
std::regex_constants::match_flag_type flag = ...;
const char chPrev = *--first;
if (regex multiline mode) {
if (chPrev != '\n' && chPrev != '\r') {
flag |= std::regex_constants::match_not_bol;
}
} else {
flag |= std::regex_constants::match_not_bol;
}
if (isalnum(chPrev)) {
flag |= std::regex_constants::match_not_bow;
}
as match_not_bol
and match_not_bow
can be detected from --first
, so they are ignored from user supplied argument.
updated test case (added one multiline test, still lack test for match_prev_avail + match_not_bow
), https://godbolt.org/z/zc4GsGEKM
MSVC:
"^" matching "b"
match: position 0, length 0
no match
no match
no match
"^ab" matching "ab"
match: position 0, length 2
no match
"^b" matching "b"
match: position 0, length 1
no match
no match
no match
"^b" matching "b
b"
match: position 0, length 1
match: position 2, length 1
match: position 2, length 1
match: position 2, length 1
I must admit I don't understand the purpose of match_prev_avail
. I read the spec and I understand what it mentions about ignoring match_not_bow
and match_not_bol
, but the behavior seems somewhat underspecified to me and I'm far from a <regex>
expert so I don't have a lot of context.
I looked at Boost.Regex from which the Standard regex was originally created, and while this is by no means authoritative, it seems that Boost.Regex agrees with libstdc++'s interpretation: https://godbolt.org/z/cY64zrhfr
In other words, it does look like match_prev_avail
is intended to convey that we're in fact not really matching on [first, last)
, but on [first-1, last)
instead. Under that interpretation, it is clear that "^"
should never match when match_prev_avail
is used.
@SanjayMarreddi I think it's safe to treat this issue by assuming that libstdc++'s interpretation is the right one.
match_not_bol
, we expect the result to be match at position 2
as mentioned in the msvc output here right? Or do we expect no match as can be seen in the libstdc++
?
template
{ std::string s = "ab\nb"; check(s.cbegin() + 1, s.cend(), "^b", true, true); }
- Meanwhile, I made a draft PR. Please let me know if any of the test cases do not seem correct OR any additional test cases we could test.
It seems
match_prev_avail
implementation is regressed after fixed issue #41544.test cases:
GCC (msys2 mingw64 13.2.0):
MSVC (2022 17.8.3):
llvm-mingw 17.0.6 https://github.com/mstorsjo/llvm-mingw/releases/tag/20231128
based on https://sourceforge.net/p/scintilla/bugs/2405/?page=1#442d/89d2, libc++ (Xcode 15.0.1) on macOS also fails.