llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.02k stars 11.96k forks source link

[libc++] `std::regex` `match_prev_avail` implementation is regressed #74838

Open zufuliu opened 11 months ago

zufuliu commented 11 months ago

It seems match_prev_avail implementation is regressed after fixed issue #41544.

test cases:

#include <string>
#include <regex>
#include <iostream>
using namespace std;

int main() {
    string s = "ab";
    string::iterator start = s.begin() + 1;
    regex re0("^");
    cout << "^" << endl;
    cout << regex_search(start, s.end(), re0, regex_constants::match_default) << endl;
    cout << regex_search(start, s.end(), re0, regex_constants::match_not_bol) << endl;
    cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail) << endl;
    cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail | regex_constants::match_not_bol) << endl;

    // https://github.com/llvm/llvm-project/issues/41544
    regex re1("^ab");
    cout << "\n\n" << "^ab" << endl;
    cout << regex_search(s, re1, regex_constants::match_default) << endl;
    cout << regex_search(s, re1, regex_constants::match_not_bol) << endl;
    cout << regex_search(s, re1, regex_constants::match_prev_avail) << endl;
    cout << regex_search(s, re1, regex_constants::match_prev_avail | regex_constants::match_not_bol) << endl;

    regex re2("^b");
    cout << "\n\n" << "^b" << endl;
    cout << regex_search(start, s.end(), re2, regex_constants::match_default) << endl;
    cout << regex_search(start, s.end(), re2, regex_constants::match_not_bol) << endl;
    cout << regex_search(start, s.end(), re2, regex_constants::match_prev_avail) << endl;
    cout << regex_search(start, s.end(), re2, regex_constants::match_prev_avail | regex_constants::match_not_bol) << endl;
    return 0;
}

GCC (msys2 mingw64 13.2.0):

D:\notepad2>g++ -std=c++17 -Wall -Wextra test.cpp

D:\notepad2>a.exe
^
1
0
0
0

^ab
1
0
0
0

^b
1
0
0
0

MSVC (2022 17.8.3):

D:\notepad2>cl /nologo /EHsc /std:c++17 /W4 test.cpp
test.cpp

D:\notepad2>test.exe
^
1
0
0
0

^ab
1
0
0
0

^b
1
0
0
0

llvm-mingw 17.0.6 https://github.com/mstorsjo/llvm-mingw/releases/tag/20231128

D:\notepad2>clang++ -std=c++17 -Wall -Wextra test.cpp

D:\notepad2>a.exe
^
1
0
1
1

^ab
1
0
1
1

^b
1
0
1
1

based on https://sourceforge.net/p/scintilla/bugs/2405/?page=1#442d/89d2, libc++ (Xcode 15.0.1) on macOS also fails.

SanjayMarreddi commented 10 months ago

I am going to work on this! ( I'm hoping that no one is working on it yet. In case yes, please lmk. )

SanjayMarreddi commented 10 months ago

Hi, I was looking to fix this and need a small clarification.

// Q: There is no match_not_bol and match_not_bow to ignore. --first is a valid iteration position. So why it should be 0?

cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail) << endl;

// Q: There is match_not_bol to ignore. Ignoring it makes sure that we can match ^ with [first,first). --first is a valid iteration position. So why it should be 0?

cout << regex_search(start, s.end(), re0, regex_constants::match_prev_avail | regex_constants::match_not_bol) << endl;



I think I'm misunderstanding this part from reference. `--first is a valid iterator position`. Please let me know what I'm missing. Thanks!
zufuliu commented 10 months ago

Updated test to show matched position and length, https://godbolt.org/z/M6K3qraGh

include <string>
#include <regex>
#include <iostream>
using namespace std;

void show_match_result(const smatch &match) {
    if (match.empty()) {
        cout << false << endl;
    } else {
        cout << true << ": " << match.position(0) << ", " << match.length(0) << endl;
    }
}

// https://github.com/llvm/llvm-project/issues/74838
int main() {
    string s = "ab";
    string::const_iterator start = s.cbegin() + 1;
    regex re0("^");
    smatch match;
    cout << "^" << endl;
    regex_search(start, s.cend(), match, re0, regex_constants::match_default);
    show_match_result(match);
    regex_search(start, s.cend(), match, re0, regex_constants::match_not_bol);
    show_match_result(match);
    regex_search(start, s.cend(), match, re0, regex_constants::match_prev_avail);
    show_match_result(match);
    regex_search(start, s.cend(), match, re0, regex_constants::match_prev_avail | regex_constants::match_not_bol);
    show_match_result(match);

    // https://github.com/llvm/llvm-project/issues/41544
    regex re1("^ab");
    cout << "\n\n" << "^ab" << endl;
    regex_search(s, match, re1, regex_constants::match_default);
    show_match_result(match);
    regex_search(s, match, re1, regex_constants::match_not_bol);
    show_match_result(match);
    regex_search(s, match, re1, regex_constants::match_prev_avail);
    show_match_result(match);
    regex_search(s, match, re1, regex_constants::match_prev_avail | regex_constants::match_not_bol);
    show_match_result(match);

    regex re2("^b");
    cout << "\n\n" << "^b" << endl;
    regex_search(start, s.cend(), match, re2, regex_constants::match_default);
    show_match_result(match);
    regex_search(start, s.cend(), match, re2, regex_constants::match_not_bol);
    show_match_result(match);
    regex_search(start, s.cend(), match, re2, regex_constants::match_prev_avail);
    show_match_result(match);
    regex_search(start, s.cend(), match, re2, regex_constants::match_prev_avail | regex_constants::match_not_bol);
    show_match_result(match);
    return 0;
}
SanjayMarreddi commented 9 months ago

Can someone help me understand what I am missing here or point me to any reference. Thanks!

ldionne commented 9 months ago

@SanjayMarreddi

Your first question:

Q: There is no match_not_bol and match_not_bow to ignore. --first is a valid iteration position. So why it should be 0?

std::string s = "ab";
std::regex_search(s.begin() + 1, s.end(), std::regex("^"), std::regex_constants::match_prev_avail)

Reproducer: https://godbolt.org/z/Msec8Mcjz

My understanding of libstdc++'s behavior is that they never match "^" when match_prev_avail is used, because then they "know" that first is not the true beginning of the sequence (in the sense that a previous iterator position is available). So they make ^ never match the beginning of a sequence in that case. I'm not certain that makes sense, and IMO the libc++ behavior is the one that seems correct to me, but I can't back this with any reference. TLDR, we have the same understanding at the moment and I'd be curious to hear what @jwakely thinks about this.

Your second question:

Q: There is match_not_bol to ignore. Ignoring it makes sure that we can match ^ with [first, first). --first is a valid iteration position. So why it should be 0?

std::string s = "ab";
std::regex_search(start, s.end(), std::regex("^"), std::regex_constants::match_prev_avail | std::regex_constants::match_not_bol)

Reproducer: https://godbolt.org/z/vrTo3zzT4

Same reasoning here, this seems valid to me.

@zufuliu Can you explain why you think libc++ has the wrong behavior here? Here's a slightly more verbose version of your test cases after playing around with it: https://godbolt.org/z/hTfGra1s8

zufuliu commented 9 months ago

Can you explain why you think libc++ has the wrong behavior here?

Just my understanding for match_prev_avail, i think it's an efficient (only dereference --first as needed) replacement for following snippet:

std::regex_constants::match_flag_type flag = ...;
const char chPrev = *--first;
if (regex multiline mode) {
  if (chPrev != '\n' && chPrev != '\r') {
    flag |= std::regex_constants::match_not_bol;
  }
} else {
  flag |= std::regex_constants::match_not_bol;
}
if (isalnum(chPrev)) {
  flag |= std::regex_constants::match_not_bow;
}

as match_not_bol and match_not_bow can be detected from --first, so they are ignored from user supplied argument.

updated test case (added one multiline test, still lack test for match_prev_avail + match_not_bow), https://godbolt.org/z/zc4GsGEKM MSVC:

"^" matching "b"
match: position 0, length 0
no match
no match
no match

"^ab" matching "ab"
match: position 0, length 2
no match

"^b" matching "b"
match: position 0, length 1
no match
no match
no match

"^b" matching "b
b"
match: position 0, length 1
match: position 2, length 1
match: position 2, length 1
match: position 2, length 1
ldionne commented 9 months ago

I must admit I don't understand the purpose of match_prev_avail. I read the spec and I understand what it mentions about ignoring match_not_bow and match_not_bol, but the behavior seems somewhat underspecified to me and I'm far from a <regex> expert so I don't have a lot of context.

I looked at Boost.Regex from which the Standard regex was originally created, and while this is by no means authoritative, it seems that Boost.Regex agrees with libstdc++'s interpretation: https://godbolt.org/z/cY64zrhfr

In other words, it does look like match_prev_avail is intended to convey that we're in fact not really matching on [first, last), but on [first-1, last) instead. Under that interpretation, it is clear that "^" should never match when match_prev_avail is used.

@SanjayMarreddi I think it's safe to treat this issue by assuming that libstdc++'s interpretation is the right one.

SanjayMarreddi commented 9 months ago

template void check(It start, It end, char const* regex, bool prev_avail, bool multiline=false) { std::smatch match; std::regex re(regex, multiline ? std::regex::multiline : std::regex::ECMAScript); std::regex_search(start, end, match, re, std::regex_constants::match_not_bol); show_match_result(match); }

{ std::string s = "ab\nb"; check(s.cbegin() + 1, s.cend(), "^b", true, true); }


- Meanwhile, I made a draft PR. Please let me know if any of the test cases do not seem correct OR any additional test cases we could test.