Closed Zac-HD closed 4 months ago
Haha, I encountered some flavors of this while implementing the case transformation functions.
While working on this, I discovered that the `re.IGNORECASE` behavior differs significantly from what `.lower()` and friends do. Amusingly, this means that you can't lowercase a string and expect it to match the same case-insensitive regex. Fun!
At any rate, I think I've got something covering these cases in v0.0.64, but always hungry for more counterexamples!
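For anyone following along, here is a minimal sketch of the mismatch described above, using only the standard library. The variable names are just for illustration, and U+0130 is only one convenient example; this is not how CrossHair models it internally.

```python
import re

# U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotted_i = "\u0130"
pattern = re.compile(dotted_i, re.IGNORECASE)

# The original string matches its own case-insensitive pattern.
matches_original = pattern.fullmatch(dotted_i) is not None

# str.lower() applies the full case mapping, which yields two codepoints here,
# so a one-character pattern can no longer match the whole string.
lowered = dotted_i.lower()
matches_lowered = pattern.fullmatch(lowered) is not None

print(matches_original, len(lowered), matches_lowered)
```

The point is only that `str.lower()` can change the length of a string, while a single literal in a case-insensitive pattern still consumes exactly one character.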
0.0.64 looks fantastic, I'm really excited to see what happens when I update the run-crosshair-in-CI branch!
> always hungry for more counterexamples!
and oh boy is Hypothesis integration good news for you then 🤣
Hahaha; you honestly have no idea how happy all this is making me!
Found via https://github.com/HypothesisWorks/hypothesis/pull/4034: `tests/nocover/test_regex.py::test_case_insensitive_not_literal_never_constructs_multichar_match` fails.
Unfortunately there are several Unicode characters where upper-casing or lower-casing them can give you multiple codepoints; for example `chr(223)` (`'ß'.upper() == 'SS'`), or `chr(304)` (`'İ'.lower() == 'i̇'`; you can't see it but there's a combining-dot codepoint there too). The Turkish capital I with dot above is particularly cursed because normalizing doesn't roundtrip, hence being excluded in the test above.
(this is a terrible thing to report to a friend; but that's Unicode for you 😿)
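A quick way to see the two examples above, and to hunt for more characters like them, is sketched below. The helper name `multi_codepoint_case_mappings` is made up for this illustration and is not part of Hypothesis or CrossHair; the exact set it returns depends on the Unicode version bundled with your Python.

```python
import sys

sharp_s = chr(223)   # 'ß'
dotted_i = chr(304)  # 'İ'

# Upper-casing 'ß' yields two codepoints.
upper_sharp_s = sharp_s.upper()

# Lower-casing 'İ' yields 'i' followed by U+0307 COMBINING DOT ABOVE.
lower_dotted_i = dotted_i.lower()
lower_codepoints = [hex(ord(c)) for c in lower_dotted_i]

# Upper-casing the result does not give back the original single codepoint.
round_trips = lower_dotted_i.upper() == dotted_i


def multi_codepoint_case_mappings():
    """Return every character whose upper() or lower() is longer than one codepoint."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if len(chr(cp).upper()) > 1 or len(chr(cp).lower()) > 1
    ]


print(upper_sharp_s, lower_codepoints, round_trips)
```

Calling `multi_codepoint_case_mappings()` scans the whole codepoint range, so it takes a moment, but it gives a ready-made list of candidates to feed into a regex test.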