Internal error for case-insensitive regex where changing case can change string length

Zac-HD commented 2 months ago

Found via https://github.com/HypothesisWorks/hypothesis/pull/4034, tests/nocover/test_regex.py::test_case_insensitive_not_literal_never_constructs_multichar_match fails with

  File ".../crosshair/libimpl/relib.py", line 101, in single_char_mask
    ret = CharMask([ord(chr(arg).lower()), ord(chr(arg).upper())])
TypeError: ord() expected a character, but string of length 2 found
while generating 'Draw 1: ' from from_regex(re.compile(r'[^İ]+', re.IGNORECASE|re.UNICODE), fullmatch=True)

Unfortunately there are several Unicode characters where upper-casing or lower-casing them can give you multiple codepoints; for example chr(223) ('ß'.upper() == 'SS'), or chr(304) ('İ'.lower() == 'i̇'; you can't see it but there's a combining-dot codepoint there too). The Turkish capital I with dot above is particularly cursed because normalizing doesn't roundtrip, hence being excluded in the test above.
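For readers following along, here is a minimal sketch of the length-changing behaviour described above, and of why calling ord() on the converted string raises. The helper `single_codepoint_case_variants` is hypothetical: it is not CrossHair's code or its eventual fix, and it makes no attempt to reproduce re.IGNORECASE semantics. It only shows one way to avoid calling ord() on a multi-codepoint result.

```python
# Case conversion can change the number of codepoints in a string.
sharp_s = chr(223)   # 'ß'
dotted_i = chr(304)  # 'İ' (U+0130)

upper_sharp_s = sharp_s.upper()    # two codepoints: 'SS'
lower_dotted_i = dotted_i.lower()  # two codepoints: 'i' + U+0307

# ord() only accepts a single character, so this reproduces the TypeError.
try:
    ord(lower_dotted_i)
    ord_error = None
except TypeError as exc:
    ord_error = exc


def single_codepoint_case_variants(codepoint: int) -> set[int]:
    """Hypothetical helper (an assumption, not CrossHair's API).

    Returns the codepoint plus any lower/upper variant that is itself
    exactly one codepoint long, skipping multi-codepoint expansions.
    """
    char = chr(codepoint)
    variants = {codepoint}
    for converted in (char.lower(), char.upper()):
        if len(converted) == 1:
            variants.add(ord(converted))
    return variants
```

With this sketch, `single_codepoint_case_variants(ord('a'))` includes both 'a' and 'A', while for chr(223) and chr(304) the multi-codepoint conversions are simply dropped.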

(this is a terrible thing to report to a friend; but that's Unicode for you 😿)

pschanely commented 1 month ago

Haha, I encountered some flavors of this while implementing the case transformation functions.

While working on this, I discovered that the re.IGNORECASE behavior differs significantly from what .lower() and friends do. Amusingly, this means that you can't lowercase a string and expect it to match the same case-insensitive regex. Fun!
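A small sketch of that mismatch, using only the standard re module. The choice of 'İ' (U+0130) as the example is an assumption for illustration; the comment above does not say which characters were involved.

```python
import re

# 'İ' (U+0130), written as an escape so the codepoints are unambiguous.
dotted_capital_i = "\u0130"
pattern = re.compile(dotted_capital_i, re.IGNORECASE)

# The original string matches its own literal pattern.
matches_original = pattern.fullmatch(dotted_capital_i) is not None

# str.lower() applies the full case mapping: 'i' followed by U+0307.
lowered = dotted_capital_i.lower()

# The pattern consumes exactly one character, but the lowered string
# has two, so the same case-insensitive regex no longer fully matches.
matches_lowered = pattern.fullmatch(lowered) is not None
```

Here `matches_original` is True and `matches_lowered` is False, because the regex engine compares one codepoint at a time while str.lower() may expand a character into several.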

At any rate, I think I've got something covering these cases in v0.0.64, but always hungry for more counterexamples!

Zac-HD commented 1 month ago

0.0.64 looks fantastic, I'm really excited to see what happens when I update the run-crosshair-in-CI branch!

> always hungry for more counterexamples!

and oh boy is Hypothesis integration good news for you then 🤣

pschanely commented 1 month ago

> and oh boy is Hypothesis integration good news for you then 🤣

Hahaha; you honestly have no idea how happy all this is making me!

pschanely / CrossHair

Internal error for case-insensitive regex where changing case can change string length #274