princjef / regex-rust

A regular expression library implemented natively in Rust using Pike VM
MIT License
2 stars 0 forks source link

Simple case folding not fully implemented #8

Open princjef opened 10 years ago

princjef commented 10 years ago

The to_uppercase() and to_lowercase() functions in the Rust standard library don't provide any means of linking up groups of more than two characters that are all equivalent when case insensitive matching is on.

For example, there are two different lowercase characters for the Greek letter sigma. Under the official rules for case folding (case insensitive matching), if one of these characters is in a regular expression, a case insensitive match should not only match its uppercase version, but also it's other lowercase version, yielding three possible characters in the case insensitive match. The Rust standard library only maps the lowercase to their corresponding uppercase letter and the uppercase only maps to one of the lowercase letters. This makes finding all three from any arbitrary character impossible in reasonable time (such an algorithm would be linear in the number of total Unicode characters for each character in the regular expression). See this gist for an example.

Since our library's case folding is backed by these library functions, it is very difficult to do proper simple case folding. This might be worth looking into down the road, but not an immediate issue.