pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Implement `.str.to_casefold` to allow case insensitive comparison between strings. #6782

Open StefanBRas opened 1 year ago

StefanBRas commented 1 year ago

Problem description

With the current functions it is not possible to do a correct case-insensitive string comparison. For example, you can get a wrong result if you use .str.to_lowercase:

import polars as pl

(
    pl.DataFrame({'col1': ['straße'], 'col2': ['STRASSE']})
    .select([pl.col('col1').str.to_lowercase() == pl.col('col2').str.to_lowercase()])
)  # -> False

# Python's built-in case folding treats the two strings as equal
'straße'.casefold() == 'STRASSE'.casefold()  # -> True

Python has it built in as str.casefold, which returns a case-folded version of the string. Pandas has it as Series.str.casefold.

The algorithm is specified in Section 3.13 of the Unicode Standard.
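For reference, here is a minimal pure-Python sketch of caseless matching built on str.casefold, following the pattern shown in Python's Unicode HOWTO. The helper names `nfd` and `caseless_equal` are made up for illustration and are not part of polars or of any proposed API; the normalization step is an extra on top of plain case folding, so that precomposed and decomposed characters also compare equal.

```python
import unicodedata


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def caseless_equal(left: str, right: str) -> bool:
    """Hypothetical helper: compare two strings by case folding.

    Case folding can produce strings that are no longer normalized,
    so the result of casefold() is normalized a second time.
    """
    return nfd(nfd(left).casefold()) == nfd(nfd(right).casefold())


# Lowercasing keeps 'ß', so the comparison from the issue fails ...
print("straße".lower() == "STRASSE".lower())
# ... while case folding expands 'ß' to 'ss'.
print(caseless_equal("straße", "STRASSE"))
# Precomposed 'é' versus 'E' followed by a combining acute accent.
print(caseless_equal("\u00e9", "E\u0301"))
```

If normalization is not a concern, comparing `left.casefold() == right.casefold()` directly is the simpler form used in the example above.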

It seems like Rust does not have this built in, and the best bet would be either the [focaccia crate](https://crates.io/crates/focaccia), which is the most recently updated one, or caseless, which is more widely used but less documented and less recently updated.

ritchie46 commented 1 year ago

I feel very much that we should build an extension package for these exotic cases. With https://github.com/pola-rs/pyo3-polars in place, we could very easily make specialization libraries that can be installed opt-in.

StefanBRas commented 1 year ago

I have no strong feelings either way. I do, however, feel that the most common use of str.to_lowercase and str.to_uppercase is to make (wrong) case-insensitive comparisons, so including str.to_casefold would be no more exotic than those.
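To illustrate the point about lowercasing and uppercasing, here is a small sketch in plain Python (not polars) that compares the three strategies on a few string pairs. The function name `compare_strategies` is hypothetical, and "correct" here is taken to mean agreement with Unicode default case folding as implemented by str.casefold.

```python
def compare_strategies(left: str, right: str) -> dict:
    """Hypothetical helper: report equality under each case mapping."""
    return {
        "lower": left.lower() == right.lower(),
        "upper": left.upper() == right.upper(),
        "casefold": left.casefold() == right.casefold(),
    }


pairs = [
    ("straße", "STRASSE"),  # sharp s versus 'SS'
    ("\u03c2", "\u03c3"),   # Greek final sigma versus regular sigma
    ("\u0131", "i"),        # dotless i versus ordinary i
]

for left, right in pairs:
    print(repr(left), repr(right), compare_strategies(left, right))
```

On these pairs, lowercasing misses matches that case folding finds, while uppercasing treats the dotless i and the ordinary i as equal although default case folding keeps them distinct.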

gorkaerana commented 1 year ago

I don't think this is an exotic use case, as Python's documentation recommends str.casefold for caseless string matching.

stinodego commented 4 months ago

I think this would be a useful addition.