Documentation about Unicode normalization

magiclen / str-utils

This crate provides some traits to extend types which implement `AsRef<[u8]>` or `AsRef<str>`.

MIT License


Documentation about Unicode normalization #2

Open Chaoses-Ib opened 12 months ago

Chaoses-Ib commented 12 months ago

`unicase` doesn't apply Unicode normalization to strings (https://github.com/seanmonstar/unicase/issues/48), so `eq_ignore_case` can return the wrong result for strings that look identical but are encoded differently, for example:

assert!("Åström".eq_ignore_case("Åström"))
// assertion failed: \"Åström\".eq_ignore_case(\"Åström\")

Unicode normalization can be done using https://github.com/unicode-org/icu4x or https://github.com/unicode-rs/unicode-normalization. However, it adds some complexity and may hurt performance. If you would rather not do it, a warning in the documentation would at least help users.
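To make the difference visible, here is a minimal sketch that spells the same name twice with explicit escapes, once precomposed (NFC) and once decomposed (NFD). The constant names and `eq_ignore_case_naive` are hypothetical and use only the standard library; this is not how `str-utils` or `unicase` implement the comparison.

```rust
// Precomposed (NFC): U+00C5 (Å) and U+00F6 (ö) are single code points.
const PRECOMPOSED: &str = "\u{00C5}str\u{00F6}m";
// Decomposed (NFD): base letters followed by combining marks
// U+030A (ring above) and U+0308 (diaeresis).
const DECOMPOSED: &str = "A\u{030A}stro\u{0308}m";

/// Hypothetical helper: case-insensitive comparison without any
/// normalization, built on `str::to_lowercase` from the standard library.
fn eq_ignore_case_naive(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

Both constants render the same, but they hold 6 and 8 code points respectively, so `eq_ignore_case_naive(PRECOMPOSED, DECOMPOSED)` returns `false` even though case folding alone works fine for the precomposed spelling.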

magiclen commented 11 months ago

I was not aware that ö and ö are encoded differently. They look the same.

However, even the built-in equality comparison treats them as not equal.

assert_eq!("ö", "ö"); // assertion failed

Why should they be considered equal when performing a case-insensitive comparison?

Chaoses-Ib commented 11 months ago

Yeah, that makes sense. Whether to normalize should depend on the use case. Chromium normalizes when searching text on a page. According to this article, Windows and Linux on ext4 don't normalize file names, macOS does, and Linux on ZFS does so depending on user configuration.

Adding separate variants of the functions that normalize before comparing, such as `eq_norm` and `eq_norm_ignore_case`, may be the most workable approach.
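A rough sketch of what such variants could look like. Only the names `eq_norm` and `eq_norm_ignore_case` come from the suggestion above; the signatures are assumptions, and `decompose` / `nfd_toy` are hypothetical stand-ins that cover just four letters. A real implementation would delegate to a full normalizer such as one of the crates linked earlier.

```rust
/// Toy canonical decomposition for four letters only (an assumption for
/// illustration). It does not reorder combining marks or cover any
/// other characters.
fn decompose(c: char) -> (char, Option<char>) {
    match c {
        '\u{00C5}' => ('A', Some('\u{030A}')), // Å
        '\u{00E5}' => ('a', Some('\u{030A}')), // å
        '\u{00D6}' => ('O', Some('\u{0308}')), // Ö
        '\u{00F6}' => ('o', Some('\u{0308}')), // ö
        other => (other, None),
    }
}

/// Hypothetical stand-in for NFD normalization, built on `decompose`.
fn nfd_toy(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        let (base, mark) = decompose(c);
        out.push(base);
        out.extend(mark); // appends the combining mark, if any
    }
    out
}

/// Case-sensitive comparison after (toy) normalization.
pub fn eq_norm(a: &str, b: &str) -> bool {
    nfd_toy(a) == nfd_toy(b)
}

/// Case-insensitive comparison after (toy) normalization.
pub fn eq_norm_ignore_case(a: &str, b: &str) -> bool {
    nfd_toy(a).to_lowercase() == nfd_toy(b).to_lowercase()
}
```

Keeping these as separate functions leaves the existing `eq_ignore_case` behavior and cost unchanged, so callers opt in to normalization only where they need it.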