unicode-rs / unicode-normalization

Unicode Normalization forms according to UAX#15 rules
https://unicode-rs.github.io/unicode-normalization

Accent-stripping example? #84

Open ccleve opened 2 years ago

ccleve commented 2 years ago

Would it be possible to add an example of stripping accents to the documentation? (This is commonly needed for search applications.)

As I understand it, the right way to do this is to check whether each character is a letter (IS_LETTER) in one of these Unicode blocks: LATIN_1_SUPPLEMENT, LATIN_EXTENDED_A, LATIN_EXTENDED_B, or LATIN_EXTENDED_ADDITIONAL. If it is, decompose it, remove any non-spacing marks (NON_SPACING_MARK), and recompose the result.
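To make the steps concrete, here is a minimal sketch of that procedure. It is written in Python with the standard `unicodedata` module purely to illustrate the algorithm; it does not use this crate's API. The function names (`in_latin_blocks`, `strip_accent`, `strip_accents`) are hypothetical, and the initial NFC pass is my own assumption so that input arriving as base letter plus combining mark is handled the same way as precomposed input.

```python
import unicodedata

# Code point ranges of the four Unicode blocks named above.
LATIN_BLOCKS = (
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
)


def in_latin_blocks(ch):
    """Return True if ch falls in one of the four Latin blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


def strip_accent(ch):
    """Strip non-spacing marks from one character (hypothetical helper)."""
    is_letter = unicodedata.category(ch).startswith("L")
    if not (is_letter and in_latin_blocks(ch)):
        return ch  # leave everything else untouched
    # Canonical decomposition: base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    # Drop characters whose general category is Mn (Nonspacing_Mark).
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


def strip_accents(text):
    """Apply strip_accent to every character of text.

    Assumption: the text is first normalized to NFC, so that sequences
    such as 'e' + U+0301 become a single precomposed letter.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(strip_accent(ch) for ch in composed)
```

For example, `strip_accents("café")` gives `"cafe"`. Letters that have no canonical decomposition, such as `ø` or `ß`, pass through unchanged, and so do accented letters outside the four blocks. A combining mark attached to a base letter that has no precomposed form (for instance `x` followed by U+0301) is also left alone, which may or may not be what a search application wants.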

I haven't been able to figure out how to tell whether a character is a non-spacing mark.
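For reference, "non-spacing mark" corresponds to the Unicode general category `Mn`, which is narrower than "combining mark" (`Mn`, `Mc`, and `Me` together). The sketch below shows the distinction using Python's standard `unicodedata` module; it only illustrates the property lookup and says nothing about how this crate exposes it. The helper names `is_nonspacing_mark` and `mark_kind` are hypothetical.

```python
import unicodedata

# General categories that make up the "Mark" group.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing",
    "Me": "enclosing",
}


def is_nonspacing_mark(ch):
    """Return True only for general category Mn (Nonspacing_Mark)."""
    return unicodedata.category(ch) == "Mn"


def mark_kind(ch):
    """Return 'nonspacing', 'spacing', 'enclosing', or None for non-marks."""
    return MARK_KINDS.get(unicodedata.category(ch))
```

Here U+0301 COMBINING ACUTE ACCENT is a non-spacing mark, while U+0903 DEVANAGARI SIGN VISARGA (`Mc`) and U+20DD COMBINING ENCLOSING CIRCLE (`Me`) are combining marks of other kinds. A filter that tests for "any combining mark" would therefore remove more than one that tests for `Mn` alone.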