nalgeon / sqlean

The ultimate set of SQLite extensions
MIT License
3.75k stars 120 forks

unicode normalization #10

Closed rurban closed 3 years ago

rurban commented 3 years ago

Unaccent is a nice feature, but it fails on denormalized ordered sequences and on all non-mark sequences, as found in all non-European languages. There really should be a normalization step added, such as NFD, and maybe even a field to cache the NFD string, plus a flag recording whether normalization has already been done (and whether the result equals the original, as it does in 95% of all cases).
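To illustrate the suggestion, here is a minimal Python sketch of an NFD step followed by mark stripping, using only the standard `unicodedata` module. It is not sqlean's code; the function names `unaccent_nfd` and `nfd_with_flag` are hypothetical, and the "cache plus flag" part is only an assumed reading of the idea above.

```python
import unicodedata


def unaccent_nfd(text: str) -> str:
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns the canonical combining class;
    # it is nonzero for combining marks such as U+0301 (acute accent).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def nfd_with_flag(text: str) -> tuple[str, bool]:
    """Return the NFD form and a flag telling whether it equals the input.

    A caller could store both values, so that strings that are already
    in NFD do not need to be normalized again (an assumed design).
    """
    nfd = unicodedata.normalize("NFD", text)
    return nfd, nfd == text
```

With this approach the precomposed `"caf\u00e9"` and the decomposed `"cafe\u0301"` both reduce to the same unaccented string, which is the point of normalizing first.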

nalgeon commented 3 years ago

Yeah, probably. Unfortunately, I'm not nearly as good with Unicode (or C programming) as necessary to even try adding that.