sheredom / utf8.h

📚 single header utf8 string functions for C and C++
The Unlicense

utf8lwrcodepoint for "Greek Capital Theta Symbol" incorrect #49

Closed: abylouw closed this issue 3 years ago

abylouw commented 6 years ago

https://github.com/sheredom/utf8.h/blob/89a1a24e802e0ae231e2e0db13869eda0f32315c/utf8.h#L1159

The correct lowercase mapping for U+03F4 is U+03B8.

https://www.compart.com/en/unicode/U+03F4 https://codepoints.net/U+03F4?lang=en

sheredom commented 6 years ago

Well this is just awful:

And inversely:

So basically we've got multiple codepoints -> a single codepoint, which means I have to arbitrate and decide which codepoint should be picked! I'll have to spend some brain cycles on this...

abylouw commented 6 years ago

I think the issue is that Unicode case mappings are not reversible (https://www.unicode.org/faq/casemap_charprop.html#7) and therefore you will need two tables, one for upper to lower and one for lower to upper.
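A minimal sketch of the two-table idea, using only the theta family discussed in this issue. The names `case_pair`, `to_lower_table`, `to_upper_table`, `sketch_lower` and `sketch_upper` are hypothetical and are not part of utf8.h; a real table would cover far more codepoints and would likely use a faster lookup than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names for illustration; not the utf8.h API. */
typedef struct {
  uint32_t from;
  uint32_t to;
} case_pair;

/* Upper -> lower: two capitals share one lowercase form. */
static const case_pair to_lower_table[] = {
    {0x0398, 0x03B8}, /* GREEK CAPITAL LETTER THETA -> GREEK SMALL LETTER THETA */
    {0x03F4, 0x03B8}, /* GREEK CAPITAL THETA SYMBOL -> GREEK SMALL LETTER THETA */
};

/* Lower -> upper: two lowercase forms share one capital. */
static const case_pair to_upper_table[] = {
    {0x03B8, 0x0398}, /* GREEK SMALL LETTER THETA -> GREEK CAPITAL LETTER THETA */
    {0x03D1, 0x0398}, /* GREEK THETA SYMBOL -> GREEK CAPITAL LETTER THETA */
};

/* Linear lookup; returns the codepoint unchanged when it has no mapping. */
static uint32_t map_codepoint(const case_pair *table, size_t count,
                              uint32_t cp) {
  size_t i;
  for (i = 0; i < count; ++i) {
    if (table[i].from == cp) {
      return table[i].to;
    }
  }
  return cp;
}

uint32_t sketch_lower(uint32_t cp) {
  return map_codepoint(to_lower_table,
                       sizeof(to_lower_table) / sizeof(to_lower_table[0]), cp);
}

uint32_t sketch_upper(uint32_t cp) {
  return map_codepoint(to_upper_table,
                       sizeof(to_upper_table) / sizeof(to_upper_table[0]), cp);
}
```

With separate tables there is nothing to arbitrate: U+03F4 lowers to U+03B8, and U+03B8 uppers to U+0398, so a round trip from U+03F4 does not return to U+03F4.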

rurban commented 5 years ago

In musl I implemented it as a single sorted list, with certain entries just for upper, but lower was the preferred op. See https://github.com/rurban/musl/commit/bd9f1e60ac55143c507c767ba070ab99a5760baa
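For comparison, here is a rough sketch of what a single list with direction-restricted entries could look like. This is an assumption about the general shape only and is not taken from the linked musl commit; `case_entry`, the `DIR_*` flags, `single_lower` and `single_upper` are hypothetical names, and the linear scan stands in for whatever search a sorted table would really use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direction flags; not taken from musl or utf8.h. */
enum { DIR_BOTH = 0, DIR_LOWER_ONLY = 1, DIR_UPPER_ONLY = 2 };

typedef struct {
  uint32_t upper;
  uint32_t lower;
  uint8_t dir; /* which direction(s) this entry is valid for */
} case_entry;

/* One list, sorted by the uppercase codepoint. */
static const case_entry case_table[] = {
    {0x0398, 0x03B8, DIR_BOTH},       /* capital theta <-> small theta */
    {0x0398, 0x03D1, DIR_UPPER_ONLY}, /* theta symbol -> capital theta only */
    {0x03F4, 0x03B8, DIR_LOWER_ONLY}, /* capital theta symbol -> small theta only */
};

#define CASE_TABLE_COUNT (sizeof(case_table) / sizeof(case_table[0]))

/* Lowercase lookup: skip entries that are valid only for uppercasing. */
uint32_t single_lower(uint32_t cp) {
  size_t i;
  for (i = 0; i < CASE_TABLE_COUNT; ++i) {
    if (case_table[i].upper == cp && case_table[i].dir != DIR_UPPER_ONLY) {
      return case_table[i].lower;
    }
  }
  return cp;
}

/* Uppercase lookup: skip entries that are valid only for lowercasing. */
uint32_t single_upper(uint32_t cp) {
  size_t i;
  for (i = 0; i < CASE_TABLE_COUNT; ++i) {
    if (case_table[i].lower == cp && case_table[i].dir != DIR_LOWER_ONLY) {
      return case_table[i].upper;
    }
  }
  return cp;
}
```

The trade-off against two tables is one shared data structure at the cost of a flag check on every lookup.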

mileder commented 4 years ago

Hi sheredom, really nice work! But I am missing Cyrillic character support for the upper/lower case operations. Is it possible to implement this? One of my native languages is Russian, and I would be glad to help with any questions regarding Cyrillic characters.

sheredom commented 4 years ago

@mileder yeah, it's possible - I happily accept PRs for this kind of thing.

You can see https://github.com/sheredom/utf8.h/pull/47 as an example of how I did it, plus the tests for Greek characters.