sheredom / utf8.h

📚 single header utf8 string functions for C and C++
The Unlicense

utf8lwr() - handle accented upper case vowels #45

Closed giampaolo closed 6 years ago

giampaolo commented 6 years ago

Currently the lowercase conversion in utf8.h only handles ASCII characters, so accented vowels such as Á cannot be converted between upper and lower case. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK with monkey-patching utf8.h myself. I suppose the change has to take place here: https://github.com/sheredom/utf8.h/blob/1ca34ece0708a33a28011690fc8461d8c8259054/utf8.h#L1016 Any advice on how to do it? (I'm not a great C coder unfortunately :-)

sheredom commented 6 years ago

Thanks for your comment!

So this is my primary issue with bringing functionality like this in - I really do want to support it, I'm just scared to add some of the additional lwr/upr variants without adding them all!

You've identified the correct place in the code for where this needs to go - I'll go explore the utf8 codepoints and see if there is an easy way to do this transformation and get back to you.
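For the Latin-1 Supplement block (which includes ÀÈÌÒÙ) the transformation happens to be simple at the byte level. The uppercase letters U+00C0 to U+00DE are encoded in UTF-8 as 0xC3 followed by 0x80 to 0x9E, and each lowercase counterpart sits exactly 0x20 higher; the one exception is U+00D7 (×), which is not a letter. Below is a minimal sketch of that idea, assuming valid NUL-terminated UTF-8 input. The helper name `lower_latin1_inplace` is hypothetical and is not part of utf8.h, and this is not necessarily how the library ended up implementing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (NOT part of utf8.h): lowercases ASCII and
 * Latin-1 Supplement letters in place. The conversion can be done in
 * place because upper and lower case forms in these ranges have the
 * same encoded length. */
static void lower_latin1_inplace(char *s) {
  unsigned char *p;

  assert(s != NULL);
  p = (unsigned char *)s;

  while (*p != '\0') {
    if (*p >= 'A' && *p <= 'Z') {
      /* Plain ASCII upper case. */
      *p = (unsigned char)(*p + ('a' - 'A'));
      p += 1;
    } else if (p[0] == 0xC3 && p[1] >= 0x80 && p[1] <= 0x9E &&
               p[1] != 0x97) {
      /* U+00C0..U+00DE except U+00D7 (multiplication sign):
       * the lower case form is 0x20 higher in the second byte. */
      p[1] = (unsigned char)(p[1] + 0x20);
      p += 2;
    } else {
      /* Anything else is left untouched. Continuation bytes never
       * match the tests above, so stepping one byte at a time is safe. */
      p += 1;
    }
  }
}
```

Calling it on a buffer holding "ÀÈ ABC" leaves "àè abc" behind. Letters outside these two ranges are passed through unchanged.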

giampaolo commented 6 years ago

Thanks for your fast reply. I understand your concern. FWIW I think utf8proc is a lib which is supposed to cover most cases.

r-lyeh commented 6 years ago

Don't simplify this. It is a major issue.

Spanish will need ÁÉÍÓÚ and ÑÜ, Romanian adds ĂÂÎȘȚ, Hungarian adds ÖŐŰ, Polish adds ĄĆĘŁŃŚŹŻ, and so on (...). And yet they are all still Latin-script languages.

sheredom commented 6 years ago

Right - I can mostly follow the Latin list here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin_script to get at least the Latin-based languages working.
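To give a feel for what following that list involves, here is a rough sketch of a code point level mapping for the Latin Extended-A block (U+0100 to U+017F), which holds most of the Polish and Hungarian letters mentioned above. In that block upper and lower case letters alternate, but the parity flips in two sub-ranges. The function name `latin_ext_a_lower` is hypothetical and not taken from utf8.h, and the sketch deliberately leaves U+0130 (İ) alone, since its lower case form is plain ASCII `i` and so has a different encoded length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (NOT part of utf8.h): returns the lower case form
 * of a Latin Extended-A code point, or the input unchanged. */
static uint32_t latin_ext_a_lower(uint32_t cp) {
  /* Sub-ranges where even = upper case, odd = lower case. */
  if ((cp >= 0x0100 && cp <= 0x012F) ||
      (cp >= 0x0132 && cp <= 0x0137) ||
      (cp >= 0x014A && cp <= 0x0177)) {
    return cp | 1u;
  }
  /* Sub-ranges where odd = upper case, even = lower case. */
  if ((cp >= 0x0139 && cp <= 0x0148) ||
      (cp >= 0x0179 && cp <= 0x017E)) {
    return (cp & 1u) ? cp + 1u : cp;
  }
  /* U+0178 (Y with diaeresis) pairs with U+00FF in Latin-1. */
  if (cp == 0x0178) {
    return 0x00FF;
  }
  return cp;
}

/* A few spot checks against well-known pairs. */
static void latin_ext_a_lower_selfcheck(void) {
  assert(latin_ext_a_lower(0x0104) == 0x0105); /* Polish A with ogonek */
  assert(latin_ext_a_lower(0x0141) == 0x0142); /* Polish L with stroke */
  assert(latin_ext_a_lower(0x0150) == 0x0151); /* Hungarian O, double acute */
  assert(latin_ext_a_lower(0x0179) == 0x017A); /* Polish Z with acute */
  assert(latin_ext_a_lower(0x0105) == 0x0105); /* already lower case */
}
```

The Romanian Ș and Ț (U+0218, U+021A) live in Latin Extended-B and are not covered here, which illustrates why supporting "just the Latin languages" grows quickly.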

sheredom commented 6 years ago

I've given this a first bash in #46 - it was more work than I expected to add the Latin case support to all the utf8 case functions!

Please check out the PR.

giampaolo commented 6 years ago

Sweet, I will check this tomorrow. Thanks a lot for working on this, I think it's a valuable improvement.

sheredom commented 6 years ago

I've added the Greek letters you requested in #47 - can you check it out?

sheredom commented 6 years ago

@giampaolo stated that #47 handled his requirement (https://github.com/sheredom/utf8.h/pull/47#issuecomment-364915972) so I'm closing this issue!

If you have any requests in future please get in touch though!

giampaolo commented 6 years ago

I'm trying other languages by using (Python) unit tests in which I copy articles from Wikipedia. Amongst others, the Armenian, Bulgarian and Bashkir languages have some issues.