Open jwilk opened 6 years ago
Sounds reasonable.
I haven't done so already only because letter form blocks that are not substantially complete lead to unfun. I wonder what to do for missing letters.
Combining characters are not an option, even when placed above U+0020 as a carrier.
Also, there are no capital subscripts at all. Do you think it's better to convert to lowercase or to leave them unhandled?
It's better to keep uppercase letters intact.
So I did most of the work manually... and only while adding those weird IPA letters did I realize that UnicodeData has:
02B0;MODIFIER LETTER SMALL H;Lm;0;L;<super> 0068;;;;N;;;;;
which makes generating the whole table a one-liner...
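For illustration only, here is a rough sketch of how such a table could be generated from the `<super>` and `<sub>` decomposition tags. It is written in Python and uses the standard `unicodedata` module instead of parsing UnicodeData.txt directly; the `script_table` helper is a made-up name and not part of tran.

```python
import sys
import unicodedata

def script_table(tag):
    """Map each base character to a character whose compatibility
    decomposition is '<tag> BASE' (hypothetical helper, not tran code)."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)  # e.g. '<super> 0068'
        if not decomp.startswith(tag + " "):
            continue
        parts = decomp.split()[1:]
        if len(parts) != 1:
            continue  # skip forms that decompose to several characters
        base = chr(int(parts[0], 16))
        # The lowest code point wins, so 'a' ends up as U+00AA
        # (FEMININE ORDINAL INDICATOR) rather than U+1D43; a real table
        # would probably want to override such cases by hand.
        table.setdefault(base, ch)
    return table

superscripts = script_table("<super>")
subscripts = script_table("<sub>")
```

The result depends on the Unicode version bundled with the Python interpreter, so newer letters may be missing on older installations.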
But then, it turns out there's a code problem. The current algorithm assumes that both sides of the conversion can be freely lower/uppercased, which works fine but only as long as either both cases exist or case-mangled conversion is ok (unicameral scripts). Not the case here...
Fixing this would require either goofy tagging or moving case handling from runtime to building conversion tables (the latter might speed up the program, too). That'd require some effort, thus I just committed the data to a branch, dropping this issue to the bottom of my TODO list... :(
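A minimal sketch of what the build-time option might look like, assuming the conversion table is a plain mapping from source to target strings. This is Python, not tran's actual code, and `expand_case` is a hypothetical name. The idea is to add a case variant only when both sides really have one, so that `H` stays untouched when the target has no uppercase form.

```python
def expand_case(pairs):
    """Add upper/lowercase variants while building the table
    (hypothetical helper, not tran code)."""
    table = dict(pairs)  # explicitly listed entries always win
    for src, dst in pairs.items():
        for change in (str.upper, str.lower):
            s, d = change(src), change(dst)
            # Only add the variant if both sides actually changed case;
            # otherwise the input letter is left unhandled.
            if s != src and d != dst:
                table.setdefault(s, d)
    return table

# U+02B0 has no uppercase form, so no entry for 'H' is generated.
modifier = expand_case({"h": "\u02b0"})
# Greek alpha has one, so 'A' is added alongside 'a'.
greek = expand_case({"a": "\u03b1"})
```

With the variants precomputed like this, the runtime lookup would no longer need to lowercase or uppercase anything.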
I wonder though, perhaps it might be better to fix the root issue of some {super,sub}scripts missing upstream?
I'd like tran to support subscripts and superscripts: https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block