rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Thai data #80

Closed: lianghai closed this issue 2 years ago

lianghai commented 2 years ago

I was trying to utilize the Hyperglot data, then noticed that the Thai data doesn’t look right:

https://github.com/rosettatype/hyperglot/blob/7b4938b8453d118257c0277c5f04fd7c71afab46/lib/hyperglot/hyperglot.yaml#L9253-L9272

I know it’s still “primary” and to be reviewed. Before suggesting changes, I’d like to first understand how the current data was created, because it cites Omniglot, Wikipedia, and CLDR as sources, but apparently it differs a lot from what those sources provide (note the lack of many above-base vowel/tone marks). For example, the CLDR data:

https://github.com/rosettatype/hyperglot/blob/7b4938b8453d118257c0277c5f04fd7c71afab46/other/cldr.yaml#L3469-L3483

Is there somewhere I can learn about how the current Thai data was originally created? Or should we just forget about it and start from what CLDR has?
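A comparison like the one described above can be sketched in a few lines of Python. This is only an illustration: the helper name `missing_marks` and the two sample strings are hypothetical and are not the actual Hyperglot or CLDR Thai data. It lists the combining marks that appear in a reference character list but not in a candidate list, using the standard `unicodedata` module.

```python
import unicodedata


def missing_marks(reference: str, candidate: str) -> list[str]:
    """Return combining marks found in `reference` but absent from `candidate`."""
    missing = set(reference) - set(candidate)
    # Unicode general categories Mn, Mc and Me all start with "M".
    return sorted(ch for ch in missing if unicodedata.category(ch).startswith("M"))


# Illustrative samples only, NOT the real Hyperglot or CLDR entries.
reference_sample = "\u0e01\u0e02\u0e31\u0e34\u0e48\u0e49"  # two consonants plus four marks
candidate_sample = "\u0e01\u0e02\u0e31"  # same consonants, only one mark

for ch in missing_marks(reference_sample, candidate_sample):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

To run this against the real data, the two strings would have to be filled in from the linked YAML files.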

kontur commented 2 years ago

Thanks @lianghai for the information. I cannot speak as to what the original data is based on beyond the cited references. From a quick look it seems the base characters are identical; only CLDR lists half a dozen or so marks that Hyperglot is missing.

@MrBrezina and @sergiolmartins can comment more on the language data, but I am sure we are happy to review an amended list of marks. Additional information with what those marks represent and how they are used would be helpful. Hyperglot, generally speaking, treats marks as required for language support, so the threshold should indeed be that these are required to write the language.

@lianghai are you aware of an official standards body for Thai orthography, or academic resources that would carry equally normative value? It would be great to upgrade the status, and that usually requires a reasonably official resource to cross-check against.
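The rule that marks count as required can be illustrated with a minimal sketch. It is not Hyperglot's actual implementation: the function name `supports`, its parameters, and the sample code points are assumptions made for illustration. It assumes a font "supports" a language only when every base character and every listed mark is in its character set.

```python
def supports(font_codepoints: set[int], base: str, marks: str) -> bool:
    """Hypothetical check: all base characters AND all marks must be covered."""
    required = {ord(ch) for ch in base + marks}
    return required <= font_codepoints


# Illustrative data only, not real language or font data.
base_sample = "\u0e01\u0e02"    # two Thai consonants
marks_sample = "\u0e31\u0e48"   # two Thai combining marks

font_without_marks = {0x0E01, 0x0E02}
font_with_marks = {0x0E01, 0x0E02, 0x0E31, 0x0E48}

print(supports(font_without_marks, base_sample, marks_sample))
print(supports(font_with_marks, base_sample, marks_sample))
```

Under this assumption, a mark missing from the language data makes the check too lenient: a font without that mark would still pass.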

lianghai commented 2 years ago

Those missing marks are clearly required in the eyes of anyone who knows the Thai script. My intention is merely to see if there’s a quick path for me to bring Hyperglot’s Thai data at least up to the level of the three cited references. Others can then refine it with authoritative sources later.

MrBrezina commented 2 years ago

@lianghai sorry it took a while for me to get back to this one. The marks should be there, and I remember adding them. Perhaps some canonisation process removed them. Please do update these. I think @kontur can merge it later this week. Or, you know what, to make it faster I will make the obvious changes and you can correct them if any issues are left.

kontur commented 2 years ago

Thanks, this is released as part of 0.3.9.