Closed (lianghai closed this 2 years ago)
Thanks @lianghai for the information. I cannot speak as to what the original data is based on beyond the cited references. From a quick look, it seems the base characters are identical; CLDR just lists half a dozen or so marks that Hyperglot is missing.
@MrBrezina and @sergiolmartins can comment more on the language data, but I am sure we are happy to review an amended list of marks. Additional information on what those marks represent and how they are used would be helpful. Hyperglot, generally speaking, treats marks as required for language support, so the threshold should indeed be that these are required to write the language.
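For anyone wanting to reproduce the "same base characters, different marks" comparison mentioned above, here is a minimal sketch using only Python's standard `unicodedata` module. The helper names (`split_base_and_marks`, `missing_marks`) and the two sample strings are hypothetical placeholders for illustration; they are not the actual Hyperglot or CLDR Thai data and not part of Hyperglot's API.

```python
import unicodedata


def split_base_and_marks(chars):
    """Split characters into (bases, marks) using the Unicode general category."""
    bases, marks = set(), set()
    for ch in chars:
        if ch.isspace():
            continue  # character lists are often space-separated
        # Categories Mn, Mc and Me all start with "M" (combining marks).
        if unicodedata.category(ch).startswith("M"):
            marks.add(ch)
        else:
            bases.add(ch)
    return bases, marks


def missing_marks(ours, reference):
    """Return marks present in `reference` but absent from `ours`, sorted by code point."""
    _, our_marks = split_base_and_marks(ours)
    _, ref_marks = split_base_and_marks(reference)
    return sorted(ref_marks - our_marks)


# Placeholder samples (assumption: NOT the real data sets).
ours = "\u0e01\u0e02\u0e04 \u0e31\u0e34"
reference = "\u0e01\u0e02\u0e04 \u0e31\u0e34\u0e48\u0e49"

for ch in missing_marks(ours, reference):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Run against the real character lists, a non-empty result would point at exactly the marks that need reviewing.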
@lianghai are you aware of an official standards body for Thai orthography, or academic resources that would carry equally normative value? It would be great to upgrade the status, and usually a somewhat official resource to cross-check against is required for this.
Those missing marks are clearly required in the eyes of anyone who knows the Thai script. My intention is merely to see if there’s a quick path for me to bring Hyperglot’s Thai data at least to the same level as the three cited references. Others can then refine it with authoritative sources later.
@lianghai sorry it took a while for me to get back to this one. The marks should be there, and I remember adding them. Perhaps some canonisation process removed them. Please do update these. I think @kontur can merge it later this week. Or you know what, to make it faster, I will make the obvious changes and you can correct them if any issues are left.
Thanks, this is released as part of 0.3.9.
I was trying to utilize the Hyperglot data, then noticed that the Thai data doesn’t look right:
https://github.com/rosettatype/hyperglot/blob/7b4938b8453d118257c0277c5f04fd7c71afab46/lib/hyperglot/hyperglot.yaml#L9253-L9272
I know it’s still “primary” and to be reviewed. Before suggesting changes, I’d like to first understand how the current data was created, because it cites Omniglot, Wikipedia, and CLDR as sources, but apparently it differs a lot from what those sources provide (note the lack of many above-base vowel/tone marks). For example, the CLDR data:
https://github.com/rosettatype/hyperglot/blob/7b4938b8453d118257c0277c5f04fd7c71afab46/other/cldr.yaml#L3469-L3483
Is there somewhere I can learn about how the current Thai data was originally created? Or should we just forget about it and start from what CLDR has?