Replace [Ss]commaaccent with [Ss]cedilla as required for Turkic languages

rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts

http://hyperglot.rosettatype.com

GNU General Public License v3.0


Replace [Ss]commaaccent with [Ss]cedilla as required for Turkic languages #72

Closed alerque closed 2 years ago

alerque commented 2 years ago

Closes #71

alerque commented 2 years ago

This is much more messed up than this PR corrects. The Romanian derivations seem to have the correct alternates, but all the Turkic ones are messed up in one way or another.
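
For readers unfamiliar with the glyph names in the title, the cedilla and comma-accent forms are separate Unicode characters, not stylistic variants of one character. The sketch below is only an illustration: the name-to-code-point mapping is written out by hand (following the usual Adobe glyph naming) rather than read from hyperglot, and `describe` is a hypothetical helper.

```python
import unicodedata

# Glyph names from the PR title mapped to the characters they denote.
# Assumption: this table is hand-written for illustration, not taken
# from hyperglot.yaml.
GLYPHS = {
    "Scedilla": "\u015E",
    "scedilla": "\u015F",
    "Scommaaccent": "\u0218",
    "scommaaccent": "\u0219",
}


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for glyph_name, char in GLYPHS.items():
    print(f"{glyph_name:13} {char}  {describe(char)}")
```

The cedilla forms are the ones used in Turkish; the comma-below forms are the ones used in Romanian.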

MrBrezina commented 2 years ago

Thank you, Caleb. We will merge this in and set up a new issue (and investigation) in the other Turkic languages. cc: @kontur

MrBrezina commented 2 years ago

@alerque you have actually corrected this in the Latin Plus data, which was there only for reference. The database itself is in lib/hyperglot/hyperglot.yaml. I can edit it myself, no problem.

kontur commented 2 years ago

Happy for a PR on this, but let's not change the original reference data; if anything, we can add a corrected version of the Latin Plus data set, if it has been addressed there.

Presumably this applies to Turkish and Ottoman Turkish — any others?

alerque commented 2 years ago

Yes, others. I started going down the rabbit hole farther after opening this and edited some of the YAML files. I'll push that commit here just so it isn't lost, but it is incomplete.

I'd be happy to make the fix for real if you let me know which data set is canonical vs. which are derived.

At the very least Turkish, Ottoman Turkish, Gagauz, Kurdish (Latin), and Turkmen are affected, but that isn't an exhaustive list. I stopped looking when I realized that there were so many and I didn't know what data I was supposed to be editing.

kontur commented 2 years ago

hyperglot.yaml is the source of truth :)

If you have the package installed you can also run hyperglot-validate (checks the data in the yaml) and hyperglot-save (enforces some sorting and, for example, handles mark-related things).

Happy to answer questions or give pointers. We appreciate your input 👍

alerque commented 2 years ago

In that case let me go through the YAML file a bit more, clean up the bits I'm sure of, and maybe comment on the ones I only suspect. I'll force-push and tag for review when that's done.

alerque commented 2 years ago

I have updated this PR to be the two bits I'm pretty confident on.

For Ottoman Turkish the glyph list is just completely foobared as far as I can tell. To the best of my knowledge there are three common ways to Romanize Ottoman Turkish: using the modern Turkish orthography, using the IJMES transliteration system, or using the ALA-LC rules shown in the chart on Wikipedia.

The glyph list in the hyperglot data set is a jumble of all three, with some glyphs from each. This PR should complete the set for modern Turkish orthography (although it might be better to accomplish that with an include rather than listing them again). It also includes fragments from the other two, but neither is complete. I don't know what the goal is here. List all glyphs from all popular competing Romanization schemes? Anyway, I left that for another commit or PR so that this one can get reviewed and moved along, since the error in modern Turkish is pretty bad.

kontur commented 2 years ago

Thanks. I think for both Turkish and Ottoman Turkish (and probably others we can identify) the requirement for base should be the current established norm, i.e. the cedilla variants.

I think it would be useful to include the comma-variants as auxiliary to do justice to the reality that those may be encountered when typesetting those languages. At the same time, this would denote the cedilla-variants as preferred. I can add this when merging the PR next week, or feel free to add this still.
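
A minimal sketch of how that base/auxiliary split could play out when checking a font's character set. Everything here is an assumption for illustration: the character sets are heavily abbreviated stand-ins for what lives in hyperglot.yaml, and `support_level` is a hypothetical helper, not part of the Hyperglot API.

```python
# Hypothetical, abbreviated character sets; the real lists in
# hyperglot.yaml are much longer.
TURKISH_BASE = set("ÇçĞğİıÖöÜü") | {"\u015E", "\u015F"}  # S with cedilla: preferred
TURKISH_AUXILIARY = {"\u0218", "\u0219"}  # S with comma below: may be encountered


def support_level(font_chars, base, auxiliary):
    """Classify a font's coverage of one orthography.

    Returns "none" if any base character is missing, "base" if all base
    characters are present, and "base+auxiliary" if the auxiliary
    characters are covered as well.
    """
    font_chars = set(font_chars)
    if not base <= font_chars:
        return "none"
    if auxiliary <= font_chars:
        return "base+auxiliary"
    return "base"


# A font that only ships the comma-below forms does not cover the base set.
comma_only_font = set("ÇçĞğİıÖöÜü") | {"\u0218", "\u0219"}
print(support_level(comma_only_font, TURKISH_BASE, TURKISH_AUXILIARY))
```

Under this arrangement a font with only the comma-below forms is reported as not supporting Turkish, while the comma forms still count toward the fuller auxiliary level.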

The Ottoman transliteration is or should be based on the IJMES guide (https://www.cambridge.org/core/journals/international-journal-of-middle-east-studies/information/author-resources/ijmes-translation-and-transliteration-guide), which should be added as an actual source in the yaml, not just as a note. I will cross-check against e.g. https://www.cambridge.org/core/services/aop-file-manager/file/57d83390f6ea5a022234b400/TransChart.pdf to un-foobar it, if it currently is.

kontur commented 2 years ago

Thanks again @alerque for the contribution. The changes to Turkish are now published in 0.3.8 as well as on the Hyperglot website. We'll review the other related languages bit by bit.