Closed by alerque 2 years ago
This is much more messed up than this PR corrects. The Romanian derivations seem to have the correct alternates, but all the Turkic ones are messed up in one way or another.
Thank you, Caleb. We will merge this in and set up a new issue (and investigation) for the other Turkic languages. cc: @kontur
@alerque you have actually corrected this in the Latin Plus data, which was there only for reference. The database is in lib/hyperglot/hyperglot.yaml. I can edit it myself, no problem.
Happy for a PR on this, but let's not change the original reference data; if anything, we can add a corrected version of the latin plus data set, if it has been addressed there.
Presumably this applies to Turkish and Ottoman Turkish — any others?
Yes, others. I started going down the rabbit hole farther after opening this and edited some of the YAML files. I'll push that commit here just so it isn't lost, but it is incomplete.
I'd be happy to make the fix for real if you let me know which data set is canonical vs. which are derived.
At the very least Turkish, Ottoman Turkish, Gagauz, Kurdish (Latin), and Turkmen are affected, but that isn't an exhaustive search. I stopped looking when I realized that there were so many and I didn't know what data I was supposed to be editing.
hyperglot.yaml is the source of truth :)
If you have the package installed you can also run hyperglot-validate (checks the data in the yaml) and hyperglot-save (enforces some sortings and for example mark related things).
Happy to answer questions or give pointers. We appreciate your input 👍
In that case let me go through the YAML file a bit more, clean up the bits I'm sure of, and maybe comment on the ones I suspect. I'll force-push and tag for review when that's done.
I have updated this PR to be the two bits I'm pretty confident on.
For Ottoman Turkish the glyph list is just completely foobared as far as I can tell. To the best of my knowledge there are three common ways to Romanize Ottoman Turkish: using the modern Turkish orthography, using the IJMES transliteration system, or using the ALA-LC rules shown in the chart on Wikipedia.
The glyph list in the hyperglot data set is a jumble of all three, with some glyphs from each. This PR should complete the set for modern Turkish (although it might be better to accomplish that with an include rather than listing them again). The list also includes fragments from the other two schemes, but neither is complete. I don't know what the goal is here. List all glyphs from all popular competing Romanization schemes? Anyway, I left that for another commit or PR so that this one can get reviewed and moved along, since the error in modern Turkish is pretty bad.
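To make the "complete set for modern Turkish" check concrete, here is a minimal Python sketch. It only assumes the standard 29-letter modern Turkish alphabet; the function name `missing_letters` and the sample glyph list are hypothetical and are not taken from the Hyperglot data or tooling.

```python
import unicodedata

# The 29 lowercase letters of the modern Turkish alphabet (NFC-normalized).
TURKISH_LOWER = unicodedata.normalize(
    "NFC", "a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z"
).split()


def missing_letters(glyph_list, required=TURKISH_LOWER):
    """Return the required letters absent from a space-separated glyph list."""
    present = set(unicodedata.normalize("NFC", glyph_list).split())
    return [ch for ch in required if ch not in present]


# Hypothetical glyph list that uses s with comma below (U+0219)
# where Turkish needs s with cedilla (U+015F).
sample = " ".join(TURKISH_LOWER).replace("\u015f", "\u0219")

if __name__ == "__main__":
    print(missing_letters(sample))
```

A check like this flags a glyph list that looks complete to the eye but carries the wrong code point for a letter.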
Thanks. I think for both Turkish and Ottoman Turkish (and probably others we can identify) the requirement for base should be the current established norm, so using the cedilla-variants.
I think it would be useful to include the comma-variants as auxiliary to do justice to the reality that those may be encountered when typesetting those languages. At the same time, this would denote the cedilla-variants as preferred. I can add this when merging the PR next week, or feel free to add this still.
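For reference, the cedilla and comma-below letters are separate Unicode code points, so they can be told apart mechanically. The sketch below is only an illustration of that distinction: `classify` and `comma_alternates` are hypothetical helper names, and the code works on a plain space-separated string rather than on the real hyperglot.yaml.

```python
import unicodedata

# Cedilla letters that have a comma-below counterpart in Unicode.
COMMA_FOR = {
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
}


def classify(ch):
    """Label a single character by its Unicode name suffix."""
    name = unicodedata.name(ch, "")
    if name.endswith("WITH CEDILLA"):
        return "cedilla"
    if name.endswith("WITH COMMA BELOW"):
        return "comma"
    return "other"


def comma_alternates(base):
    """List comma-below counterparts of the cedilla letters found in base."""
    return [COMMA_FOR[ch] for ch in base.split() if ch in COMMA_FOR]


if __name__ == "__main__":
    for ch in "\u015f\u0219\u0163\u021b":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), classify(ch))
    # Candidates for auxiliary, given cedilla letters listed in base.
    print(comma_alternates("\u00c7 \u015e \u00e7 \u015f"))
```

Ç and ç have no comma-below counterpart in the mapping, so only the S (and, where present, T) letters produce alternates.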
The Ottoman transliteration is or should be based on https://www.cambridge.org/core/journals/international-journal-of-middle-east-studies/information/author-resources/ijmes-translation-and-transliteration-guide (this should be added as an actual source in the yaml, not just as a note). I will cross-check against e.g. https://www.cambridge.org/core/services/aop-file-manager/file/57d83390f6ea5a022234b400/TransChart.pdf to un-foobar it, if it currently is.
Thanks again @alerque for the contribution. The changes to Turkish are now published in 0.3.8 as well as on the Hyperglot website. We'll review the other related languages bit by bit.
Closes #71