Closed: skmnktl closed this issue 3 years ago.
> Why are we using position to identify character equivalences across character sets? Isn't a direct mapping more stable and easier to maintain?
Agreed! I feel something like https://github.com/indic-transliteration/indic_transliteration_scala/blob/dfabc9242cf556a07339f304c5e195b8443e2a40/src/main/scala/sanskritnlp/transliteration/indic/northern.scala#L16 is superior.
> Something roughly along the following lines in YAML:
>
> ```yaml
> - character_type: "vowel"
>   baraha: { full: "a", half: "a" }
>   devanagari: { full: "अ", half: "ा" }
> ```
My order of preference: TOML > JSON5 > JSON / YAML.
> If it's just the time/pain of converting existing code, I'll add it to my list of things to do in the future.
Go for it - subject to the following:
> Great, let's perhaps:
>
> - Establish that agreement you mention above.

Start with something like what I pointed to in https://github.com/indic-transliteration/common_maps/issues/5#issuecomment-886081337 , but be prepared to change it if superior ideas come along before you finish. That should be some easy tweak for your "scheme-transformer" script.

> - I can get a sense of what "fix all code" means. While I can commit to doing it, it will have to be in discrete steps.

Not sure what you're expecting here - but you'll have to fix https://github.com/indic-transliteration/indic_transliteration_py/blob/d12395c91d1a62ba088c01bb8ea6ab7b69763e37/indic_transliteration/sanscript/__init__.py#L108 in the case of py, and the corresponding function in the case of js. Then make sure that the tests continue to pass.
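(For concreteness, a minimal sketch of what a TOML-based scheme loader might look like - this is an illustration, not the actual sanscript function, and the data-directory layout is assumed:)

```python
# Illustrative only: load every scheme from per-script TOML files in a directory.
import os
import tomllib  # Python 3.11+; the third-party "toml" package works similarly

def load_schemes(data_dir):
    schemes = {}
    for file_name in sorted(os.listdir(data_dir)):
        if file_name.endswith(".toml"):
            with open(os.path.join(data_dir, file_name), "rb") as f:
                schemes[file_name[: -len(".toml")]] = tomllib.load(f)
    return schemes
```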
> To start with, I propose
>
> - we take BARAHA, IAST and DEVANAGARI and start our prototype map.
> - write the code (I can do python, but JS is going to be tough.)

Good idea.

> The actual filling out of the map will have to happen as time permits.

Not sure what you mean by "filling out of the map" - some "scheme-transformer" script should do it in a jiffy.
```toml
[0]

[0.devanagari]
f = ["अ"]
h = [""]

[0.baraha]
f = ["a"]
h = ["a"]

[0.iso15919]
f = ["a"]
h = ["a"]

[1]

[1.devanagari]
f = ["आ"]
h = ["ा"]

[1.baraha]
f = ["A", "aa"]
h = ["A", "aa"]

[1.iso15919]
f = ["ā"]
h = ["ā"]
```
You mean a single giant file encompassing all scripts? No no. Separate file for each script. (Sorry to have misunderstood earlier.)
Why not? There are several advantages to it: additions and updates are much easier.
The biggest disadvantage I can think of is that having to parse through the file to read/find each character might slow down transliteration. But that's easy enough to fix: once the map is ready, each time we transliterate from some charset X to Y, we generate a dictionary mapping X to Y.
E.g. `{"अ": "a", "आ": "ā"}` for deva to iast from the snippet above.
> Why not? There are several advantages to it: additions and updates are much easier.
I disagree. Much more scrolling. Short files are far easier to fix. E.g. see recent PR-s in this repo.
@arvindd - Any objection if we switch to TOML? See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml .
Motivation:

- TOML is simple, but allows comments and is more readable.
- Comments are sometimes desirable, but @skmnktl 's recent baraha map (which I unwittingly accepted) broke both js and py tests because of added comment fields.
- Good coverage of languages - https://github.com/toml-lang/toml/wiki - though I don't see F# there. If really necessary, we can have an automated CI job to produce json on a separate branch.
(FWIW, I've added a migrator at https://github.com/indic-transliteration/indic_transliteration_py/blob/master/indic_transliteration/sanscript/schemes/migrator.py )
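(For illustration - not the linked migrator.py - the core of such a JSON-to-TOML migration could be as simple as the following, using the third-party toml package; the schemes/data path is hypothetical:)

```python
# Illustrative JSON -> TOML migration over a directory of scheme files.
import json
import pathlib
import toml  # third-party package offering both parsing and serialization

for json_path in pathlib.Path("schemes/data").glob("*.json"):
    data = json.loads(json_path.read_text(encoding="utf-8"))
    json_path.with_suffix(".toml").write_text(toml.dumps(data), encoding="utf-8")
```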
@vvasuki Apologies! I thought it was a straight lookup and didn't anticipate things breaking with the addition of new keys.
That said (while we're doing all this anyway), would the following format address any concerns? A lot less scrolling, and we keep like chars in the same file.
I hacked the file together using the table on wiki, so it's missing half-forms and the like. But I figured I might as well be vocal while we're thinking about this migration. To me, the migrated toml files linked above don't address the issue I'm concerned about: namely, using position to identify equivalent characters.
[0] ISO = "a" 7bitISO = "a" Deva = "अ" Aran = "اَ\u200e" Beng = "অ" Guru = "ਅ" Gujr = "અ" Orya = "ଅ" Taml = "அ" Telu = "అ" Knda = "ಅ" Mlym = "അ" Sinh = "අ" Brha = "a"
[1] ISO = "ā" 7bitISO = "aa" Deva = "आ" Aran = "آ\u200e" Beng = "আ" Guru = "ਆ" Gujr = "આ" Orya = "ଆ" Taml = "ஆ" Telu = "ఆ" Knda = "ಆ" Mlym = "ആ" Sinh = "ආ" Brha1 = ["A", "aa"]
> That said (while we're doing all this anyway), would the following format address any concerns? A lot less scrolling, and we keep like chars in the same file.
Better to have one file per script, rather than one per character. Often one wants to fix the map for a script while looking at ligatures for various characters in that script without reference to other scripts.
Just updated the proposed map - https://github.com/indic-transliteration/indic_transliteration_py/commit/af223f06e235eeacba019be12bcf1c01b67ec031 . Perhaps this is what you seek? It is clearer and less hairy to modify.
This is much better, but I'm suggesting something slightly different: each file should represent a single character. So the file below is named "0", but it represents the character "a". This avoids one specific issue: if a character doesn't exist in devanagari, we don't have to do some contrived thing to get it in there.
Character set mappings aren't bijective functions, so rather than picking a single domain that we map everything to, I prefer defining equivalence classes. This makes it easier to define defaults, i.e. take unions of incomplete domains; for example, if I am transliterating from Kannada to Telugu, this mapping requires me to move from Kannada to Deva to Telugu. That can be awkward for the vowel "L" we have in telugu and kannadam; ಌ | ೡ are unique to these languages. With an equivalence class for that character, the move is just direct. And if a character doesn't exist for a given range, we just order the existing ranges to default to a character in a domain that the mapping exists for.
So in the example below, suppose we're mapping from Malayalam to Urdu: since the character doesn't exist in the latter, we have an order of preference, say ISO > Aran > Beng; it finds the first of these and uses that (see the sketch after the file below). It's also much easier to add to: grep for some character in an encoding you know you have, and add the new char to it. So if you wanted to add SLP1 to this list, it's just adding SLP = "a" to 0.toml and saving the file.
File 0.toml:

```toml
[0]
ISO = "a"
7bitISO = "a"
Deva = "अ"
Aran = "اَ\u200e"
Beng = "অ"
Guru = "ਅ"
Gujr = "અ"
Orya = "ଅ"
Taml = "அ"
Telu = "అ"
Knda = "ಅ"
Mlym = "അ"
Sinh = "අ"
Brha = "a"
```
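A minimal sketch of that preference-order fallback, assuming a per-character file like 0.toml above; the PREFERENCE order is just the example from the comment, and SLP stands in for a script the file doesn't cover yet:

```python
import tomllib

PREFERENCE = ["ISO", "Aran", "Beng"]  # example fallback order from the comment

def render(entry, target_script):
    # Use the target script's own form when present; otherwise fall back in order.
    if target_script in entry:
        return entry[target_script]
    for script in PREFERENCE:
        if script in entry:
            return entry[script]
    raise KeyError("no representation available for this character")

with open("0.toml", "rb") as f:
    entry = tomllib.load(f)["0"]

print(render(entry, "Mlym"))  # 'അ' - present directly
print(render(entry, "SLP"))   # 'a' - SLP is absent, so the ISO form is used
```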
One more point: the mapping I'm proposing is a class of functions of which the ones you defined (f: Deva -> SCRIPT_X) are a subset.
> This is much better, but I'm suggesting something slightly different: each file should represent a single character.
I understood what you're suggesting. Just that it is inferior from the perspective of maintenance and contributions. If I told someone who wants to contribute a new map - "mm - edit these 56 files" (even if the edits are small), imagine how that person will feel. Contrast this with what you did with the baraha map.
And it is irritating to have to navigate through a lot of guck one does not care about (representations in other scripts) in order to fix something.
> More precise mathy explanation maybe?
>
> Character set mappings aren't bijective functions, so rather than picking a single domain that we map everything to, I prefer defining equivalence classes. This makes it easier to define defaults, i.e. take unions of incomplete domains; for example, if I am transliterating from Kannada to Telugu, this mapping requires me to move from Kannada to Deva to Telugu.
"Kannada to Deva to Telugu" is transfromed into "Kannada to Telugu" within the programs. Our definitions are simpler - and minimal - meant to be more human friendly than machine friendly. It's up to the consumer to get efficient mappings (which by the way are stored in memory for speed in case of python) out of these.
> That can be awkward for the vowel "L" we have in telugu and kannadam; ಌ | ೡ are unique to these languages.
Who told you this? These letters are present in devanAgarI as well - ऌ ॡ । More generally, the assumption here is that we can write ANY sound (in any Indic language we care about) in devanAgarI, with the use of nuktas etc.
> Example of Issue
>
> So in the example below, suppose we're mapping from Malayalam to Urdu: since the character doesn't exist in the latter, we have an order of preference, say ISO > Aran > Beng.
That's not hard to do no matter what representation we use to define scripts/schemes. It's an implementation issue.
Got it. We're now in the realm of personal preference I guess? I prefer the lower cognitive load of 56 files each with the same data. And the idea of devanāgarī primacy irks me. But a second look at all the files tells me that all files have all the characters, so we're good I guess.
@arvindd - Any objection if we switch to TOML? ~~See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml~~ See https://github.com/indic-transliteration/common_maps/issues/5#issuecomment-888816053
Awaiting a response to the above from @arvindd; reopening with a changed issue title.
> @arvindd - Any objection if we switch to TOML? See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml .
>
> Motivation:
>
> - TOML is simple, but allows comments and is more readable.
> - Comments are sometimes desirable, but @skmnktl 's recent baraha map (which I unwittingly accepted) broke both js and py tests because of added comment fields.
> - Good coverage of languages - https://github.com/toml-lang/toml/wiki - though I don't see F# there. If really necessary, we can have an automated CI job to produce json on a separate branch.
Sure @vvasuki, no problem moving to TOML. I agree that it's the best format too, especially because it supports comments. There are at least two C# parsers for TOML, so any .NET language (which includes F#) can easily use them - so there probably is no need to also create JSONs from TOML. Of course, if I run into any problems there (i.e., if those parsers are not good / half-implemented / etc.), we always have JSON to fall back to, as you mentioned.
Thanks for notifying me!
It was hairier than expected, but we've switched to TOML now. JS and PY tests pass.