indic-transliteration / common_maps


Better script definitions/ maps #5

Closed skmnktl closed 3 years ago

skmnktl commented 3 years ago

Why are we using position to identify character equivalences across character sets? Isn't a direct mapping more stable and easier to maintain? Something roughly along the following lines in YAML:

If it's just time/pain of converting existing code, I'll add it to my things to do for the future list.

vvasuki commented 3 years ago

Why are we using position to identify character equivalences across character sets? Isn't a direct mapping more stable and easier to maintain?

Agreed! I feel something like https://github.com/indic-transliteration/indic_transliteration_scala/blob/dfabc9242cf556a07339f304c5e195b8443e2a40/src/main/scala/sanskritnlp/transliteration/indic/northern.scala#L16 is superior.

Something roughly along the following lines in YAML:

    - character_type: "vowel"
      baraha:
        full: "a"
        half: "a"
      devanagari:
        full: "अ"
        half: "ा"

My order of preference: TOML > JSON5 > JSON / YAML.

If it's just time/pain of converting existing code, I'll add it to my things to do for the future list.

Go for it - subject to:

skmnktl commented 3 years ago

Great, let's perhaps:

  1. Establish that agreement you mention above.
  2. I can get a sense of what "fix all code" means. While I can commit to doing it, it will have to be in discrete steps.

To start with, I propose

  1. we take BARAHA, IAST and DEVANAGARI and start our prototype map.
  2. write the code (I can do python, but JS is going to be tough.)

The actual filling out of the map will have to happen as time permits.

vvasuki commented 3 years ago

Great, let's perhaps:

  1. Establish that agreement you mention above.

Start with something like what I pointed to in https://github.com/indic-transliteration/common_maps/issues/5#issuecomment-886081337 , but be prepared to change it if superior ideas come along before you finish. That should be some easy tweak for your "scheme-transformer" script.

  1. I can get a sense of what "fix all code" means. While I can commit to doing it, it will have to be in discrete steps.

Not sure what you're expecting here - but you'll have to fix https://github.com/indic-transliteration/indic_transliteration_py/blob/d12395c91d1a62ba088c01bb8ea6ab7b69763e37/indic_transliteration/sanscript/__init__.py#L108 in case of py, and the corresponding function in case of js. Then make sure that the tests continue to pass.

To start with, I propose

  1. we take BARAHA, IAST and DEVANAGARI and start our prototype map.
  2. write the code (I can do python, but JS is going to be tough.)

Good idea.

The actual filling out of the map will have to happen as time permits.

Not sure what you mean by "filling out of the map" - some "scheme-transformer" script should do it in a jiffy.
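Such a scheme-transformer could be sketched along these lines (a hypothetical helper assuming the old format's position-based parallel lists; this is not the actual migrator script):

```python
# Hypothetical sketch: convert the old position-based parallel lists
# into an explicit keyed map. List contents are illustrative only.

def positional_to_keyed(reference: list[str], scheme: list[str]) -> dict[str, str]:
    """Pair the i-th character of `scheme` with the i-th reference character."""
    if len(reference) != len(scheme):
        raise ValueError("parallel lists must have equal length")
    return dict(zip(reference, scheme))

# Old style: equivalence is implied purely by list position.
deva_vowels = ["अ", "आ", "इ"]
iast_vowels = ["a", "ā", "i"]

keyed = positional_to_keyed(deva_vowels, iast_vowels)
# keyed == {"अ": "a", "आ": "ā", "इ": "i"}
```

Once the map is keyed, adding or reordering characters no longer risks silently shifting every mapping after the insertion point.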

skmnktl commented 3 years ago

    [0]
    [0.devanagari]
    f = ["अ"]
    h = [""]
    [0.baraha]
    f = ["a"]
    h = ["a"]
    [0.iso15919]
    f = ["a"]
    h = ["a"]

    [1]
    [1.devanagari]
    f = ["आ"]
    h = ["ा"]
    [1.baraha]
    f = ["A", "aa"]
    h = ["A", "aa"]
    [1.iso15919]
    f = ["ā"]
    h = ["ā"]

vvasuki commented 3 years ago

You mean a single giant file encompassing all scripts? No no. Separate file for each script. (Sorry to have misunderstood earlier.)

skmnktl commented 3 years ago

Why not? There are several advantages to it: Additions and updates are much easier.

The biggest disadvantage I can think of is that having to parse through the file to find each character might slow down transliteration. But that's easy enough to fix: once the map is ready, each time we transliterate from some charset X to Y, we generate a dictionary mapping X to Y.

E.g. {"अ": 'a', "आ":'ā'} for deva to iast from the snippet above.
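That precomputation step can be sketched roughly as follows (the equivalence-class layout and script names below are illustrative assumptions, not the repo's actual schema):

```python
# Toy equivalence-class table: one entry per abstract character,
# keyed by script name. The layout is an assumption for illustration.
CLASSES = [
    {"devanagari": "अ", "iast": "a", "baraha": "a"},
    {"devanagari": "आ", "iast": "ā", "baraha": "A"},
]

def build_map(src: str, dst: str) -> dict[str, str]:
    """Generate a direct src -> dst lookup table once, then reuse it."""
    return {
        entry[src]: entry[dst]
        for entry in CLASSES
        if src in entry and dst in entry
    }

deva_to_iast = build_map("devanagari", "iast")
# deva_to_iast == {"अ": "a", "आ": "ā"}, matching the snippet above
```

The table is parsed once at load time, so per-character lookups during transliteration stay O(1).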

vvasuki commented 3 years ago

Why not? There are several advantages to it: Additions and updates are much easier.

I disagree. Much more scrolling. Short files are far easier to fix. E.g. see recent PR-s in this repo.

vvasuki commented 3 years ago

@arvindd - Any objection if we switch to TOML? See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml .

Motivation:

  • TOML is simple, but allows comments and is more readable.
  • Comments are sometimes desirable, but @skmnktl 's recent baraha map (which I unwittingly accepted) broke both js and py tests because of added comment fields.
  • Good coverage of languages - https://github.com/toml-lang/toml/wiki - though I don't see F# there. If really necessary, we can have an automated CI job to produce json on a separate branch.

vvasuki commented 3 years ago

(FWIW, I've added a migrator at https://github.com/indic-transliteration/indic_transliteration_py/blob/master/indic_transliteration/sanscript/schemes/migrator.py )

skmnktl commented 3 years ago

@vvasuki Apologies! I thought it was a straight look up and didn't anticipate things breaking with the addition of new keys.

That said-- while we're doing all this anyway-- would the following format address any concerns? A lot less scrolling, and we keep like chars in the same file.

I hacked the file together using the table on wiki, so it's missing half-forms and the like. But I figured I might as well be vocal while we're thinking about this migration. To me the migrated toml files linked above don't address the issues I'm concerned about: namely using position to identify equivalent characters.

File 0.toml

    [0]
    ISO = "a"
    7bitISO = "a"
    Deva = "अ"
    Aran = "اَ\u200e"
    Beng = "অ"
    Guru = "ਅ"
    Gujr = "અ"
    Orya = "ଅ"
    Taml = "அ"
    Telu = "అ"
    Knda = "ಅ"
    Mlym = "അ"
    Sinh = "අ"
    Brha = "a"

File 1.toml

    [1]
    ISO = "ā"
    7bitISO = "aa"
    Deva = "आ"
    Aran = "آ\u200e"
    Beng = "আ"
    Guru = "ਆ"
    Gujr = "આ"
    Orya = "ଆ"
    Taml = "ஆ"
    Telu = "ఆ"
    Knda = "ಆ"
    Mlym = "ആ"
    Sinh = "ආ"
    Brha1 = ["A", "aa"]

vvasuki commented 3 years ago

That said-- while we're doing all this anyway-- would the following format address any concerns? A lot less scrolling, and we keep like chars in the same file.

Better to have one file per script, rather than one per character. Often one wants to fix the map for a script while looking at ligatures for various characters in that script without reference to other scripts.

vvasuki commented 3 years ago

Just updated the proposed map - https://github.com/indic-transliteration/indic_transliteration_py/commit/af223f06e235eeacba019be12bcf1c01b67ec031 . Perhaps this is what you seek? It is clearer and less hairy to modify.

skmnktl commented 3 years ago

This is much better, but I'm suggesting something slightly different. Each file should represent a single character. So the file below is named "0", but it represents the character "a". This avoids one specific issue-- if a character doesn't exist in devanagari, we don't have to do some contrived thing to get it in there.

More precise mathy explanation maybe?

Character set mappings aren't bijective functions, so rather than picking a single domain that we map everything to, I prefer defining equivalence classes. This makes it easier to define defaults, i.e. take unions of incomplete domains; for example, if I am transliterating from Kannada to Telugu, this mapping requires me to move from Kannada to Deva to Telugu. That can be awkward for the vowel "L" we have in telugu and kannadam; ಌ | ೡ are unique to these languages. With an equivalence class for that character, the move is just direct. And if a character doesn't exist for a given range, we just order the existing ranges to default to a character in a domain that the mapping exists for.

Example of Issue

So in the example below, suppose we're mapping from Malayalam to Urdu: since the latter character doesn't exist, we have an order of preference, say ISO>Aran>Beng. It finds the first of these and uses that. It's also much easier to add to: grep for some character in an encoding you know you have, and add the new char next to it. So if you wanted to add SLP1 to this list, it's just adding SLP = "a" to 0.toml and saving the file.

File 0.toml

    [0]
    ISO = "a"
    7bitISO = "a"
    Deva = "अ"
    Aran = "اَ\u200e"
    Beng = "অ"
    Guru = "ਅ"
    Gujr = "અ"
    Orya = "ଅ"
    Taml = "அ"
    Telu = "అ"
    Knda = "ಅ"
    Mlym = "അ"
    Sinh = "අ"
    Brha = "a"
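A minimal sketch of that fallback, assuming a toy per-character table and the ISO>Aran>Beng preference order from the example (the layout, data, and function name are illustrative, not the repo's actual schema):

```python
# Sketch of the described fallback: if the target script has no entry
# for a character, walk an order of preference. Data is illustrative.
PREFERENCE = ["ISO", "Aran", "Beng"]

# One equivalence class per character; "Aran" deliberately has no value.
char_class = {"ISO": "ḻ", "Mlym": "ഴ"}

def render(char_class: dict[str, str], target: str, preference: list[str]) -> str:
    """Return the target-script form, falling back along `preference`."""
    if target in char_class:
        return char_class[target]
    for script in preference:
        if script in char_class:
            return char_class[script]
    raise KeyError("no representation found in any preferred script")

# Mapping Malayalam "ഴ" to Urdu (Aran): no Aran form exists, so ISO wins.
rendered = render(char_class, "Aran", PREFERENCE)
# rendered == "ḻ"
```

The preference list lives alongside the data, so changing the fallback policy is a one-line edit rather than a code change.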

skmnktl commented 3 years ago

One more point: The mapping I'm proposing is a class of functions of which the ones you defined are a subset: f: Deva -> SCRIPT_X.

vvasuki commented 3 years ago

This is much better, but I'm suggesting something slightly different. Each file should represent a single character.

I understood what you're suggesting. Just that it is inferior from the perspective of maintenance and contributions. If I told someone who wants to contribute a new map - "mm - edit these 56 files" (even if the edits are small), imagine how that person will feel. Contrast this with what you did with the baraha map.

And it is irritating to have to navigate through lot of guck one does not care about (representations in other scripts) in order to fix something.

More precise mathy explanation maybe?

Character set mappings aren't bijective functions, so rather than picking a single domain that we map everything to, I prefer defining equivalence classes. This makes it easier to define defaults, i.e. take unions of incomplete domains; for example, if I am transliterating from Kannada to Telugu, this mapping requires me to move from Kannada to Deva to Telugu.

"Kannada to Deva to Telugu" is transformed into "Kannada to Telugu" within the programs. Our definitions are simpler - and minimal - meant to be more human friendly than machine friendly. It's up to the consumer to get efficient mappings (which, by the way, are stored in memory for speed in case of python) out of these.
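That collapse of the pivot can be sketched as composing two Deva-keyed maps into one direct map (the function name and data below are illustrative assumptions, not the actual py implementation):

```python
# Two maps keyed by devanagari (the pivot script); data is illustrative.
deva_to_kannada = {"अ": "ಅ", "आ": "ಆ"}
deva_to_telugu = {"अ": "అ", "आ": "ఆ"}

def compose_via_pivot(pivot_to_src: dict[str, str],
                      pivot_to_dst: dict[str, str]) -> dict[str, str]:
    """Collapse src <- pivot -> dst into a direct src -> dst map."""
    return {
        src_char: pivot_to_dst[pivot_char]
        for pivot_char, src_char in pivot_to_src.items()
        if pivot_char in pivot_to_dst
    }

kannada_to_telugu = compose_via_pivot(deva_to_kannada, deva_to_telugu)
# kannada_to_telugu == {"ಅ": "అ", "ಆ": "ఆ"}
```

The composition happens once at load time, so transliteration itself never routes a character through the pivot script.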

That can be awkward for the vowel "L" we have in telugu and kannadam; ಌ | ೡ are unique to these languages.

Who told you this? These letters are present in devanAgarI as well - ऌ ॡ । More generally the assumption here is that we can write ANY sound (in any Indic language we care about) in devanAgarI, with the use of nuktas etc..

Example of Issue

So in the example below, suppose we're mapping from Malayalam to Urdu, since the latter character doesn't exist, we have an order of preference, say ISO>Aran>Beng.

That's not hard to do no matter what representation we use to define scripts/ schemes. Implementation issue.

skmnktl commented 3 years ago

Got it. We're now in the realm of personal preference I guess? I prefer the lower cognitive load of 56 files each with the same data. And the idea of devanāgarī primacy irks me. But a second look at all the files tells me that all files have all the characters, so we're good I guess.

vvasuki commented 3 years ago

@arvindd - Any objection if we switch to TOML? ~See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml~ See https://github.com/indic-transliteration/common_maps/issues/5#issuecomment-888816053

Awaiting response for above from @arvindd and reopening with changed issue title.

arvindd commented 3 years ago

@arvindd - Any objection if we switch to TOML? See files at https://github.com/indic-transliteration/indic_transliteration_py/tree/a4166dcc3608d6a36e385029aa6fddc98ec81d03/indic_transliteration/sanscript/schemes/data_toml .

Motivation:

  • TOML is simple, but allows comments and is more readable.
  • Comments are sometimes desirable, but @skmnktl 's recent baraha map (which I unwittingly accepted) broke both js and py tests because of added comment fields.
  • Good coverage of languages - https://github.com/toml-lang/toml/wiki - though I don't see F# there. If really necessary, we can have an automated CI job to produce json on a separate branch.

Sure @vvasuki, no problem moving to TOML. I agree that's the best format too, especially because it supports comments. There are at least two C# parsers for TOML - so any .NET language (which includes F#) can easily use them - so there is probably no need to also create JSONs from TOML. Of course, if I run into any problems there (i.e., if those parsers are not good / half implemented / etc.), we always have JSON to fall back to, as you mentioned.

Thanks for notifying me!

vvasuki commented 3 years ago

It was hairier than expected, but switched to TOML now. JS and PY tests pass.