indic-transliteration / sanscript.js

Transliteration package for Indian scripts
MIT License

Wrong diacritics for Devanagari -> ISO/IAST/ITRANS #43

Open bwasty opened 2 years ago

bwasty commented 2 years ago

I found several issues with transliterating diacritics from Devanagari (Hindi):

vvasuki commented 2 years ago

I found several issues with transliterating diacritics from Devanagari (Hindi):

  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

  • सड़क -> saḍa़ka (iso/iast)

What should this be in ISO?

  • फ़ोन -> pha़ōna (iso/iast)

f is expected I suppose. Contribute a fix?

  • ज़्यादा -> ja़yAdA (itrans; other way correct: zyaada)

Contribute a fix?

By the way, great project, wrote 2 small tools with it already:

bwasty commented 2 years ago
  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

m̐k. Same in IAST (according to this. Here ˜ is shown, though the discussion page suggests is correct)

  • सड़क -> saḍa़ka (iso/iast)

What should this be in ISO?

saṛaka in ISO. For IAST it's not specified - so remove the dangling dot maybe? or use the same? For ITRANS it should be .Da or .Ra.

Related: ढ़ should become ṛha in ISO and .Dha/Rha in ITRANS.

  • फ़ोन -> pha़ōna (iso/iast)

f is expected I suppose. Contribute a fix?

Yes, for ISO and ITRANS. For IAST it's not specified - maybe do the same anyway?

  • ज़्यादा -> ja़yAdA (itrans; other way correct: zyaada)

Contribute a fix?

I'm not sure I understand Devanagari well enough yet (literally started learning a week ago), but I might try :)

vvasuki commented 2 years ago
  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

m̐k. Same in IAST (according to this. Here ˜ is shown, though the discussion page suggests is correct)

No - you seem to be confusing कँ with कॅ.

bwasty commented 2 years ago

Ah, right, damn. Wikipedia shows ê for and . The Unicode block shows a few more characters with a 'candra', but I guess they have no transliteration?

vvasuki commented 2 years ago

Basically, the problem is that transliterateBrahmic assumes it's OK to transliterate character by character. It does not consider the maximum token length, so a consonant followed by a nukta (two code points, as in ड़) is never looked up as one unit (unlike https://github.com/indic-transliteration/indic_transliteration_py/blob/99fe6b2fd5b220794d1709e3297c919d58c4cfcc/indic_transliteration/sanscript/brahmic_mapper.py). Porting the Python code might work.
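
To make the idea concrete, here is a minimal sketch of greedy longest-match tokenization. It is not the actual sanscript.js or brahmic_mapper.py code: the tables, the `transliterate` function and `maxTokenLength` are hypothetical names for illustration, the mapping covers only the handful of letters from this issue, and the ISO values for the nukta letters are the ones proposed above.

```javascript
// Hypothetical toy tables for illustration only; NOT the real sanscript.js scheme data.
const VIRAMA = "\u094D"; // ्

// Consonants, including consonant + nukta (U+093C) sequences as single tokens.
const consonants = new Map([
  ["\u0915", "k"],            // क
  ["\u0938", "s"],            // स
  ["\u0928", "n"],            // न
  ["\u092F", "y"],            // य
  ["\u0926", "d"],            // द
  ["\u0921", "\u1E0D"],       // ड  -> ḍ
  ["\u0921\u093C", "\u1E5B"], // ड़ -> ṛ
  ["\u092B", "ph"],           // फ
  ["\u092B\u093C", "f"],      // फ़ -> f
  ["\u091C", "j"],            // ज
  ["\u091C\u093C", "z"],      // ज़ -> z
]);

// Dependent vowel signs that replace the inherent "a".
const vowelSigns = new Map([
  ["\u093E", "\u0101"], // ा -> ā
  ["\u094B", "\u014D"], // ो -> ō
]);

// Longest key in the consonant table; the lookup window never needs to be wider.
const maxTokenLength = Math.max(...[...consonants.keys()].map((k) => k.length));

function transliterate(input) {
  // NFD splits precomposed nukta letters (e.g. U+095C) into base + U+093C,
  // so both spellings hit the same table entry.
  const text = input.normalize("NFD");
  let out = "";
  let i = 0;
  while (i < text.length) {
    // Try the longest candidate first, then shrink the window.
    let token = null;
    for (let len = Math.min(maxTokenLength, text.length - i); len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (consonants.has(candidate)) {
        token = candidate;
        break;
      }
    }
    if (token === null) {
      out += text[i]; // unknown character: pass through unchanged
      i += 1;
      continue;
    }
    out += consonants.get(token);
    i += token.length;
    const next = text[i];
    if (next === VIRAMA) {
      i += 1; // virama suppresses the inherent vowel
    } else if (vowelSigns.has(next)) {
      out += vowelSigns.get(next);
      i += 1;
    } else {
      out += "a"; // inherent vowel
    }
  }
  return out;
}
```

With these toy tables, सड़क comes out as saṛaka and फ़ोन as fōna, because ड़ and फ़ are matched as two-code-point tokens before the inherent vowel is appended. A real port would derive the window size from the full scheme tables rather than from a hand-written map.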

bwasty commented 2 years ago

Ok, I'll look into that after having a stab at #42 (since that 'annoys' me more and I found this interesting paper)