virtualvinodh / aksharamukha

Aksharamukha

ü instead of u when converting Devanagari to IASTpali #176

Closed bdhrs closed 2 years ago

bdhrs commented 2 years ago

i'm using the python module to convert Devanagari to Roman: transliterate.process("autodetect", "IASTPali", text_extract). the resulting text has the letter ü instead of u, e.g. saüddesaṃ instead of sauddesaṃ. ü does not exist in Pāli texts.

virtualvinodh commented 2 years ago

That's to differentiate between "sau" and "sa_u" (with a vowel hiatus).

Apparently, this is an often-used convention in Prakrit texts, where vowel hiatus is quite common.

V


bdhrs commented 2 years ago

pāli has a fixed system of 41 written letters. so which system do you recommend i use? at the moment i just do a regex sub afterwards.
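For reference, the post-processing step mentioned above could look roughly like the sketch below. It is only an illustration: the function name `strip_hiatus_marks` and the `HIATUS_MARKS` table are made up here, and the only mark handled is the lowercase ü reported in this issue. If the converter emits other hiatus marks, they would have to be added to the table by hand.

```python
import re
import unicodedata

# Hypothetical mapping from hiatus-marked vowels to plain Pāli letters.
# Only "ü" is known from this issue; extend the table if other marks appear.
HIATUS_MARKS = {"\u00fc": "u"}  # ü -> u

_PATTERN = re.compile("|".join(map(re.escape, HIATUS_MARKS)))


def strip_hiatus_marks(romanized: str) -> str:
    """Replace hiatus-marked vowels with their plain counterparts."""
    # NFC first, so that "u" + combining diaeresis (U+0308) is also caught.
    text = unicodedata.normalize("NFC", romanized)
    return _PATTERN.sub(lambda m: HIATUS_MARKS[m.group(0)], text)
```

The function would be applied to whatever string transliterate.process returns, so the conversion call itself stays unchanged.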

virtualvinodh commented 2 years ago

The problem is romanization.

In Indic scripts, /sau/ and /sa_u/ are graphically distinct and cannot be confused with each other. However, when you romanize them, both end up as /sau/, hence the need for disambiguation.

It also depends on why you want to romanize them. If you think the reader/user would read /sauddesam/ as sa-ud-de-sam rather than saud-de-sam (because Pali doesn't have /au/, so any /au/ sequence must be read as /a-u/ with a hiatus), you can ignore the disambiguation.

Even if it's for some internal machine learning purpose, I would still differentiate it, since any /au/ sequence most probably indicates a morpheme boundary, as in your example (sa-uddesam).

TL;DR: it depends. If your readers know how Pali works and won't mistake the sequence for the vowel /au/, then just don't show the distinction. Otherwise, you should disambiguate it by some method (I have chosen the umlaut).

V


bdhrs commented 2 years ago

ok great i understand where you are coming from.

if you look at a large pāli corpus like chaṭṭha saṅgāyana (devanagari roman), you will see there are only a handful of examples of au and all of them represent hiatus. (āgantuka_upakkilesa, ariyasāvaka_upāsaka, catutthajjhāna_upamā, eka_usabhagāmī, paritta_udaka, sa_uddesa, sa_udraya, sa_upādāna, sa_upaghāta, sa_upanisa, etc.).

all the diphthongs disappeared in pāli. au became o (sk daurmanasya / pa domanassa) or ā (sk gaurava / pa gārava) or simply a short u (sk auddhatya / pa uddhacca). the same happened with ai, becoming e (sk caitya / pa cetiya) or ā (sk asyai / pa assā) or simply i (sk caitra / pa citta). so there's no need to represent something which doesn't exist.

my suggestion would be to make an IASTprakrit for those cases where you need to disambiguate, and perhaps keep IASTpali to its well-established 41 letter set. {a, ā, i, ī, u, ū, e, o, k, kh, g, gh, ṅ, c, ch, j, jh, ñ, ṭ, ṭh, ḍ, ḍh, ṇ, t, th, d, dh, n, p, ph, b, bh, m, y, r, l, s, v, h, ḷ, ṃ}
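
to make the 41-letter constraint concrete, here is a minimal sketch of a check that flags any letter outside that set in romanized output. the names `PALI_LETTERS` and `foreign_letters` are made up for this example and are not part of aksharamukha. since every digraph (kh, gh, ṭh, ...) is built from single letters that are themselves in the set, checking character by character should be enough.

```python
import unicodedata

# The 41 letters listed above, normalized to NFC so comparisons are stable.
PALI_LETTERS = unicodedata.normalize(
    "NFC",
    "a ā i ī u ū e o "
    "k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n p ph b bh m "
    "y r l s v h ḷ ṃ",
).split()

# Digraphs only combine single letters already in the set,
# so a per-character check covers them as well.
_SINGLE_LETTERS = {letter for letter in PALI_LETTERS if len(letter) == 1}


def foreign_letters(text: str) -> list[str]:
    """Return the sorted alphabetic characters that are not Pāli letters."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return sorted(
        {ch for ch in normalized if ch.isalpha() and ch not in _SINGLE_LETTERS}
    )
```

run over converter output, a non-empty result (for example one containing ü) would point at words that fall outside the fixed letter set. spaces, digits and punctuation are ignored.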