tihu-nlp / tihudict

Tihu dictionary for Persian language

No difference between ه [h] and ح [h] in dictionary #3

Open Jargonautika opened 2 years ago

Jargonautika commented 2 years ago

Obvious care has been taken to distinguish the two pronunciations of ه (hā-ye do-češm): /h/ (word-initially and -medially) and /e/ (word-finally), as in:

Examples like the one above are useful for determining when the grapheme should be pronounced as [h] or [e]. However, the dictionary does not seem to distinguish anywhere between the [h] pronunciation of ه (hā-ye do-češm) and the [h] pronunciation of ح (ḥâ-ye ḥotti / ḥâ-ye jimi). Consider the following examples:

If this dictionary were used in reverse, [1] could be reconstituted from "h A d e s e" as either "هادِثه" or "حادِثه". This is certainly a niche issue, but I am trying to diacritize non-diacritized text, so to reconstitute my original text with vowels included under your dictionary's scheme, I need to know which Farsi character to convert back to. There are no instances in either dictionary where ḥâ-ye jimi and hā-ye do-češm (pronounced as [h]) appear in the same word, so a simple string replace should do it there. I suggest replacing [h] with [H] for ḥâ-ye jimi to make this dictionary reversible.

It may well be that no words in Farsi contain both ح and ه pronounced as [h], but this would solve the edge use case I describe here.
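To make the reversibility point concrete, here is a minimal sketch in Python. The two tables and both function names are hypothetical and only model the h/H pair; the real tihudict file format and symbol inventory may differ.

```python
# Hypothetical grapheme-to-phoneme tables (not the actual tihudict format).
CURRENT = {"ه": "h", "ح": "h"}    # both graphemes share one symbol today
PROPOSED = {"ه": "h", "ح": "H"}   # proposed: a distinct symbol for ح


def reverse_candidates(table, phoneme):
    """Return every grapheme that the table maps to the given phoneme symbol."""
    return sorted(g for g, p in table.items() if p == phoneme)


def is_reversible(table):
    """A table is reversible when no two graphemes share a phoneme symbol."""
    symbols = list(table.values())
    return len(symbols) == len(set(symbols))


if __name__ == "__main__":
    # The first symbol of "h A d e s e" has two possible spellings today.
    print(reverse_candidates(CURRENT, "h"))
    # With the proposed symbol, "H" points back to a single grapheme.
    print(reverse_candidates(PROPOSED, "H"))
    print(is_reversible(CURRENT), is_reversible(PROPOSED))
```

This only covers the single pair discussed here; any other graphemes that share a symbol would need the same treatment.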

Thanks for your work!

Jargonautika commented 2 years ago

After using this more, I found that the same is true of a number of other pairs:

b00f commented 2 years ago

@Jargonautika

> If this dictionary were to be used in its reverse form

We use this dictionary to predict the pronunciation of a given word. If I am not mistaken, you are looking for the reverse function. Persian has many homophones, words with different meanings but the same pronunciation. For example:

In this repository we focused on pronunciation. However, there is another project that tries to Latinize Persian: Alefbaye 2om. You may want to check it out as well.
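The homophone point can be sketched in Python as follows. The entries are illustrative only: خار and خوار are a commonly cited homophone pair, but the phoneme strings for them use an assumed notation and are not copied from tihudict, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative entries; the "x A r" notation is an assumption, not tihudict data.
ENTRIES = {
    "خار": "x A r",
    "خوار": "x A r",
    "حادثه": "h A d e s e",
}


def invert(entries):
    """Group spellings by their pronunciation string."""
    by_pron = defaultdict(list)
    for word, pron in entries.items():
        by_pron[pron].append(word)
    return dict(by_pron)


def ambiguous(entries):
    """Return pronunciations that map back to more than one spelling."""
    return {p: ws for p, ws in invert(entries).items() if len(ws) > 1}


if __name__ == "__main__":
    for pron, words in ambiguous(ENTRIES).items():
        print(pron, "->", words)
```

Even with distinct symbols for every consonant letter, pairs like this one would still collide in the reverse direction, because the spellings differ in a letter that is not pronounced.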

Stay happy and healthy.