No difference between ه [h] and ح [h] in dictionary

Jargonautika commented 2 years ago

Obvious care has been taken to distinguish the two pronunciations of ه (hā-ye do-češm): /h/ (word-initially and word-medially) and /e/ (word-finally), as in:

[ ] ماهه m A h e

Examples like the one above are useful for determining when the grapheme should be pronounced as [h] or [e]. However, nowhere in the dictionary does there seem to be a distinction between the [h] pronunciation of ه (hā-ye do-češm) and the [h] pronunciation of ح (ḥâ-ye ḥotti / ḥâ-ye jimi). Consider the following examples:

[1] حادثه h A d e s e
[2] حوزه h o z e
[3] صحیح s a h i h
[4] آنها A n h A

If this dictionary were to be used in its reverse form, [1] could be reconstituted from "h A d e s e" as either "هادِثه" or "حادِثه". This is certainly a niche issue, but I am trying to diacritize non-diacritized text. To reconstitute my original text with the vowels included, following your dictionary's scheme, I need to know which Farsi character each [h] should be converted back to. There are no instances in either dictionary where ḥâ-ye jimi and hā-ye do-češm (pronounced as [h]) appear in the same word, so a simple string replace should be enough. I suggest replacing [h] with [H] for ḥâ-ye jimi to make this dictionary reversible.

It may well be that no Farsi word contains both ح and ه pronounced as [h], but this change would still solve the edge use case I describe here.
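A minimal sketch of the proposed relabeling, assuming each dictionary entry is available as a (word, phoneme string) pair with space-separated phonemes as in the examples above. The function name `mark_he_jimi` is hypothetical, and the rule relies on the observation above that ح and ه pronounced as [h] never occur in the same word:

```python
HE_JIMI = "\u062d"  # ح (ḥâ-ye jimi)


def mark_he_jimi(word: str, phonemes: str) -> str:
    """Rewrite every 'h' phoneme as 'H' when the spelling contains ح.

    Assumption: a word never contains both ح and a ه pronounced as [h],
    so every 'h' in such a word must come from ح.
    """
    if HE_JIMI not in word:
        return phonemes
    return " ".join("H" if p == "h" else p for p in phonemes.split())


# Entries [1] to [4] from the examples above.
entries = [
    ("حادثه", "h A d e s e"),
    ("حوزه", "h o z e"),
    ("صحیح", "s a h i h"),
    ("آنها", "A n h A"),
]

for word, phonemes in entries:
    print(word, "->", mark_he_jimi(word, phonemes))
```

Word-final ه in [1] and [2] is already transcribed as `e`, so it is not affected by the replacement.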

Thanks for your work!

Jargonautika commented 2 years ago

Having used this more, I see that the same is true of a number of other letter pairs:

س and ص
ز and ض, etc. It's just not reversible.
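To illustrate why the mapping cannot be inverted at the letter level, here is a small sketch that inverts a letter-to-phoneme table and reports every phoneme reachable from more than one letter. The table holds only pairs mentioned in this thread, not the dictionary's full scheme, and the name `ambiguous_phonemes` is hypothetical:

```python
from collections import defaultdict

# Letter-to-phoneme pairs taken from this thread only (an illustrative subset).
# Simplification: ه is listed with its [h] reading; word-finally it is /e/.
LETTER_TO_PHONEME = {
    "ه": "h",
    "ح": "h",
    "س": "s",
    "ص": "s",
}


def ambiguous_phonemes(table):
    """Return {phoneme: [letters]} for phonemes produced by 2+ letters."""
    by_phoneme = defaultdict(list)
    for letter, phoneme in table.items():
        by_phoneme[phoneme].append(letter)
    return {p: letters for p, letters in by_phoneme.items() if len(letters) > 1}


for phoneme, letters in ambiguous_phonemes(LETTER_TO_PHONEME).items():
    print(phoneme, "<-", " / ".join(letters))
```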

b00f commented 2 years ago

@Jargonautika

"If this dictionary were to be used in its reverse form"

We use this dictionary to predict the pronunciation of a given word. If I am not mistaken, you are looking for the reverse function. Persian has many homophones: words with different spellings and meanings but the same pronunciation. For example:

h a y A t can be written in either of these forms: حیاط (courtyard) or حیات (life)
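The homophone problem can be made concrete with a reverse index from pronunciation to spellings. This is only a sketch: it assumes the entries are held in memory as (word, phoneme string) pairs, and `build_reverse_index` is a hypothetical helper, not something this repository provides:

```python
from collections import defaultdict


def build_reverse_index(entries):
    """Map each phoneme string to every spelling that produces it."""
    index = defaultdict(list)
    for word, phonemes in entries:
        index[phonemes].append(word)
    return dict(index)


# Sample entries taken from this thread.
entries = [
    ("حیاط", "h a y A t"),  # courtyard
    ("حیات", "h a y A t"),  # life
    ("آنها", "A n h A"),
]

index = build_reverse_index(entries)

# A pronunciation with more than one spelling cannot be reversed
# without extra context such as the surrounding sentence.
for phonemes, words in index.items():
    status = "ambiguous" if len(words) > 1 else "unique"
    print(phonemes, "->", words, f"({status})")
```

Any pronunciation whose list holds more than one spelling needs information beyond the dictionary to pick the right word.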

In this repository we focused on pronunciation. However, there is another project that tries to latinize Persian, Alefbaye 2om, which you may also want to check out.

Wishing you happiness and good health.

tihu-nlp / tihudict
