Serbian Cyrillic punctuation symbols and marks are mistakenly recognized as Russian

NiKola-UE commented 2 months ago

Hello to all.

Almost everything about this problem has already been said in this issue, so I'll just quote some key points here.

"Serbian Cyrillic punctuation simbols and marks are mistakenly recognized as Russian. This mostly happens when reading an already written text, while it doesn't happen when typing."

"Serbian and Russian Cyrillic are to some extent similar, but they differ drastically, so the names of those characters in these languages also differ, which can confuse users who do not know Russian or perhaps even frustrate those who simply do not want the punctuation in Serbian texts to be pronounced On russian."

"Sometimes it also happens that the pronunciations of Serbian and Croatian Latin characters are mixed up, but this is much less common, even though these two Latin characters are exactly the same, with the fact that there are minor differences between their names."

"[...] However, the main reason for opening this issue concerns the non-recognition of punctuation simbol promounciation in Serbian Cyrillic because all speech synthesizers incorrectly identify them as Russian, which most often happens when reading an already written text in a any text editors or on a web page that are written in Serbian (Cyrillic) when I use the arrows for navigation through the text; if I was a little clearer now."

"Like I said, I'm willing to fix it if I can at all."

"I don't know how synthesizers affect the correct pronunciation, but until recently I used the commercial AlfaNum's AnReader, and lately, except ESpeak, I also use the Newfon (which is currently only available as a NVDA add-on), and also FOSS RHVoice, in the development of which I want to be involved; depending on the needs, since I like to experiment. But in all cases the result is the same, with the fact that ESpeak still has the problem of misidentifying both Serbian Latin and Croatian as English, which is the problem of that synthesizer. Most punctuation marks are misidentified. So, . (tačka / тачка) is точка, , (zarez / зарез) is запятая, ! (uzvičnik / узвичник) is восклицательный знак, ? (upitnik / упитник) is вопросительный знак, etc. The same happens with recognition of the emoticons."

"I don't use ESpeak that often because when I read e-books or edit some text, I prefer to use synthesizers with natural voices. In addition to web pages, the indicated errors with punctuation symbols also appear when I read a document in a text editor, but they do not occur when I modify and edit that document. But when I save the changes, the same errors appear again, so it turns out that the characters are recognized based on the most closely related alphabet, and not the language, which does not happen in the case of the Latin alphabet. Again, with documents and pages written in Serbo-Latin, this does not happen..."

"However, it is possible that there are errors in connection with an AnReader, which I used most often and somehow got used to it. And the majority of users in Serbia still use that reader most often."

"By the way, all synthesizers and NVDATTS add-ons that support the Serbian language also support the Cyrillic alphabet and no any problems with it..."

"Adding new letters for the same languages is only good for the programming interface - synthesizers that supported Cyrillic will continue to support it, those that didn't won't and that's perfectly fine. The Cyrillic text is perfectly readable on all pages, only the punctuation is incorrectly recognized as Russian when I move through the text with the left and right arrows. If it helps in any way, I can send the audio files for the demonstrating it, but I don't know exactly where to forward them. Again, maybe it depends on the synth."

"The automatic language change option is good precisely because of the recognition of different languages and possibly dialects, which is good when using multilingual syntheses that support language and voice synchronization (like ESpeak or Nuance's Vocalizer Expressive)."

"As for the Microsoft's One Core Voices, it is understandable that the Croatian or Slovenian voices does not read Cyrillic because that script has not been used in Croatia since the XIX century, although most Croats still know it. Interestingly, Serbian One Core Voices can only read Cyrillic, but those voices cannot be used in Windows; at least as far as I know. On the other side, Lana, a Croatian voice of the aforementioned Vocalizer can read Cyrillic (Slovenian voice Tina cannot), but there are problems with the pronunciation of written letters, just like with Latin diacritical marks, which is of course a problem with that synthesis. Bulgarian, Russian and Ukrainian voices can read Serbian Cyrillic more or less well, but they cannot recognize some letters that do not exist in the Cyrillic alphabet of those languages. The same is for Russian, Ukrainian and Macedonian voices of the RHVoice, which also has to do with those syntheses, not with NVDA. Newfon support only the Serbian Latin in to the interface, but reads Cyrillic flawlessly; probably also because the authors are from this area, ie. from the Balkan (Ex Yugoslavia)."

"Without further ado, I agree that the problem is not a big one. It doesn't bother me much and I use NVDA in English, but it can confuse some other users, mainly beginners who are just getting acquainted with these things and may not understand what it is actually about..."

"Yes, in the last comment, Nidžo perfectly and accurately summarized everything that I was talking about here. Regarding AnReader, although it reads Cyrillic, Serbian Latin is the default script, so the problem may be in that."

"When I have more time, I will also upload short audio recordings in which Serbian-speaking synthesizers pronounce the same text and symbols, so that the obtained results can be compared. However, those recordings will not be useful to those who do not know Serbo-Croatian. ESpeak still has problems with proper language and letter recognition, which is again a problem with that synthesizer."

I apologize if I did something wrong in opening or considering this issue. Thank you in advance.

seanbudd commented 2 months ago

Closing this as it doesn't formulate a clear issue. Please keep discussion to the discussion. When a clear issue is formulated, please open a new issue filling out the bug report template fully.

CyrilleB79 commented 2 months ago

@NiKola-UE, any reason why you have not used the template?

In the future, e.g. after @seanbudd has answered https://github.com/nvaccess/nvda/discussions/16465#discussioncomment-9338028, if you need help using the template, or you do not even know what I am talking about, just let me know.

nvaccess / nvda

Serbian Cyrillic punctuation symbols and marks are mistakenly recognized as Russian #16476