tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

wrong default mapping of some Romanian diacritics #37

Open latrau opened 6 years ago

latrau commented 6 years ago

Environment

Debian Linux

Current Behavior:

Using the ron (Romanian) language option:

The Romanian diacritics ș, Ș, ț, Ț are mapped to the wrong Unicode code points, namely:

Ș -> Ş = U+015E
ș -> ş = U+015F
Ț -> Ţ = U+0162
ț -> ţ = U+0163

Expected Behavior:

Ș -> Ș = U+0218
ș -> ș = U+0219
Ț -> Ț = U+021A
ț -> ț = U+021B

Suggested Fix:

Edit the map accordingly.
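As an illustration of the intended mapping (not the actual Tesseract or langdata code), here is a minimal Python sketch that converts the four cedilla code points to their comma-below counterparts. The names CEDILLA_TO_COMMA and to_comma_below are hypothetical and chosen only for this example.

```python
# Hypothetical helper: maps the cedilla forms (Latin Extended-A) to the
# comma-below forms (Latin Extended-B) listed under "Expected Behavior".
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})


def to_comma_below(text: str) -> str:
    """Return text with the four cedilla letters replaced by comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)
```

For example, to_comma_below("ştiinţă") returns "știință". The sketch only handles precomposed characters; text that stores the cedilla as a separate combining mark is left unchanged.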

zdenop commented 6 years ago

Where is an input image or something else that would demonstrate the problem?

latrau commented 6 years ago

The Romanian typographical convention is that the diacritic on s and t is a comma below, not a cedilla (Unicode encodes both forms separately: the cedilla forms in Latin Extended-A, the comma-below forms in Latin Extended-B).

It would be best if any s or t with a diacritic under the ron (Romanian) option were mapped to the Latin Extended-B code points above, meaning that Tesseract's ron unicharset should contain no trace of [15e ], [15f ], [162 ] or [163 ], only [218 ]-[21b ].

e.g. (two screenshots attached: 2018-02-10 22-18-06 and 2018-02-10 22-17-33)

the wrong mapping is everywhere once the -ron option is selected...

let me quote UNICODE 10 (chap.07) on this:

The Unicode Standard provides unambiguous representations for all of the forms, for example, U+0219 ș latin small letter s with comma below versus U+015F ş latin small letter s with cedilla. In modern usage, the preferred representation of Romanian text is with U+0219 ș latin small letter s with comma below, while Turkish data is represented with U+015F ş latin small letter s with cedilla.

The same goes for Ș, ț and Ț.

So the ron option means ș, Ș, ț, Ț [U+0218-U+021B] with no ambiguity, and should nowhere involve ş, Ş, ţ, Ţ [U+015E-U+015F, U+0162-U+0163].
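One way to check the "should nowhere involve" condition on OCR output or on a data file is to count the cedilla code points in the text. The following is a rough Python sketch; find_cedilla_forms is a hypothetical name, and the function simply scans a string rather than parsing any Tesseract file format.

```python
from collections import Counter

# The four cedilla code points that should not appear in Romanian text.
CEDILLA_FORMS = "\u015E\u015F\u0162\u0163"


def find_cedilla_forms(text: str) -> dict:
    """Count occurrences of each cedilla form, keyed by code point label."""
    counts = Counter(ch for ch in text if ch in CEDILLA_FORMS)
    return {f"U+{ord(ch):04X}": n for ch, n in sorted(counts.items())}
```

An empty result means the text contains none of the four cedilla letters; any other result lists which code points still need replacing and how often they occur.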

amitdo commented 4 years ago

This issue is not caused by Tesseract itself. It should be moved to another repo (not sure which one).

stweil commented 4 years ago

I think langdata_lstm is a good one, so I will transfer the issue there.

stweil commented 4 years ago

@latrau, so each of the wrong characters should be replaced? Do you want to send a pull request which fixes ron.training_text, maybe also ron.singles_text and ron.wordlist?
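A replacement along these lines could be scripted. The sketch below is only an assumption about how such a fix might be prepared: the directory name ron and the helper fix_file are hypothetical, and it presumes the three files are UTF-8 text. It reads and writes bytes so that existing line endings are left untouched.

```python
from pathlib import Path

# Cedilla forms -> comma-below forms (one-to-one character mapping).
CEDILLA_TO_COMMA = str.maketrans(
    "\u015E\u015F\u0162\u0163", "\u0218\u0219\u021A\u021B"
)


def fix_file(path: Path) -> int:
    """Rewrite one UTF-8 file in place; return the number of characters changed."""
    original = path.read_bytes().decode("utf-8")
    fixed = original.translate(CEDILLA_TO_COMMA)
    # The mapping is one-to-one, so both strings have the same length.
    changed = sum(a != b for a, b in zip(original, fixed))
    if changed:
        path.write_bytes(fixed.encode("utf-8"))
    return changed


if __name__ == "__main__":
    langdir = Path("ron")  # hypothetical location of the Romanian data files
    for name in ("ron.training_text", "ron.singles_text", "ron.wordlist"):
        target = langdir / name
        if target.exists():
            print(name, fix_file(target))
```

Replacing every occurrence unconditionally matches the request in the issue; if cedilla forms need to be kept for historic texts, the mapping would have to be applied selectively instead.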

stweil commented 4 years ago

@latrau, was the cedilla used in historic Romanian texts? If yes, it might be a good idea to keep both forms (with cedilla for the historic characters and with comma for the modern ones).