Open nisbet-hubbard opened 9 months ago
That's not an issue of tesseract, but of the model, which does not include the GREEK LUNATE SIGMA SYMBOL (see unicharsets for grc and script/Greek). Therefore I move this issue to langdata_lstm.
The symbol is not recognized because it was not part of the training data. Tesseract therefore outputs another symbol that looks somewhat similar.
Thanks for moving it. If I understand you correctly, the fact that I’m seeing the regular sigmas σ (when non-final) and ς (when final) in the OCR text whenever a lunate sigma ϲ is present in the image isn’t because the lunate sigma actually gets recognised as a sigma, but simply because ϲ looks similar to σ/ς.
Yes, that's right.
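For anyone who wants to confirm this for another model, here is a minimal sketch of checking whether a symbol is listed in a unicharset. It assumes the usual plain-text layout (first line is the entry count, each following line starts with the symbol and then space-separated properties). The `sample` text and its property values are made up for illustration and are not taken from the real grc unicharset.

```python
LUNATE_SIGMA = "\u03f2"  # GREEK LUNATE SIGMA SYMBOL


def unicharset_symbols(text: str) -> set[str]:
    """Return the symbols listed in unicharset-style text.

    Assumed layout: line 1 is the number of entries; every later
    non-empty line begins with the symbol, followed by properties.
    """
    lines = text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


# Hypothetical excerpt; the property columns are placeholders.
sample = "4\nNULL 0\nσ 3\nς 3\nα 3\n"

symbols = unicharset_symbols(sample)
has_lunate_sigma = LUNATE_SIGMA in symbols
print("lunate sigma in unicharset:", has_lunate_sigma)
```

If the symbol is absent, the model cannot output it, no matter how clearly it appears in the image.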
Current Behavior
A lunate sigma (ϲ, U+03F2) is recognised under language ‘grc’ but is output as a regular sigma (σ or ς).
Expected Behavior
Outputting it as U+03F2.
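To make the distinction concrete, the three characters involved are separate Unicode code points. A small sketch using only Python's standard `unicodedata` module lists them (the variable name `names` is just for illustration):

```python
import unicodedata

# Lunate sigma, regular (non-final) sigma, final sigma
chars = ("\u03f2", "\u03c3", "\u03c2")

# Map "U+XXXX" to the official Unicode character name
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in chars}

for code_point, name in names.items():
    print(code_point, name)
```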
Suggested Fix
No response
tesseract -v
5.3.0-6-g76ae
Operating System
No response
Other Operating System
No response
uname -a
No response
Compiler
No response
CPU
No response
Virtualization / Containers
No response
Other Information
No response