tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Missing GREEK LUNATE SIGMA SYMBOL in grc and script/Greek models #55

Open nisbet-hubbard opened 9 months ago

nisbet-hubbard commented 9 months ago

Current Behavior

A lunate sigma (ϲ, U+03F2) is recognised under language ‘grc’ but is output as a regular sigma (σ or ς).

Expected Behavior

It should be output as U+03F2.
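
For reference, the three characters involved are distinct Unicode code points. The short Python sketch below only lists their code points and official names using the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


# Regular sigma, final sigma, and lunate sigma, in that order.
SIGMAS = "\u03c3\u03c2\u03f2"

for code_point, name in describe(SIGMAS):
    print(code_point, name)
```

Running it on OCR output instead of `SIGMAS` is a quick way to confirm which of the three forms actually ended up in the text.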

tesseract -v

5.3.0-6-g76ae

stweil commented 9 months ago

That's not an issue with Tesseract itself, but with the model, which does not include the GREEK LUNATE SIGMA SYMBOL (see the unicharsets for grc and script/Greek). Therefore I am moving this issue to langdata_lstm.
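
A rough way to check a unicharset for a given symbol is sketched below. It assumes the plain-text unicharset layout in which the first line holds the entry count and each following line starts with the symbol, followed by its properties. The function names and the `SAMPLE` lines are hypothetical and heavily abbreviated; a real unicharset (which can be unpacked from a traineddata file, for instance with the combine_tessdata tool) has more fields per line.

```python
def unicharset_symbols(lines):
    """Collect the first field of every entry line of a unicharset file."""
    entries = [ln for ln in lines if ln.strip()]
    # Skip the first line, assumed to hold the number of entries.
    return {ln.split(None, 1)[0] for ln in entries[1:]}


def has_symbol(lines, symbol):
    """Return True if `symbol` is listed as an entry in the unicharset."""
    return symbol in unicharset_symbols(lines)


# Made-up, simplified sample; not copied from the real grc unicharset.
SAMPLE = [
    "4",
    "NULL 0 Common 0",
    "\u03c3 3 Greek 1",   # regular sigma
    "\u03c2 3 Greek 2",   # final sigma
    "\u03b1 3 Greek 3",   # alpha
]

print("lunate sigma present:", has_symbol(SAMPLE, "\u03f2"))
print("regular sigma present:", has_symbol(SAMPLE, "\u03c3"))
```

With a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to `has_symbol` in the same way.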

stweil commented 9 months ago

The symbol is not recognized because it was not part of the training data. Tesseract therefore outputs another symbol that looks somewhat similar.

nisbet-hubbard commented 9 months ago

Thanks for moving it. If I understand you correctly: whenever the image contains a lunate sigma ϲ, the OCR text shows a regular sigma, σ when non-final and ς when final. That isn’t because the lunate sigma is actually recognised as a sigma, but simply because ϲ looks similar to σ/ς.

stweil commented 9 months ago

Yes, that's right.