madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0
5.76k stars 715 forks

Greek language letters #547

Open Quetzaal opened 3 months ago

Quetzaal commented 3 months ago

Hi, I've downloaded the Greek OCR data for Tesseract and added it to tessdata (grc.traineddata), and it shows up as available in Python through pytesseract: print(pytesseract.get_languages(config='')) => ['ell', 'eng', 'equ', 'fra', 'frm', 'grc', 'lat', 'osd']
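For readers following along, the availability check above can be taken one step further. This is a minimal sketch that reports which requested language codes are missing from the installed list. The helper name `missing_languages` is hypothetical, and the list is the one printed in this comment. In Tesseract's language data, `ell` is Modern Greek and `grc` is Ancient Greek, so which code fits depends on the text being recognized.

```python
def missing_languages(requested, available):
    """Return the codes in a 'grc+eng'-style string that are not installed.

    Tesseract accepts several languages joined with '+', so the request
    is split on that character before checking each code.
    """
    return [code for code in requested.split("+") if code not in available]


# Language list as reported in this thread by
# pytesseract.get_languages(config='').
available = ['ell', 'eng', 'equ', 'fra', 'frm', 'grc', 'lat', 'osd']

print(missing_languages("grc", available))      # []
print(missing_languages("grc+deu", available))  # ['deu']
```

An empty result only means the traineddata files are installed. It says nothing about whether the chosen code matches the language of the page.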

However, OCR of a PNG of a Greek Wikipedia page does not return any Greek letters. Maybe I am doing something wrong; I haven't been using pytesseract for long, but it works fine for French, English, and Latin, so why not Greek?

Thanks for your answer and explanation.

stefan6419846 commented 3 months ago

Please try to run this page through plain Tesseract - I suspect that this is a general Tesseract issue and not related to pytesseract.

Quetzaal commented 3 months ago

Tesseract OCR on its own works as it should, but I can't get this output through pytesseract: «  Κατά την πρώιμη τυπογραφία των έργων στην ελληνική γλώσσα περί το 1500, πολλά από τα συμπλέγματα αυτά υιοθετήθηκαν από παλαιότερα χειρόγραφα. Σηµαντικά δείγματα από την Περίοδο αυτή προήλθαν από τα σχέδια του Άλδου Μανούτιου στην Βενετία, Και του Κλωντ Γκαραμόντ στο Παρίσι ο οποίος δημιούργησε την Ιδιαπερα εππυχηµένη γραμματοσειρά 6γεο5 ου γοίτο 1841 »

To try to get the same through pytesseract, I used « txt = pytesseract.image_to_string(Image.open(adrimg), lang) » inside a function (def) that already works for French and Latin.
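A minimal sketch of that call with the language passed explicitly by keyword and checked first. The wrapper name `ocr_image` is hypothetical and this is not the poster's actual function. The check is there because, as far as I know, pytesseract does not pass `-l` at all when `lang` is `None`, so Tesseract falls back to its default language (normally `eng`) and Greek text would come back as Latin-looking characters.

```python
def ocr_image(path, lang):
    """Run OCR on the image at `path` using an explicit language code.

    `lang` must be a non-empty string such as 'ell', 'grc' or 'fra+lat'.
    """
    if not isinstance(lang, str) or not lang:
        raise ValueError("lang must be a non-empty language code, e.g. 'ell'")

    # Imported here so the validation above works even where
    # Pillow or pytesseract are not installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

Calling it as `ocr_image(adrimg, "ell")` mirrors the line quoted above, with `adrimg` being the poster's own image path variable.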

So if you can do anything about it, thanks.

stefan6419846 commented 3 months ago

Your write-up is a bit confusing. In theory, when using the same parameters for both cases, the results should be identical. I am not aware that pytesseract would knowingly corrupt any output. For this reason, please provide some more details and possibly an example image as well as the parameters used for both cases.
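One way to make such a comparison concrete is sketched below, assuming the `tesseract` binary is on PATH and that Pillow and pytesseract are installed. The helper names `tesseract_cli_command` and `compare_outputs` are hypothetical. The sketch runs the plain command line and the wrapper with the same language and returns both texts. Small differences may still appear, since pytesseract hands Tesseract a temporary copy of the image rather than the original file.

```python
import subprocess


def tesseract_cli_command(image_path, lang, psm=None):
    """Build the plain Tesseract command that writes text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def compare_outputs(image_path, lang):
    """Return (same, cli_text, wrapper_text) for one image and language."""
    # Imported lazily so the command builder is usable on its own.
    from PIL import Image
    import pytesseract

    cli_text = subprocess.run(
        tesseract_cli_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,
    ).stdout

    with Image.open(image_path) as img:
        wrapper_text = pytesseract.image_to_string(img, lang=lang)

    return cli_text.strip() == wrapper_text.strip(), cli_text, wrapper_text
```

If the two texts differ, the language code and any extra config passed on each side are the first things to compare.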

Quetzaal commented 3 months ago

My mistake was a weird one... everything works. I'll go bury myself somewhere where I'll be forgotten forever, hahaha.