tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Multiple language detection within an image #4238

Closed metouitude closed 2 weeks ago

metouitude commented 2 months ago

Your Feature Request

Hello,

I'm currently working on a personal project that involves detecting multiple languages, and the furthest I've got is:

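    import re
    import pytesseract

    # Run orientation and script detection (OSD), then parse its text output.
    osd = pytesseract.image_to_osd(self.img)
    script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
    conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)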

To be honest, this is taken directly from https://stackoverflow.com/questions/70198974/how-to-detect-language-or-script-from-an-input-image-using-python-or-tesseract-o
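The two regex lookups above raise an AttributeError when the OSD text has no matching line. Below is a minimal sketch of a more defensive parser, assuming the usual "Key: value" layout of OSD output. The helper name `parse_osd` and the sample text are illustrative and not part of pytesseract.

```python
import re


def parse_osd(osd_text):
    """Extract the script name and its confidence from OSD text.

    Returns None for any field that is not present instead of raising.
    """
    script = re.search(r"^Script: (\w+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Script confidence: (\d+(?:\.\d+)?)", osd_text, re.MULTILINE)
    return {
        "script": script.group(1) if script else None,
        "confidence": float(conf.group(1)) if conf else None,
    }


# Illustrative OSD text; the script values mirror the ones reported in this issue.
sample = (
    "Page number: 0\n"
    "Orientation in degrees: 0\n"
    "Rotate: 0\n"
    "Orientation confidence: 2.22\n"
    "Script: Latin\n"
    "Script confidence: 2.22\n"
)

print(parse_osd(sample))
```

In real use, the string returned by `pytesseract.image_to_osd(...)` would be passed in place of `sample`.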

So, for example, let's say we have an image with two or more languages, like this one:

downloaded_image

In this case, OSD detects only Latin, with a script confidence of 2.22, but at the same time pytesseract.image_to_boxes(self.img, lang="ara") returns Arabic text.

My point is:

yaofuzhou commented 1 month ago

Have you tried to run Tesseract with, say, lang = ara+eng?
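A minimal sketch of what that suggestion could look like from Python, assuming pytesseract is used as in the original post and that the `ara` and `eng` language packs are installed. The helper names `build_lang_arg` and `ocr_multilang` are hypothetical.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes with '+', e.g. ['ara', 'eng'] -> 'ara+eng'."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_multilang(image, codes):
    """Run OCR on an image with several languages enabled at once."""
    # Imported lazily so build_lang_arg can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=build_lang_arg(codes))


print(build_lang_arg(["ara", "eng"]))
```

Calling `ocr_multilang(img, ["ara", "eng"])` would then recognize both Arabic and English text in a single pass, where `img` is whatever image object or file path was already being passed to pytesseract.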

metouitude commented 2 weeks ago

> Have you tried to run Tesseract with, say, lang = ara+eng?

Actually, I found a better approach via ChatGPT: apt install tesseract-ocr-all installs all the languages, so now I can detect multiple languages within a single image.
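Installing the packs makes the language data available, but the `lang` argument still decides which languages a given call uses. Below is a minimal sketch for checking that the needed codes are installed before running OCR, assuming a pytesseract version that provides `get_languages`. The helper names `missing_languages` and `check_installed` are hypothetical.

```python
def missing_languages(required, installed):
    """Return the required language codes that are not in the installed list."""
    installed_set = set(installed)
    return [code for code in required if code not in installed_set]


def check_installed(required):
    """Compare the required codes against what the local Tesseract reports."""
    # Imported lazily so missing_languages can be used without pytesseract installed.
    import pytesseract

    return missing_languages(required, pytesseract.get_languages(config=""))


# Illustrative installed list, not real output from any machine.
print(missing_languages(["ara", "eng", "fra"], ["eng", "osd", "ara"]))
```

An empty result from `check_installed(["ara", "eng"])` would mean both packs are present and `lang="ara+eng"` can be used.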