Multiple language detection within an image

metouitude commented 2 months ago

Your Feature Request

Hello,

I'm currently working on a personal project that involves detecting multiple languages in an image, and the furthest I have got is:

osd = pytesseract.image_to_osd(self.img)
script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)

To be honest, this is taken directly from https://stackoverflow.com/questions/70198974/how-to-detect-language-or-script-from-an-input-image-using-python-or-tesseract-o
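For anyone reusing the snippet above, a slightly more defensive variant might look like this. It is only a sketch: `parse_osd_script` is a hypothetical helper name, and the sample string simply mirrors the "Script: ..." / "Script confidence: ..." layout that the regular expressions above assume. Unlike the original, it returns None instead of raising when no script line is present.

```python
import re


def parse_osd_script(osd: str):
    """Return (script, confidence) parsed from OSD text, or None if absent."""
    script = re.search(r"Script: ([a-zA-Z]+)", osd)
    conf = re.search(r"Script confidence: (\d+(?:\.\d+)?)", osd)
    if script is None or conf is None:
        return None
    return script.group(1), float(conf.group(1))


# Illustrative text in the layout the regexes expect (values are made up).
sample = "Orientation in degrees: 0\nScript: Latin\nScript confidence: 2.22\n"
print(parse_osd_script(sample))
```

In real use the string would come from `pytesseract.image_to_osd(img)` rather than from a hard-coded sample.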

For example, say we have an image containing two or more languages, like this one:

downloaded_image

In this case, OSD detects only Latin, with a confidence of 2.22,

but at the same time pytesseract.image_to_boxes(self.img, lang="ara") returns Arabic text.

My question is:

Would it be possible to run OSD detection three times (for ara, Latin, and Hebrew) so that it returns multiple languages? In other words, could pytesseract.image_to_osd(self.img) be made to detect multiple languages?
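Until something like that exists, one possible workaround is to run ordinary OCR once per candidate language and compare the average word confidence. The sketch below is a heuristic, not the behavior of `image_to_osd`: the helper names `mean_confidence` and `rank_languages` are hypothetical, the language list is an assumption, and confidences from different language models may not be strictly comparable.

```python
def mean_confidence(data: dict) -> float:
    """Average word confidence from an image_to_data DICT result.

    Entries with empty text or a negative confidence (non-word boxes)
    are skipped. Returns 0.0 when no words were recognized.
    """
    confs = [
        float(c)
        for c, t in zip(data["conf"], data["text"])
        if str(t).strip() and float(c) >= 0
    ]
    return sum(confs) / len(confs) if confs else 0.0


def rank_languages(img, langs=("ara", "eng", "heb")):
    """Run OCR once per language and sort languages by mean confidence."""
    # Imported lazily: needs Tesseract plus the listed language packs.
    import pytesseract
    from pytesseract import Output

    scores = {}
    for lang in langs:
        data = pytesseract.image_to_data(img, lang=lang, output_type=Output.DICT)
        scores[lang] = mean_confidence(data)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_languages(img)` would return a list of (language, score) pairs; where to put the cut-off between "present" and "absent" is left to the caller.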

yaofuzhou commented 1 month ago

Have you tried to run Tesseract with, say, lang = ara+eng?
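A minimal sketch of that suggestion, assuming the ara and eng language packs are installed: run OCR with the combined language string, then count which scripts appear in the recognized text. The helper names `count_scripts` and `detect_scripts` are hypothetical, and taking the first word of each Unicode character name as the script is a rough heuristic that works for Arabic, Latin, and Hebrew letters.

```python
import unicodedata
from collections import Counter


def count_scripts(text: str) -> Counter:
    """Count letters per script, using the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Names start with the script, e.g. "ARABIC LETTER MEEM".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def detect_scripts(img, lang="ara+eng"):
    """OCR the image with several languages at once and tally the scripts."""
    # Imported lazily: needs Tesseract plus the listed language packs.
    import pytesseract

    text = pytesseract.image_to_string(img, lang=lang)
    return count_scripts(text)
```

An image containing both English and Arabic text should then yield non-zero counts for both LATIN and ARABIC, provided the OCR itself succeeds.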

metouitude commented 2 weeks ago

> Have you tried to run Tesseract with, say, lang = ara+eng?

Actually, I found a better approach via ChatGPT: apt install tesseract-ocr-all installs all the language packs, so now I can detect multiple languages within a single image.
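Note that installing the packs only makes them available; they still have to be selected through the lang argument, since Tesseract falls back to eng when no language is given. A rough sketch of checking what is installed and building the combined string follows. It assumes a recent pytesseract release that provides `get_languages`; `build_lang_arg` and `ocr_with_installed` are hypothetical helper names.

```python
def build_lang_arg(wanted, installed):
    """Join the wanted language codes that are installed, e.g. 'ara+eng'."""
    available = [code for code in wanted if code in installed]
    return "+".join(available)


def ocr_with_installed(img, wanted=("ara", "eng", "heb")):
    """OCR the image using whichever of the wanted packs are installed."""
    # Imported lazily: needs Tesseract to be installed.
    import pytesseract

    installed = pytesseract.get_languages(config="")
    lang = build_lang_arg(wanted, installed)
    if not lang:
        raise RuntimeError("none of the requested language packs are installed")
    return pytesseract.image_to_string(img, lang=lang)
```

Passing every installed language at once is possible but tends to slow recognition down, so restricting `wanted` to the languages actually expected is usually the safer choice.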

tesseract-ocr / tesseract

Multiple language detection within an image #4238
