Unable to detect simple math equations using pytesseract

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


Unable to detect simple math equations using pytesseract #3028

Closed NavpreetDevpuri closed 4 years ago

NavpreetDevpuri commented 4 years ago

Environment

Tesseract Version: tesseract v5.0.0-alpha.20200328
Platform: Windows 10 64-bit

Current Behavior:

I downloaded the latest traineddata files from tessdata and tried the following code:

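import cv2
import pytesseract

img = cv2.imread(file_path)  # file_path: path to the input image
hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension='hocr', lang="eng+equ")
with open("test.html", 'wb') as f:  # hocr_data is bytes, so write in binary mode
    f.write(hocr_data)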

Input file 006

Output looks something like chrome_k33xFY87or

Expected Behavior:

It should detect those equations.

Suggested Fix:

Add LaTeX as a new language for detecting math equations. Then we can easily put those LaTeX math equations into the hOCR file, as mentioned at this link

zdenop commented 4 years ago

Tesseract is not suitable for this kind of input text.

NavpreetDevpuri commented 4 years ago

@zdenop Why not? Why not improve Tesseract for that kind of text detection? This is also a part of text detection. Most documents include some kind of equations, so I think detecting equations is as important as detecting English.

zdenop commented 4 years ago

Sure you can. Just send a patch and I will merge it.

NavpreetDevpuri commented 4 years ago

Can you help me with this? How? Maybe it needs some test data for training? And how do I train it? Sorry, but I am new to Tesseract. Do you have any better approach?

NavpreetDevpuri commented 4 years ago

I found that. Let me try!

zdenop commented 4 years ago

No, training will not help you. There are dedicated solutions for such tasks, e.g. https://mathpix.com/

NavpreetDevpuri commented 4 years ago

No, training will not help you. There are dedicated solutions for such tasks, e.g. https://mathpix.com/

Yes, I looked at it, but it's not free and open source.

I am now looking for more solutions. So far I have found a few:
http://www.inftyproject.org/en/software.html
https://github.com/UW-COSMOS/latex-ocr
https://github.com/blaisewang/img2latex-mathpix

I am testing those now.

NavpreetDevpuri commented 4 years ago

Leaving it for now; maybe I will try it later.

But I think it would be much better if Tesseract had LaTeX as another language option for research documents, or any kind of documents that contain math equations.

roostinghawk commented 4 years ago

So you still haven't found a solution for detecting math symbols? I am struggling with this problem too...

Breakfastisready commented 4 years ago

Hello!

Have you found any alternatives yet @NavpreetDevpuri ?

I am also interested.

varunsawhney8 commented 3 years ago

Has anyone found something for this problem? Please let me know.

gulabsagevadiya commented 2 years ago

Has anyone found a solution for this? I am also looking to convert simple math equations to LaTeX.

daianaszwimer commented 1 year ago

same here, sad to see that there are no answers to this question :(

giovannav commented 1 year ago

I added this configuration line of code to the function:

import pytesseract

myconfig = r"--psm 11 --oem 3"  # psm 11: sparse text; oem 3: default OCR engine mode
text = pytesseract.image_to_string(cropped_img_loc, config=myconfig, lang='eng')

It doesn't work perfectly, but it does catch most of the most common math symbols.

adzcai commented 1 year ago

Any updates? I'm also looking for a tool to identify equations and replace them with LaTeX. It would be enough just to get bounding boxes; then I can crop those and feed them into a different tool.

I noticed equationdetect.cpp. Is there a convenient way to use that to extract bounding boxes?
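For the bounding-box part of this question, here is a minimal sketch of one possible workaround. It does not use equationdetect.cpp. Instead it assumes, as an untested heuristic, that Tesseract reports low confidence on words it recognizes poorly (which often includes math), and collects their boxes from `pytesseract.image_to_data`. The helper names `low_confidence_boxes` and `candidate_boxes_from_image` and the threshold of 60 are illustrative assumptions, not part of Tesseract or pytesseract.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def low_confidence_boxes(data: Dict[str, list], max_conf: float = 60.0) -> List[Box]:
    """Return boxes of recognized words whose confidence is below max_conf.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    The threshold is a guess and will need tuning per document.
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        # A confidence of -1 marks layout rows (page, block, line), not words.
        if conf < 0 or not str(text).strip():
            continue
        if conf < max_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def candidate_boxes_from_image(image_path: str, max_conf: float = 60.0) -> List[Box]:
    """Run Tesseract on an image file and return low-confidence word boxes."""
    # Imported here so the pure helper above works without these packages.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    data = pytesseract.image_to_data(
        img, lang="eng", output_type=pytesseract.Output.DICT
    )
    return low_confidence_boxes(data, max_conf)
```

The returned boxes are only candidates: ordinary text that is blurry or in an unusual font will also score low, so some filtering or merging of adjacent boxes would likely be needed before cropping.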

OskarGruberPR commented 1 year ago

... same question :)

Kiranism commented 1 year ago

yo! any workaround

Shadow-Alex commented 1 year ago

I've investigated this area for a long time. It's quite hard and few people are working on it. You can always look through ICDAR papers for any luck :)

Hadar933 commented 1 year ago

still no workaround huh?

Shadow-Alex commented 1 year ago

Try this: https://github.com/lukas-blecher/LaTeX-OCR.

For a full OCR version, try this: https://github.com/breezedeus/Pix2Text
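As a rough sketch of how a tool like LaTeX-OCR could be combined with the "crop the equation regions and feed them to a different tool" idea raised earlier in the thread: the code below crops each region from a PIL-style image and passes it to a recognizer. The helper names `regions_to_latex` and `run_with_pix2tex` and the `pad` parameter are hypothetical. The `LatexOCR` import follows the usage shown in the LaTeX-OCR README (the `pix2tex` package) as I understand it; treat it as an assumption and check the project's current documentation.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def regions_to_latex(image, boxes: Iterable[Box],
                     recognize: Callable, pad: int = 4) -> List:
    """Crop each box out of a PIL-style image and run `recognize` on the crop.

    `image` needs a `.size` attribute (width, height) and a `.crop` method
    that takes (left, upper, right, lower), as PIL.Image.Image does.
    """
    img_w, img_h = image.size
    results = []
    for left, top, width, height in boxes:
        # Add a small margin, clamped to the image bounds.
        region = (max(left - pad, 0), max(top - pad, 0),
                  min(left + width + pad, img_w), min(top + height + pad, img_h))
        results.append(recognize(image.crop(region)))
    return results


def run_with_pix2tex(image_path: str, boxes: Iterable[Box]) -> List[str]:
    """Assumed wiring to pix2tex; requires `pip install pix2tex` and Pillow."""
    from PIL import Image
    from pix2tex.cli import LatexOCR  # assumption: API as in the project README

    model = LatexOCR()  # loads model weights on first use
    return regions_to_latex(Image.open(image_path), boxes, model)
```

Keeping the recognizer as a plain callable means the same cropping loop could be tried with any of the tools mentioned in this thread, as long as they accept an image and return a string.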