Closed NavpreetDevpuri closed 4 years ago
Tesseract is not suitable for this kind of input text.
@zdenop Why not? Why not improve Tesseract for this kind of text detection? It is also part of text detection. Most documents include some kind of equations, so I think detecting equations is as important as detecting English.
Sure you can. Just send a patch and I will merge it.
Can you help me with this? How would I do it? Maybe it needs some testing data for training? And how do I train it? Sorry, but I am new to Tesseract. Do you have a better approach?
I found that. Let me try!
No, training will not help you. There are dedicated solutions for such tasks, e.g. https://mathpix.com/
Yes, I looked at it, but it's not free or open source.
Now I am looking for more solutions. So far I have found a few: http://www.inftyproject.org/en/software.html https://github.com/UW-COSMOS/latex-ocr https://github.com/blaisewang/img2latex-mathpix
Now I am testing those.
Leaving it for now; maybe I will try it later.
But I think it would be much better if Tesseract had LaTeX as another language option for research documents, or any kind of document that contains math equations.
So you still haven't found a solution for detecting math symbols? I am struggling with this problem too...
Hello!
Have you found any alternatives yet @NavpreetDevpuri ?
I am also interested.
Has anyone found something for this problem? Please let me know.
Has anyone found a solution for this? I am also looking to convert simple math equations to LaTeX.
Same here. Sad to see that there are no answers to this question :(
I added this configuration to the function:
import pytesseract
# --psm 11: sparse text; --oem 3: default engine mode
myconfig = r"--psm 11 --oem 3"
text = pytesseract.image_to_string(cropped_img_loc, config=myconfig, lang='eng')
It doesn't work perfectly, but it does catch most of the common math symbols.
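For anyone trying the settings above, here is a minimal sketch of the same call wrapped in a small helper. The function names `build_tesseract_config` and `ocr_sparse_text` are hypothetical, and it assumes pytesseract, Pillow, and the Tesseract binary are installed. It only reproduces the `--psm 11 --oem 3` setup from the comment above; it does not add any real equation support.

```python
def build_tesseract_config(psm: int = 11, oem: int = 3) -> str:
    """Build the string passed to pytesseract's `config` argument.

    Hypothetical helper. Defaults match the comment above:
    --psm 11 (sparse text) and --oem 3 (default engine mode).
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--psm {psm} --oem {oem}"


def ocr_sparse_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image using the sparse-text settings."""
    # Imported lazily so build_tesseract_config works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, config=build_tesseract_config(), lang=lang
        )
```

Calling `ocr_sparse_text` with the path to a cropped image would then return the recognized text as a plain string, with the same limitations noted above.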
Any updates? I'm also looking for a tool to identify equations and replace them with LaTeX. It would be enough just to get bounding boxes; then I can crop those and feed them into a different tool.
I noticed equationdetect.cpp. Is there a convenient way to use that to extract bounding boxes?
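On the bounding-box idea, one rough sketch (not a use of equationdetect.cpp, and not real equation detection) is to take the word boxes that `pytesseract.image_to_data` returns and keep the ones Tesseract is unsure about, on the assumption that math tends to be recognized with low confidence. The helper name `low_confidence_boxes`, the threshold of 60, and the padding are all made up for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def low_confidence_boxes(
    data: Dict[str, list], max_conf: float = 60.0, pad: int = 2
) -> List[Box]:
    """Return padded boxes for words recognized with low confidence.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    Treating low confidence as "maybe an equation" is only a heuristic.
    """
    boxes: List[Box] = []
    rows = zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    )
    for text, conf, left, top, width, height in rows:
        conf = float(conf)
        if conf < 0 or not str(text).strip():
            continue  # skip block/line rows and empty entries
        if conf <= max_conf:
            boxes.append((
                max(0, left - pad),
                max(0, top - pad),
                left + width + pad,
                top + height + pad,
            ))
    return boxes


# Possible usage (requires pytesseract and Pillow; "page.png" is a placeholder):
#   import pytesseract
#   from PIL import Image
#   img = Image.open("page.png")
#   data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
#   crops = [img.crop(box) for box in low_confidence_boxes(data)]
```

The crops could then be passed to a separate image-to-LaTeX tool. Expect false positives such as smudged ordinary words, and boxes that split one equation into several pieces.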
... same question :)
yo! any workaround
I've investigated this area for a long time. It's quite hard, and few people are working on it. You can always look through ICDAR papers for any luck :)
still no workaround huh?
Try this: https://github.com/lukas-blecher/LaTeX-OCR.
For a full OCR version, try this: https://github.com/breezedeus/Pix2Text
Similar to https://github.com/tesseract-ocr/tesseract/issues/2204 and https://github.com/tesseract-ocr/tesseract/issues/1890
Environment
Current Behavior:
I downloaded the latest traineddata files from tessdata and tried the following code.
Input file
Output looks something like
Expected Behavior:
It should detect those equations.
Suggested Fix:
Use LaTeX as a new language for detecting math equations. Then we can easily put those LaTeX math equations into the hOCR file, as mentioned at this link