How to extract by coordinates

thiagoalessio / tesseract-ocr-for-php

A wrapper to work with Tesseract OCR inside PHP.

https://packagist.org/packages/thiagoalessio/tesseract_ocr

MIT License

2.86k stars, 551 forks

How to extract by coordinates #194

Closed stccorp closed 3 years ago

stccorp commented 4 years ago

Hello, I am hoping someone can point me in the right direction. I want to extract information from an image or PDF by specifying a bounding box in pixels (e.g. x1, y1, x2, y2) for each field. Does this library allow something like that?

Thank you

stccorp commented 4 years ago

I was able to do it in a very simple way, in case anyone is interested. For some reason I was not getting reliable output when sending a binary object instead of a file; I have no idea why at this time. But creating files and sending one file at a time worked.

ocr1.txt

thiagoalessio commented 3 years ago

Yeah, cropping the image before sending it to tesseract is a good solution!

Another way would be to recognize everything, but use the hocr() option and select only the text whose bounding boxes fall within the desired coordinates. https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#hocr-output
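For anyone who wants to try the hOCR route, here is a minimal sketch of the filtering step. It is written in Python purely for illustration (the same logic ports to PHP) and assumes you already have the hOCR output as a string. It also assumes the usual Tesseract hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `WordCollector` and `text_in_box` are made up for this example and are not part of any library.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def text_in_box(hocr, box):
    """Return the words whose bbox lies entirely inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    picked = [
        text
        for text, (wx0, wy0, wx1, wy1) in parser.words
        if wx0 >= x1 and wy0 >= y1 and wx1 <= x2 and wy1 <= y2
    ]
    return " ".join(picked)
```

Calling `text_in_box(hocr_string, (x1, y1, x2, y2))` once per field gives you the text for each bounding box from a single OCR pass. This sketch keeps only words that are fully inside the box; if your fields are tight you may prefer an overlap test instead.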