Document OCR recommendation?

MichaelRinger commented 3 years ago

Hi,

Can someone recommend a good, free OCR engine to use in combination with LayoutLM?

CacTt4ck commented 3 years ago

I think you can use pytesseract; it's quite simple to use.
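For anyone wiring this up: a minimal sketch of how pytesseract output could be turned into LayoutLM input. LayoutLM expects one bounding box per word, scaled to the 0-1000 range, and `pytesseract.image_to_data` with `Output.DICT` returns parallel lists under keys such as `text`, `conf`, `left`, `top`, `width` and `height`. The helper names (`normalize_box`, `words_and_boxes`, `ocr_page`) and the `min_conf` cutoff are my own assumptions, not part of either library.

```python
def normalize_box(left, top, width, height, page_width, page_height):
    """Scale a pixel box to the 0-1000 coordinate range used by LayoutLM."""
    x0 = int(1000 * left / page_width)
    y0 = int(1000 * top / page_height)
    x1 = int(1000 * (left + width) / page_width)
    y1 = int(1000 * (top + height) / page_height)
    return [x0, y0, x1, y1]


def words_and_boxes(data, page_width, page_height, min_conf=0.0):
    """Convert a pytesseract image_to_data dict into (words, boxes).

    Entries with empty text or a confidence below min_conf are skipped.
    Tesseract reports conf = -1 for non-word rows, so the default
    cutoff of 0 drops them. min_conf is a hypothetical knob.
    """
    words, boxes = [], []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        conf = float(data["conf"][i])  # may be str, int or float
        if not text or conf < min_conf:
            continue
        words.append(text)
        boxes.append(
            normalize_box(
                data["left"][i], data["top"][i],
                data["width"][i], data["height"][i],
                page_width, page_height,
            )
        )
    return words, boxes


def ocr_page(image_path):
    """Run Tesseract on one image and return LayoutLM-style words and boxes."""
    # Imported here so the helpers above work without Tesseract installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_and_boxes(data, image.width, image.height)
```

Calling `ocr_page("cv_page.png")` (a placeholder path) would give two lists of equal length that can be passed on to a LayoutLM tokenizer step.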

ninjakx commented 3 years ago

Use CRAFT for word detection and then use pytesseract for recognition.

MichaelRinger commented 3 years ago

I want to use it on CVs and got pretty inaccurate results with pytesseract. I'd like results similar to Google's vision AI: https://cloud.google.com/document-ai?hl=en Is pytesseract with CRAFT a lot better?

ninjakx commented 3 years ago

It will improve the results a bit. The main issue with Tesseract is that you need to preprocess the image and binarize it before passing it in. Choosing the best thresholding method and the blur kernel size and value is tedious, and one setting might not work well for all cases. CRAFT is trained on scene text in the wild, so you should be able to detect words irrespective of the background, and then you can easily apply Tesseract to each detected word (optionally binarizing the cropped word before passing it in).
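To make the thresholding part concrete, here is a rough sketch of Otsu's method, one common way to pick a binarization threshold automatically. It is written in plain Python over a flat list of 8-bit grayscale values so the logic is visible; in practice one would more likely call OpenCV's `cv2.threshold` with `cv2.THRESH_OTSU`. The function names are assumptions for illustration, and Otsu is only one of several threshold methods you might try.

```python
def otsu_threshold(pixels):
    """Return the threshold in 0..255 that maximizes between-class variance.

    pixels is a flat iterable of integer grayscale values in 0..255.
    Values <= threshold form one class, values > threshold the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # all pixels are in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (at or below threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

This works well when the histogram has two clear peaks (ink and paper). On uneven lighting or textured backgrounds a single global threshold tends to break down, which is where the tuning effort comes from.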

If you rely on Tesseract alone for detection as well, you might not capture every word. So it's better to use the combination: CRAFT + pytesseract.
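A sketch of how the two stages could be glued together, under some assumptions: the detector is taken to return one polygon of (x, y) points per word (the shape CRAFT-style detectors typically produce; the exact output format depends on the CRAFT implementation you use), and the image is a PIL-like object with `.size` and `.crop`. `polygon_to_box`, `recognize_words` and the `pad` margin are hypothetical names. The recognizer is passed in as a function, so Tesseract can be swapped out later.

```python
import math


def polygon_to_box(polygon, image_width, image_height, pad=2):
    """Turn a word polygon into a padded, clipped (left, top, right, bottom) box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left = max(0, math.floor(min(xs)) - pad)
    top = max(0, math.floor(min(ys)) - pad)
    right = min(image_width, math.ceil(max(xs)) + pad)
    bottom = min(image_height, math.ceil(max(ys)) + pad)
    return (left, top, right, bottom)


def recognize_words(image, polygons, recognize):
    """Crop each detected word and run the recognizer on the crop.

    Returns a list of (text, box) pairs, skipping crops with empty text.
    """
    width, height = image.size
    results = []
    for polygon in polygons:
        box = polygon_to_box(polygon, width, height)
        text = recognize(image.crop(box)).strip()
        if text:
            results.append((text, box))
    return results


def tesseract_single_word(crop):
    """Recognize one cropped word; --psm 8 tells Tesseract to expect a single word."""
    import pytesseract  # imported lazily so the helpers above need no Tesseract

    return pytesseract.image_to_string(crop, config="--psm 8")
```

With real data this would be called as `recognize_words(image, polygons, tesseract_single_word)`, where `polygons` comes from whatever detector you run first. A small padding margin around each crop usually helps because tight crops can clip ascenders and descenders.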

On receipt data, I found the results from this combination better than from Tesseract alone.

ninjakx commented 3 years ago

If you want to improve the recognition side as well, try retraining Tesseract on a custom font dataset :)

ruifcruz / sroie-on-layoutlm
