ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Azure ocr with ocrmypdf #595

Open sandipan1 opened 3 years ago

sandipan1 commented 3 years ago

ocrmypdf works great with PDFs of scanned images. However, for handwritten letters the tesseract-ocr engine often struggles. How do I use the Azure OCR API as the OCR engine while keeping everything else the same?

jbarlow83 commented 3 years ago

OCRmyPDF has a plugin interface that would allow you to replace Tesseract with a different OCR engine such as Azure. To the best of my knowledge no one has published a plugin that does this (or for that matter, any plugin, since the plugin interface is quite new).

OCRmyPDF can only interpret the hOCR format or a text only PDF, so you'd have to convert Azure's output to one of those two as well, since unfortunately it does not support either standard (last time I looked, anyway).

sandipan1 commented 3 years ago

The Azure output looks something like this: {"status": "Succeeded", "recognitionResult": {"lines": [{"boundingBox": [292, 146, 780, 144, 781, 218, 293, 220], "text": "string1", "words": [{"boundingBox": [297, 150, 774, 145, 775, 218, 300, 218], "text": "string2"}]}, {"boundingBox": [327, 215, 748, 219, 747, 255, 326, 252], "text": "string3 string4", "words": [{"boundingBox": [330, 219, 496, 219, 498, 253, 332, 251], "text": "string3"}, "text": "string4"}]}]}}

Is it possible to convert this into one of the formats that you mentioned?

PackElend commented 3 years ago

Is it possible to convert this into one of the formats that you mentioned?

If you look at the hOCR format example given on Wikipedia, I would say yes. Also, have a look here: https://stackoverflow.com/questions/62074677/generate-hocr-from-microsoft-computer-vision-ocr
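As a rough illustration of why the answer is yes, here is a minimal sketch of such a conversion. It assumes the JSON shape posted above (`recognitionResult` / `lines` / `words`, each with an 8-number `boundingBox` polygon). The function names `polygon_to_bbox` and `azure_to_hocr` are made up for this example, and the page size has to be passed in because the posted JSON does not contain it. It only emits the basic `ocr_page`, `ocr_line`, and `ocrx_word` elements; whether that is enough for a given hOCR consumer is not verified here.

```python
from html import escape


def polygon_to_bbox(polygon):
    """Collapse an 8-number polygon (x1, y1, ..., x4, y4) into the
    axis-aligned 'x0 y0 x1 y1' box used in hOCR title attributes."""
    xs, ys = polygon[0::2], polygon[1::2]
    return f"{min(xs)} {min(ys)} {max(xs)} {max(ys)}"


def azure_to_hocr(result, page_width, page_height):
    """Render one page of Azure output (shape as posted above) as hOCR.

    page_width and page_height are in pixels and must be supplied by the
    caller (assumption: the OCR coordinates are pixel coordinates).
    """
    parts = [
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        "<head><title></title></head>",
        "<body>",
        f'<div class="ocr_page" title="bbox 0 0 {page_width} {page_height}">',
    ]
    for line in result["recognitionResult"]["lines"]:
        line_box = polygon_to_bbox(line["boundingBox"])
        parts.append(f'<span class="ocr_line" title="bbox {line_box}">')
        for word in line.get("words", []):
            word_box = polygon_to_bbox(word["boundingBox"])
            parts.append(
                f'<span class="ocrx_word" title="bbox {word_box}">'
                f"{escape(word['text'])}</span>"
            )
        parts.append("</span>")
    parts += ["</div>", "</body>", "</html>"]
    return "\n".join(parts)


# First line of the sample posted above; the page size is a made-up value.
sample = {
    "recognitionResult": {
        "lines": [
            {
                "boundingBox": [292, 146, 780, 144, 781, 218, 293, 220],
                "text": "string1",
                "words": [
                    {
                        "boundingBox": [297, 150, 774, 145, 775, 218, 300, 218],
                        "text": "string2",
                    }
                ],
            }
        ]
    }
}
hocr = azure_to_hocr(sample, page_width=1000, page_height=1400)
```

Taking the min/max of the polygon corners throws away any rotation of the text, which is a simplification; slanted handwriting would get a slightly oversized box.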

Another alternative could be EasyOCR (https://github.com/JaidedAI/EasyOCR), but its output is just a simple list. I think that could easily be converted to hOCR.
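To make the "simple list" idea concrete, here is a small sketch. It assumes each list entry is a tuple of (four corner points, text, confidence between 0 and 1), which is how EasyOCR's list output is commonly described; check this against the version you use. The function name `detections_to_hocr_lines` and the sample data are hypothetical, and the code does not call EasyOCR itself.

```python
from html import escape


def detections_to_hocr_lines(detections):
    """Convert [(corner_points, text, confidence), ...] into hOCR spans.

    Each detection becomes an ocr_line holding a single ocrx_word, because
    the list gives one box per text chunk rather than one per word.
    """
    spans = []
    for points, text, confidence in detections:
        xs = [int(round(x)) for x, _ in points]
        ys = [int(round(y)) for _, y in points]
        bbox = f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"
        wconf = int(round(confidence * 100))  # x_wconf is a 0-100 percentage
        spans.append(
            f'<span class="ocr_line" title="{bbox}">'
            f'<span class="ocrx_word" title="{bbox}; x_wconf {wconf}">'
            f"{escape(text)}</span></span>"
        )
    return spans


# Hypothetical detections in the assumed list shape.
detections = [
    ([[189, 75], [469, 75], [469, 165], [189, 165]], "Invoice", 0.5),
    ([[86, 80], [134, 80], [134, 128], [86, 128]], "A & B", 0.75),
]
hocr_lines = detections_to_hocr_lines(detections)
```

The returned spans still need to be wrapped in an `ocr_page` element inside an XHTML document before they form a complete hOCR file.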

All3xJ commented 2 years ago

I asked in https://github.com/ocrmypdf/OCRmyPDF/issues/915 about the Google API, but it is the same question as yours. The ideal would be to grab the information from these services' APIs and then "paste" it into the PDF as invisible text, like OCRmyPDF already does.

kkrell2016 commented 1 year ago

I asked in #915 about the Google API, but it is the same question as yours. The ideal would be to grab the information from these services' APIs and then "paste" it into the PDF as invisible text, like OCRmyPDF already does.

It's not that hard to implement. I have some very basic Python code that uses the Google Vision API to get better results. For the page orientation I use Tesseract, because it's the easiest way. For text recognition, the image is sent to the Vision API, the JSON response gets converted to hOCR, and you have your text layer.

isspid commented 1 year ago

@kkrell2016 May I ask how the conversion from JSON to hOCR happens? Have you written your own script for that purpose?

kkrell2016 commented 1 year ago

@isspid I found a project called gcv2hocr and combined it with some custom Python code. The custom Python script can be run as a plugin in ocrmypdf. I also uploaded it to my GitHub; it should be publicly available. I had to modify gcv2hocr a bit to make it work with the current Google Vision API.

If you have any questions, please contact me.

RAbraham commented 1 year ago

I think the plugin interface above (e.g. generate_pdf(input_file, output_pdf, output_text, options)) will be called for each page in the PDF rather than for the whole PDF? Is there an interface that passes in the entire PDF, so that we can return a list of hOCR files generated from an Azure OCR result, or a single hOCR file covering many pages (if that is possible in the hOCR format; I still have to learn it)?

Then I can call

    # something like this for multiple pages?
    from ocrmypdf import hocrtransform

    helper = hocrtransform.HocrTransform(
        hocr_filename=hocr_file,  # or list_of_hocr_files
        dpi=150,
    )

    helper.to_pdf(out_filename=output_pdf)  # a multi page pdf

Our use case is that we send a batch of pages to Azure OCR (otherwise processing many pages would be very slow for us), and it returns one Azure OCR result object for all the pages. I can loop through each page object in that result and generate either a list of hOCR objects (one per page) or a single hOCR object (if that's possible).
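The "one hOCR object per page" variant of that loop could look roughly like the sketch below. It assumes the batched result has the shape `analyzeResult.readResults[]`, where each entry carries `page`, `width`, `height`, and `lines` in pixel units. That shape is an assumption about the Azure Read response and should be checked against the actual payload. `hocr_per_page` and `_bbox` are hypothetical helper names.

```python
from html import escape


def _bbox(polygon):
    """Axis-aligned hOCR bbox from an (x1, y1, ..., x4, y4) polygon."""
    xs, ys = polygon[0::2], polygon[1::2]
    return f"bbox {round(min(xs))} {round(min(ys))} {round(max(xs))} {round(max(ys))}"


def hocr_per_page(batch_result):
    """Return {page_number: hOCR document} for every page in the batch."""
    documents = {}
    for page in batch_result["analyzeResult"]["readResults"]:
        lines = []
        for line in page["lines"]:
            words = " ".join(
                f'<span class="ocrx_word" title="{_bbox(w["boundingBox"])}">'
                f"{escape(w['text'])}</span>"
                for w in line["words"]
            )
            lines.append(
                f'<span class="ocr_line" title="{_bbox(line["boundingBox"])}">'
                f"{words}</span>"
            )
        page_box = f"bbox 0 0 {round(page['width'])} {round(page['height'])}"
        documents[page["page"]] = "\n".join(
            ['<html xmlns="http://www.w3.org/1999/xhtml">', "<body>",
             f'<div class="ocr_page" title="{page_box}">']
            + lines
            + ["</div>", "</body>", "</html>"]
        )
    return documents


# Hypothetical two-page batch in the assumed shape.
batch = {"analyzeResult": {"readResults": [
    {"page": 1, "width": 1000, "height": 1400, "lines": [
        {"boundingBox": [10, 10, 200, 10, 200, 40, 10, 40], "text": "first",
         "words": [{"boundingBox": [10, 10, 200, 10, 200, 40, 10, 40],
                    "text": "first"}]}]},
    {"page": 2, "width": 1000, "height": 1400, "lines": [
        {"boundingBox": [20, 30, 300, 30, 300, 60, 20, 60], "text": "second",
         "words": [{"boundingBox": [20, 30, 300, 30, 300, 60, 20, 60],
                    "text": "second"}]}]},
]}}
documents = hocr_per_page(batch)
```

Keying the result by page number keeps each page's hOCR independent, so each document can be written to its own file and handed to whatever per-page step consumes it.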

RAbraham commented 1 year ago

I guess I can call the Azure engine in the global part of the file, cache the result, and then just pick it up from there when generate_pdf is called. But how do I know which key in the cache to pick? For example, I'd key the cache by page number, but from inside generate_pdf I won't know which page it is being called for, since it is not given a page number, IIUC?

I noticed generate_pdfa. Would that be useful here? I could then call the helper above in a loop and merge the single-page files produced by helper.to_pdf?
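One possible workaround for the missing page number, sketched under a loud assumption: that the per-page `input_file` handed to the plugin has a file name starting with a zero-padded, 1-based page number (for example `000003_ocr.png`). That naming is an internal detail that is not confirmed anywhere in this thread, so verify it against your OCRmyPDF version before relying on it. `page_number_from_path` and `lookup_cached_hocr` are hypothetical helpers, not part of OCRmyPDF.

```python
import re
from pathlib import Path

# Assumed naming scheme: "<zero-padded page number>_<suffix>", e.g. 000003_ocr.png
_PAGE_PREFIX = re.compile(r"^(\d+)_")


def page_number_from_path(input_file):
    """Extract the page number from the working file name.

    Raises ValueError when the name does not follow the assumed scheme,
    so a wrong assumption fails loudly instead of picking the wrong page.
    """
    match = _PAGE_PREFIX.match(Path(input_file).name)
    if match is None:
        raise ValueError(f"no page number prefix in {input_file!r}")
    return int(match.group(1))


def lookup_cached_hocr(cache, input_file):
    """Fetch the pre-computed hOCR for the page behind input_file."""
    return cache[page_number_from_path(input_file)]


# Hypothetical cache keyed by page number, filled from one batched OCR call.
hocr_cache = {1: "<hOCR for page 1>", 3: "<hOCR for page 3>"}
page = page_number_from_path("/tmp/work/000003_ocr.png")
cached = lookup_cached_hocr(hocr_cache, "/tmp/work/000003_ocr.png")
```

If the file name turns out not to carry the page number, the same cache could be keyed by something derived from the image content instead, at the cost of extra work per page.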

shamoon commented 4 months ago

Curious about all this myself. Anyone have a working example of converting e.g. the Azure output to hOCR?

deajan commented 3 months ago

@shamoon Looks like we found the same thread ^^ I'm currently trying to make EasyOCR compatible with paperless-ngx (see https://github.com/paperless-ngx/paperless-ngx/discussions/6056#discussioncomment-8801102). I found an Azure-to-hOCR script and will probably write my own for EasyOCR.

hcoona commented 1 week ago

Although the documentation says it can produce hOCR, I cannot find a workable solution.

Doc: https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/cognitive-search-with-skillsets#azure-computer-vision

Issue: https://github.com/Azure-Samples/cognitive-services-REST-api-samples/issues/109

hcoona commented 1 week ago

This may help: https://learn.microsoft.com/en-us/samples/azure-samples/azure-search-power-skills/azure-hocr-generator-sample/

code: https://github.com/azure-samples/azure-search-power-skills/tree/main/Vision/HocrGenerator