pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Support for Page Segmentation Mode for calling Tesseract OCR #3122

Open stevesimmons opened 7 months ago

stevesimmons commented 7 months ago

Feature request

Could OCR via Tesseract gain a user-settable parameter for page segmentation mode (psm)?

This would be very useful because when source documents are forms, OCR recognizes the scattered pieces of text much better with psm 11 than the default psm 3.

It would be easiest with an optional parameter for psm in Page.get_textpage_ocr like this:

tp = page.get_textpage_ocr(dpi=300, full=True, psm=11, tessdata="...")

Benefit

Here's an example with a one-page form I tried it on.

PyMuPDF today extracts 483 characters using standard "full page" OCR. Calling Tesseract directly with psm 11 gets 703 characters, about 45% more. The missing text makes a huge difference!

import subprocess
import fitz  # PyMuPDF

doc = fitz.Document(stream=raw)                 # raw: bytes of the source PDF

# Standard PyMuPDF OCR
page = doc[0]
tp = page.get_textpage_ocr(
    flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_PRESERVE_IMAGES,
    dpi=300, full=True, tessdata='/usr/share/tesseract-ocr/5/tessdata',
)
text = tp.extractTEXT()
print(len(text))                                # Default OCR on my sample doc got 483 characters

# Calling Tesseract directly, setting psm to 11 for disconnected text
pm = page.get_pixmap(dpi=300)
img = pm.tobytes('png')
rc = subprocess.run(
    "tesseract stdin stdout --psm 11 -l eng",
    input=img, stdout=subprocess.PIPE, shell=True,
)                                                 
text = rc.stdout.decode()
print(len(text))                               # OCR with psm=11 got 704 characters
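
The direct Tesseract call above can be wrapped so that trying different PSMs is a one-argument change. This is only a sketch: `build_tesseract_cmd` and `ocr_png_bytes` are hypothetical helper names, not part of PyMuPDF, and it assumes the `tesseract` executable is on the PATH with the `eng` language data installed. Passing the command as a list also avoids `shell=True`.

```python
import subprocess

VALID_PSM = range(0, 14)  # the Tesseract CLI accepts --psm values 0 through 13


def build_tesseract_cmd(psm=3, lang="eng"):
    """Return the argv list for OCRing an image piped in on stdin."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["tesseract", "stdin", "stdout", "--psm", str(psm), "-l", lang]


def ocr_png_bytes(png_bytes, psm=3, lang="eng"):
    """Run the Tesseract CLI on PNG bytes and return the recognized text."""
    result = subprocess.run(
        build_tesseract_cmd(psm, lang),
        input=png_bytes,
        stdout=subprocess.PIPE,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With the `page` object from the snippet above, `ocr_png_bytes(page.get_pixmap(dpi=300).tobytes('png'), psm=11)` would reproduce the psm 11 run.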

Implementation notes

The new psm parameter would need to be passed to Pixmap.pdfocr_save(....). (L8369 in https://github.com/pymupdf/PyMuPDF/blob/056e3e43c8b99b6ec9657d7e4edb398f7826c03c/src_classic/fitz_old.i)

From there it would need to be passed on to MuPDF's pixmap.ocr_recognize (L231 in https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/fitz/tessocr.cpp).

I found an example of how the psm parameter is set in the Tesseract API docs (the snippet is C++): https://tesseract-ocr.github.io/tessdoc/APIExample.html

  PIX *image = pixRead(inputfile);
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init("/usr/src/tesseract/", "eng");
  api->SetPageSegMode(tesseract::PSM_AUTO_OSD); /* We'd need our input PSM here! */
  api->SetImage(image);
  api->Recognize(0);

Background on PSMs

A good writeup of the various page segmentation modes is here: https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
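
For quick reference, the numeric `--psm` values map onto Tesseract's `PageSegMode` enum as sketched below. The Python class is purely illustrative (it is not part of PyMuPDF or of any Tesseract binding); the member names mirror the C++ `PSM_*` constants with the prefix dropped.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Illustrative mirror of tesseract::PageSegMode (PSM_ prefix dropped)."""
    OSD_ONLY = 0                # orientation and script detection only
    AUTO_OSD = 1                # automatic segmentation with OSD
    AUTO_ONLY = 2               # automatic segmentation, no OSD or OCR
    AUTO = 3                    # fully automatic segmentation, no OSD
    SINGLE_COLUMN = 4           # single column of variable-size text
    SINGLE_BLOCK_VERT_TEXT = 5  # single uniform block of vertical text
    SINGLE_BLOCK = 6            # single uniform block of text
    SINGLE_LINE = 7             # single text line
    SINGLE_WORD = 8             # single word
    CIRCLE_WORD = 9             # single word in a circle
    SINGLE_CHAR = 10            # single character
    SPARSE_TEXT = 11            # find as much text as possible, in no order
    SPARSE_TEXT_OSD = 12        # sparse text with OSD
    RAW_LINE = 13               # single line, bypassing Tesseract-specific hacks
```

So the psm 11 used for the form corresponds to `PSM_SPARSE_TEXT`, psm 3 to `PSM_AUTO`, and the `PSM_AUTO_OSD` in the C++ snippet to psm 1.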

stevesimmons commented 7 months ago

Actually after experiments with more PSMs in Tesseract, I'm now less sure PSM makes a difference. I half suspect there's something between Tesseract and get_textpage_ocr that drops output.

I eliminated the chance that PyMuPDF was only sending part of my page to Tesseract by doing get_textpage_ocr on the pixmap image in my comment above (which I checked has all the text, in a 300dpi png) rather than my original PDF (which, for the record, is from Microsoft Print To PDF). The same result as before came back: roughly 30% of my text is missing.

Here's the code to run PyMuPDF OCR on the page image rather than the original PDF page:

doc2 = fitz.Document(stream=img)
page = doc2[0]
tp = page.get_textpage_ocr(
    dpi=300, full=True, language='eng', tessdata='/usr/share/tesseract-ocr/5/tessdata',
)
text = tp.extractTEXT()
text_block = text.replace('\n', '|').replace('||', '|')
print(f"PyMuPDF on page image with get_textpage_ocr at 300dpi: text length: {len(text)}\n{text_block}\n")
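
To see which pieces of text are being dropped, rather than only comparing lengths, the two OCR outputs can be compared word by word. A rough sketch follows; `missing_words` is a hypothetical helper, and it assumes whitespace-separated tokens are a good enough unit of comparison.

```python
from collections import Counter


def missing_words(reference_text, candidate_text):
    """Return words that occur more often in reference_text than in candidate_text."""
    ref = Counter(reference_text.split())
    cand = Counter(candidate_text.split())
    # Counter subtraction keeps only positive counts, i.e. the words that are missing.
    return sorted((ref - cand).elements())
```

Calling it with the direct Tesseract output as `reference_text` and the get_textpage_ocr output as `candidate_text` lists the words that only the direct call recognized.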

JorjMcKie commented 7 months ago

Interesting ideas for sure - thanks for submitting this! As you wrote, given a general PDF page, PyMuPDF is quite flexible: it can OCR either the full page or only the images on it, accepting any standard text on the page as is. For a full-page OCR the DPI value makes sense, although only up to the inherent resolution of the image that represents the PDF page. In such a case (a scanned document), extracting that image and OCRing it directly probably delivers the best recognition rate possible, except potentially for what PSM could add.

At this point I should mention that it actually is our base library MuPDF that does the Tesseract communication. MuPDF would have to offer specifying that parameter and hand it through to Tesseract. PyMuPDF is unable to do anything on its own here.

May I suggest discussing options directly with the MuPDF colleagues? They are just a click away at our sister MuPDF Discord channel.

stevesimmons commented 7 months ago

Stepping back from the detail, I'm surprised (in a way that I rarely am with PyMuPDF!) that get_textpage_ocr is missing big chunks of clear text from my PDF... I'll raise in the MuPDF discord as you suggest, and post the end result back here to close off the issue.

JorjMcKie commented 7 months ago

> Stepping back from the detail, I'm surprised (in a way that I rarely am with PyMuPDF!) that get_textpage_ocr is missing big chunks of clear text from my PDF... I'll raise in the MuPDF discord as you suggest, and post the end result back here to close off the issue.

Well, that hurts. Can you let me have an example?

JorjMcKie commented 7 months ago

Submitted enhancement request to the MuPDF team here.