madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

allow multiple output #511

Closed badGarnet closed 1 year ago

badGarnet commented 1 year ago

summary

This PR resolves #304 by adding a new function run_and_get_multiple_output that can take multiple extensions (output formats) and return them after one invocation of tesseract. This saves compute time when the user tries to get multiple outputs from one input, e.g.,

text, pdf = run_and_get_multiple_output(image, extensions=['txt', 'pdf'])

walkthrough

The main addition in this PR is the function run_and_get_multiple_output. It accepts a list of extensions like ['pdf', 'txt']. Internally this function:

  1. assembles the command line config arguments by mapping each extension to its required config arguments (stored as a constant in EXTENTION_TO_CONFIG),
  2. invokes tesseract just once to generate all the files needed,
  3. loads the result for each extension and returns the results in the same order as the input extensions.
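The first step above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the mapping name EXTENSION_TO_CONFIG_SKETCH, the helper build_multi_output_config, and the choice to raise ValueError on unsupported extensions are all assumptions made for the example. The three tessedit_create_* entries are Tesseract config variables, shown here for only a few extensions.

```python
from typing import Dict, List

# Illustrative subset only; the PR keeps its own mapping in EXTENTION_TO_CONFIG.
EXTENSION_TO_CONFIG_SKETCH: Dict[str, str] = {
    'txt': '-c tessedit_create_txt=1',
    'pdf': '-c tessedit_create_pdf=1',
    'box': '-c tessedit_create_boxfile=1',
}


def build_multi_output_config(extensions: List[str], user_config: str = '') -> str:
    """Join the per-extension config arguments into one config string.

    Hypothetical helper: rejects extensions that have no '-c' style config,
    such as 'osd', because they cannot be combined this way.
    """
    unsupported = [ext for ext in extensions if ext not in EXTENSION_TO_CONFIG_SKETCH]
    if unsupported:
        raise ValueError(f'unsupported extensions: {unsupported}')

    parts = [EXTENSION_TO_CONFIG_SKETCH[ext] for ext in extensions]
    if user_config:
        # Append any extra config the caller passed in.
        parts.append(user_config)
    return ' '.join(parts)
```

With a single combined config string, tesseract can be invoked once and asked to write every requested file in that run.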

Note that this PR supports only a subset of all extensions, limited to those whose config arguments can be combined in a single invocation. For example, the extension osd requires a different command line parameter (--psm instead of -c) and is therefore not yet supported by this new function.

This PR refactors the function run_tesseract so it can handle multiple extensions: the key change is to filter out extensions that do not need to be appended to the command line arguments.

This PR also refactors the code that reads the output into a helper _read_output so it can be reused by both the new run_and_get_multiple_output and existing run_and_get_output.
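A shared reader of this kind could look roughly like the sketch below. The names read_output_sketch and collect_outputs_sketch are hypothetical, and treating only pdf as binary is an assumption made for the example; the actual _read_output helper in the PR may differ in signature and behavior.

```python
import os
import tempfile
from typing import List, Union


def read_output_sketch(filename: str, return_bytes: bool = False,
                       encoding: str = 'utf-8') -> Union[str, bytes]:
    """Read one output file, either as raw bytes or as decoded text."""
    with open(filename, 'rb') as output_file:
        data = output_file.read()
    return data if return_bytes else data.decode(encoding)


def collect_outputs_sketch(base_path: str,
                           extensions: List[str]) -> List[Union[str, bytes]]:
    """Load '<base_path>.<ext>' for each extension, preserving input order."""
    return [
        # Assumption: only 'pdf' is returned as bytes in this sketch.
        read_output_sketch(f'{base_path}.{ext}', return_bytes=(ext == 'pdf'))
        for ext in extensions
    ]


# Small demonstration with stand-in files instead of real tesseract output.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'out')
    with open(base + '.txt', 'w', encoding='utf-8') as f:
        f.write('hello')
    with open(base + '.pdf', 'wb') as f:
        f.write(b'%PDF-1.4')
    text, pdf = collect_outputs_sketch(base, ['txt', 'pdf'])
```

Keeping the file reading in one place means the single-output and multi-output paths decode results the same way.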

test

This PR adds a unit test covering a few combinations of extension lists. I'd encourage the reviewer to run the function locally with a simple example such as

text, boxes = run_and_get_multiple_output(image, extensions=['txt', 'box'])

and compare its runtime to

text = image_to_string(image)
boxes = image_to_boxes(image)

The above example can be a common usage pattern for follow-up analysis on the OCR results.
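For reviewers who want to time the two approaches, a rough harness might look like this. It is a sketch under assumptions: best_of and compare are hypothetical helpers, the image path is supplied by the caller, and it assumes that pytesseract with this PR applied, Pillow, and a tesseract binary are installed, and that run_and_get_multiple_output is exposed at the package top level.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time (seconds) over several calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(image_path: str) -> None:
    """Time one combined tesseract run against two separate runs."""
    # Imported here so the timing helper above works without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    combined = best_of(
        lambda: pytesseract.run_and_get_multiple_output(image, extensions=['txt', 'box'])
    )
    separate = best_of(
        lambda: (pytesseract.image_to_string(image), pytesseract.image_to_boxes(image))
    )
    print(f'combined: {combined:.3f}s, separate: {separate:.3f}s')
```

Calling compare with the path to any local test image prints both timings; taking the minimum over a few repeats reduces noise from process startup and disk caching.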