mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Multi-Page Input for the CLI #43

Closed parkerhancock closed 7 years ago

parkerhancock commented 7 years ago

Friendly suggestion: I love that Kraken supports Python 3 and is fairly lightweight on dependencies, but the lack of support for multi-page input is starting to be a deal-breaker. My workflow (which I suspect is fairly common) is to take a PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into an hOCR document (and then on to NLP-land).

Ocropy accepts glob characters (? and * wildcards) for multiple pages of input, and it can generate a consolidated hOCR file for the whole document. As far as I (and probably a lot of people) am concerned, these are must-have features.

So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API for using Kraken from Python code, without the CLI, on multi-page documents.

Maybe (2) already exists in some form or fashion, with some selective imports, etc. But on a cursory review, it's tough to pick out the pieces.

Thanks!

mittagessen commented 7 years ago

Oh, the CLI supports multi-page input. You can pass more than one -i input.tif output.hocr option to recognize multiple pages. It doesn't support serializing multiple input files into a single output file, though (ALTO doesn't even allow it), and I haven't seen any multi-page hOCR files in the wild.

One of the issues with glob-like input that I wanted to avoid is the inability to explicitly define inputs and outputs. The current syntax is rather verbose, but frankly I haven't found a better way yet.

There should be autogenerated (from docstrings) API documentation at http://kraken.re. The whole shebang is basically calling binarization.nlbin, pageseg.segment, and lib.models.load_any (to load the model), then feeding everything into rpred.rpred, which returns an iterator over all the lines.
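For anyone following along, here is a rough Python sketch of the call sequence described above, extended to loop over several page images. It is a sketch under assumptions, not the documented API: the helper names `recognize_page` and `recognize_document` are made up for illustration, the exact signatures and return types of the kraken functions may differ between versions, and the `.prediction` attribute on each yielded record is assumed to hold the line text. The kraken imports sit inside the function so the page loop can be tried without kraken installed.

```python
def recognize_page(image_path, model):
    """Binarize, segment and recognize one page; returns a list of line strings.

    Assumes the call sequence from the comment above; check the API docs
    for the kraken version you have installed.
    """
    # Imported lazily so the rest of the sketch works without kraken.
    from PIL import Image
    from kraken import binarization, pageseg, rpred

    im = Image.open(image_path)
    bw = binarization.nlbin(im)      # binarize the page image
    bounds = pageseg.segment(bw)     # locate the text lines
    # rpred.rpred yields one record per line; `.prediction` is assumed
    # to carry the recognized text of that line.
    return [record.prediction for record in rpred.rpred(model, bw, bounds)]


def recognize_document(image_paths, page_fn):
    """Apply page_fn to every page in sorted order.

    Returns a list of (path, result) tuples, one per page.
    """
    return [(path, page_fn(path)) for path in sorted(image_paths)]


# Possible usage (model file name is hypothetical):
#   import glob
#   from kraken.lib import models
#   model = models.load_any("model.mlmodel")
#   pages = recognize_document(glob.glob("pages/*.tif"),
#                              lambda p: recognize_page(p, model))
```

Loading the model once and reusing it for every page avoids paying the load cost per image. Sorting is plain string order, so zero-padded file names (page_001.tif, page_002.tif, ...) keep the pages in sequence.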

amitdo commented 7 years ago

> I haven't seen any multi-page hOCR files in the wild.

Tesseract supports multi-page TIFFs and lists of files of any image type*. It writes them to a single txt/pdf/hocr output.

* Any format that Leptonica supports.

Here is a related issue: https://github.com/tesseract-ocr/tesseract/issues/928

zuphilip commented 7 years ago

Related in ocr-fileformat-samples: https://github.com/kba/ocr-fileformat-samples/pull/8

mittagessen commented 7 years ago

I know that Tesseract and ocropy can do so; I just haven't seen any documents in the wild (e.g. from libraries) that aren't one page per file.

zuphilip commented 7 years ago

E.g. https://archive.org/details/siopsecretusplan0000prin provides a multi-page (zipped) ABBYY file.

However, it is also possible to use hocr-combine afterwards to merge several hOCR files into a single hOCR file.

parkerhancock commented 7 years ago

Thanks for the responses! I'll try the function calls you mentioned and just integrate it directly.

And yes, I saw in the docs that you can pass multiple input documents to the CLI, but the documents I'm dealing with are 30+ pages long, so it's not that practical. I mean, I suppose I could throw all the TIFFs into the CLI with a call to subprocess.run, but it feels cumbersome. So I still think that glob strings would be a useful feature.
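The subprocess.run workaround mentioned above could look roughly like this. It is a minimal sketch, assuming the `-i input output` pairs described earlier in the thread and a `binarize segment ocr` subcommand chain. The helper names `build_kraken_command` and `run_kraken` are made up, and whatever flag selects hOCR serialization is deliberately left out because it depends on the installed version; the .hocr suffix here only names the output files.

```python
import glob
import subprocess
from pathlib import Path


def build_kraken_command(inputs, suffix=".hocr",
                         steps=("binarize", "segment", "ocr")):
    """Return an argv list with one `-i input output` pair per page.

    `steps` is an assumed subcommand chain; adjust it (and add any
    output-format flag) to match the installed kraken version.
    """
    cmd = ["kraken"]
    for src in sorted(inputs):
        # Output file sits next to the input, with the extension swapped.
        out = str(Path(src).with_suffix(suffix))
        cmd += ["-i", src, out]
    return cmd + list(steps)


def run_kraken(pattern):
    """Expand a glob pattern and run kraken once over all matching pages."""
    pages = glob.glob(pattern)
    if not pages:
        raise FileNotFoundError("no files match " + repr(pattern))
    # check=True raises CalledProcessError if kraken exits non-zero.
    return subprocess.run(build_kraken_command(pages), check=True)
```

Calling `run_kraken("scans/*.tif")` (path hypothetical) would then process every page in one invocation. Passing the arguments as a list rather than a shell string means file names with spaces need no quoting.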