Closed parkerhancock closed 7 years ago
Oh, the CLI supports multi-page input. You can pass more than one -i input.tif output.hocr option to recognize multiple pages. It doesn't support serializing multiple input files into a single output file, though (ALTO doesn't even allow it), and I haven't seen any multi-page hOCR files in the wild.
One issue with glob-like input that I wanted to avoid is the inability to explicitly define inputs and outputs. The current syntax is rather verbose, but I frankly haven't found a better way yet.
There should be autogenerated (from docstrings) API documentation on http://kraken.re. The whole shebang is basically calling binarization.nlbin, pageseg.segment, lib.models.load_any (loading the model), and feeding everything into rpred.rpred, which returns an iterator over all the lines.
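For anyone landing here later, the call chain above could be sketched roughly like this. It is a sketch under assumptions, not the documented API: the argument order, the shape of the segmenter's return value, and the `.prediction` attribute on each record should be checked against the docstrings of the installed kraken version. `make_kraken_recognizer` and `recognize_pages` are hypothetical helper names, not part of kraken.

```python
from typing import Callable, Dict, Iterable, List


def make_kraken_recognizer(model_path: str) -> Callable[[str], List[str]]:
    """Return a function that OCRs one image file into a list of text lines.

    Assumption: the four kraken calls behave as described in the thread;
    verify signatures against your installed version.
    """
    # Imported lazily so this module loads even where kraken is not installed.
    from PIL import Image
    from kraken import binarization, pageseg, rpred
    from kraken.lib import models

    network = models.load_any(model_path)  # load the model once, reuse per page

    def recognize(image_path: str) -> List[str]:
        im = Image.open(image_path)
        bw = binarization.nlbin(im)    # binarize the page image
        bounds = pageseg.segment(bw)   # locate the text lines
        # rpred.rpred yields one record per line; `.prediction` is assumed
        # to hold the recognized text of that line.
        return [rec.prediction for rec in rpred.rpred(network, bw, bounds)]

    return recognize


def recognize_pages(
    image_paths: Iterable[str],
    recognize: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run `recognize` over every page in sorted order, keyed by file path."""
    return {path: recognize(path) for path in sorted(image_paths)}
```

Usage would be something like `recognize_pages(glob.glob("scan/*.tif"), make_kraken_recognizer("model.pyrnn.gz"))`, where both the glob pattern and the model file name are placeholders.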
I haven't seen any multi-page hOCR files in the wild.
Tesseract supports multi-page tiff / list of files of any image type*. It outputs to a single txt/pdf/hocr.
* Any format that Leptonica supports.
Here is a related issue: https://github.com/tesseract-ocr/tesseract/issues/928
Related in ocr-fileformat-samples: https://github.com/kba/ocr-fileformat-samples/pull/8
I know that tesseract and ocropy can do so; I just haven't seen any documents in the wild (e.g. from libraries) that aren't one page per file.
E.g. https://archive.org/details/siopsecretusplan0000prin provides a multi-page (zipped) ABBYY file.
However, it is also possible to use hocr-combine to merge several hOCR files into a single hOCR file afterwards.
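If you would rather do the merge from Python, the idea can be sketched with the standard library. This is not how hocr-combine is implemented, just a minimal illustration: it assumes each input is well-formed XHTML without undefined entities, it appends every element carrying the ocr_page class to the body of the first document, and it does not renumber ids. `combine_hocr` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

XHTML_NS = "http://www.w3.org/1999/xhtml"


def _local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def combine_hocr(documents: Sequence[str]) -> str:
    """Merge the ocr_page elements of several hOCR strings into the first one."""
    if not documents:
        raise ValueError("need at least one hOCR document")
    # Serialize XHTML elements without an 'ns0:' prefix.
    ET.register_namespace("", XHTML_NS)
    roots = [ET.fromstring(doc) for doc in documents]
    first = roots[0]
    body = next(el for el in first.iter() if _local(el.tag) == "body")
    for root in roots[1:]:
        for el in root.iter():
            # An element may carry several classes; look for ocr_page among them.
            if "ocr_page" in el.get("class", "").split():
                body.append(el)
    return ET.tostring(first, encoding="unicode")
```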
Thanks for the responses! I'll try the function calls you mentioned to just integrate it directly.
And yes, I saw in the docs that you could pass multiple input documents to the CLI, but the docs I'm dealing with are 30+ pages long - so it's not that practical. I mean, I suppose I could throw all the tiffs into the CLI with a call to subprocess.run, but it feels cumbersome. So, I still think that glob strings would be a useful feature.
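In the meantime, the subprocess route can be made less painful by expanding the glob yourself and generating the explicit -i pairs. A rough sketch, with assumptions: `build_kraken_command` is a hypothetical helper, the trailing subcommands (binarize, segment, ocr) are my guess at a typical pipeline, and whichever flag selects hOCR output is deliberately left out because its placement depends on the kraken version.

```python
import glob
import os
from typing import List, Sequence


def build_kraken_command(
    pattern: str,
    out_dir: str,
    steps: Sequence[str] = ("binarize", "segment", "ocr"),  # assumed pipeline
    suffix: str = ".hocr",
) -> List[str]:
    """Expand a glob into one explicit '-i input output' pair per page."""
    cmd = ["kraken"]
    # Plain lexicographic sort: zero-pad page numbers (page-01, page-02, ...)
    # or the page order will be wrong past page 9.
    for page in sorted(glob.glob(pattern)):
        stem = os.path.splitext(os.path.basename(page))[0]
        cmd += ["-i", page, os.path.join(out_dir, stem + suffix)]
    cmd += list(steps)
    return cmd


# Possible usage (paths are placeholders):
#   import subprocess
#   subprocess.run(build_kraken_command("scan/*.tif", "out"), check=True)
```

This keeps the inputs and outputs explicit on the command line, which was the concern with glob support in the CLI itself, while sparing you from typing 30+ pairs by hand.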
Friendly suggestion - I love that Kraken supports python3 and is fairly lightweight on dependencies, but what is starting to be a deal-breaker is the lack of support for multiple page input. My workflow (which I suspect is fairly common) is to take a PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into an hOCR document (and then on to NLP-land).
Ocropy accepts glob characters (? and * wildcards) to handle multiple pages of input, and can generate a consolidated hOCR file for the whole document. As far as I (and probably a lot of people) am concerned, these are must-have features.
So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API so that it can be used from Python code, without the CLI, for multiple page documents.
Maybe 2 already exists in some form or fashion, with some selective imports/etc. But on cursory review, it's tough to pick out the pieces.
Thanks!