houqp / leptess

Productive and safe Rust binding for leptonica and tesseract
https://houqp.github.io/leptess/leptess/index.html
MIT License

Multi-page support (TIFF) #43

Open darklajid opened 2 years ago

darklajid commented 2 years ago

Hey.

Most OCR work I've seen so far uses (b/w, CCITT-compressed) multi-page documents. I'd like to make these work with leptess, but it seems (unless I'm missing something?) that there's only support for Pix (not PixA), and no mapping for direct TIFF I/O (say, pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to set_image a Pix directly, but that would be the most trivial thing to change.

In other words: I was hoping for a Rust (leptess) workflow that allows

Is that something you'd be willing to support? Am I missing a way to make this work today already? I could offer to look into this, but I admit that I'm a Rust beginner at this point in time.

ccouzens commented 2 years ago

Hey, I might be able to look at this, but it wouldn't be until next weekend.

I think this might be possible today using set_image_from_mem and the image crate, but I haven't tried it.

Some notes for myself:
https://tpgit.github.io/Leptonica/pix_8h_source.html#l00363
https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/tiffiostub.c#L99-L103

darklajid commented 2 years ago

Thanks a ton for the reply. Looking at the linked image crate and into_bytes, it probably should NOT copy for this to be a decent workaround? Otherwise, my naive understanding is that the image would be read once, then copied for each page (and then re-read by Leptonica anyway).

Leptonica does provide the required functionality already, right? PixA is a collection of Pix (an "A"rray of Pix) that allows access to the individual entries, which could be passed to tess_api.set_image directly if that were exposed in the high-level LepTess. This is already what's happening in set_image_from_mem anyway: reading a buffer into a Pix, then handing that to tesseract.

My armchair idea - and I would be willing to help where I can - is therefore that

In this case there would be no need for another crate and it would probably avoid re-reading (and potentially copying) the image(s) around?
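To make the intended flow concrete, here is a minimal sketch of "decode the document once, then hand each page to the OCR engine by reference". Everything in it is a hypothetical stand-in: `Page`, `OcrEngine`, `EchoEngine` and `ocr_all_pages` are illustrative names, not part of leptess, leptonica-plumbing or tesseract. It only models the ownership pattern, not real image decoding or recognition.

```rust
// Hypothetical stand-ins: these are NOT the leptess or leptonica-plumbing
// types. They only model "decode once, then OCR page by page".

/// Stand-in for one decoded page (the role Leptonica's Pix plays).
pub struct Page {
    pub label: String,
}

/// Stand-in for the OCR engine (the role tesseract plays).
pub trait OcrEngine {
    /// Borrow the page; the caller keeps ownership of it.
    fn set_image(&mut self, page: &Page);
    /// Return the recognised text for the page that was set most recently.
    fn text(&mut self) -> String;
}

/// Run OCR over every page of an already-decoded document.
pub fn ocr_all_pages<E: OcrEngine>(engine: &mut E, pages: &[Page]) -> Vec<String> {
    pages
        .iter()
        .map(|page| {
            engine.set_image(page);
            engine.text()
        })
        .collect()
}

/// Toy engine used only to exercise the loop; it echoes the page label.
#[derive(Default)]
pub struct EchoEngine {
    current: Option<String>,
}

impl OcrEngine for EchoEngine {
    fn set_image(&mut self, page: &Page) {
        self.current = Some(page.label.clone());
    }

    fn text(&mut self) -> String {
        self.current.clone().unwrap_or_default()
    }
}
```

In a real binding, the slice of pages would presumably come from something like Leptonica's pixaReadMultipageTiff, and the engine would be the tesseract API wrapper; both are assumptions here, not confirmed signatures.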

ccouzens commented 2 years ago

Hi,

I haven't forgotten about this.

I'm going to try to get to this step tonight:

the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
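As a rough illustration of that step, the sketch below shows how a read-only wrapper that exposes only a count and indexed access is already enough to derive an iterator. `Pix` and `PixArray` here are hypothetical stand-ins backed by a plain `Vec`, not the leptonica-plumbing types and not an FFI binding; the method names are assumptions chosen for the example.

```rust
// Hypothetical stand-ins, not the leptonica-plumbing API.

/// Stand-in for a single page image.
#[derive(Debug, PartialEq)]
pub struct Pix {
    pub width: u32,
    pub height: u32,
}

/// Read-only stand-in for PixA: it owns the pages and only lends them out.
pub struct PixArray {
    pages: Vec<Pix>,
}

impl PixArray {
    pub fn new(pages: Vec<Pix>) -> Self {
        PixArray { pages }
    }

    /// Number of pages in the array.
    pub fn count(&self) -> usize {
        self.pages.len()
    }

    /// Borrow page `index`, or `None` if it is out of range.
    pub fn get(&self, index: usize) -> Option<&Pix> {
        self.pages.get(index)
    }

    /// Iterator built purely from `count` and `get`, so it needs no
    /// extra support from the underlying library.
    pub fn iter(&self) -> impl Iterator<Item = &Pix> + '_ {
        (0..self.count()).filter_map(move |i| self.get(i))
    }
}
```

Because the pages are only ever borrowed, a wrapper shaped like this would not need to expose any mutation or insertion, which keeps the amount of glue code small.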

ccouzens commented 2 years ago

You may be interested in this PR. GitHub won't let me assign you as a reviewer.

https://github.com/ccouzens/leptonica-plumbing/pull/2