houqp / leptess

Productive and safe Rust binding for leptonica and tesseract
https://houqp.github.io/leptess/leptess/index.html
MIT License

Output to PDF #45

Closed Innominus closed 2 years ago

Innominus commented 2 years ago

Hi all.

I'm working on a project that ingests PDFs, converts them to images, OCRs them with Tesseract, and writes the result back out as a PDF. I've done this in Python, where converting to an image and getting a PDF with readable text out of Tesseract was a little simpler. I'm now trying to do the same thing in Rust with the leptess library. My problem is that there don't seem to be any high-level APIs for exporting to PDF once the document has been OCR'd, and looking at the C APIs only confuses me further, so I'm unsure where to start.

Would it be possible to add high-level APIs to leptess for exporting PDFs, or do you have a code snippet showing how to use the C APIs to export to PDF?

Thanks!

Innominus commented 2 years ago

One more note on this: I don't just want to write the PDF to disk. I have multiple PDFs that need to be OCR'd, output again as PDFs, and then combined, so ideally I'd get each PDF back as bytes.

ccouzens commented 2 years ago

Hi @Innominus,

Sorry for being slow to respond. I've been busy recently.

From what I can see, there isn't a particular C API that converts to a PDF.

Other solutions are probably using a method like get_lstm_box_text to work out where the text sits within the original PDF, and then manipulating the original PDF to add text boxes in those areas.
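As a rough illustration of the first half of that idea, here is a minimal sketch that turns box text into positioned glyphs. It assumes each line has the form `<symbol> <left> <bottom> <right> <top> <page>`, with pixel coordinates measured from the bottom-left of the image. That layout is an assumption about the box output, so check it against what get_lstm_box_text actually returns. `GlyphBox`, `parse_box_line`, and `parse_box_text` are hypothetical names made up for this sketch and are not part of leptess.

```rust
/// One recognised symbol and its bounding box, in image pixels.
/// Hypothetical type for this sketch; not part of leptess.
#[derive(Debug, PartialEq)]
struct GlyphBox {
    symbol: String,
    left: i32,
    bottom: i32,
    right: i32,
    top: i32,
    page: i32,
}

/// Parse a single box line.
/// Assumed layout: `<symbol> <left> <bottom> <right> <top> <page>`.
/// Returns `None` for lines that do not match that layout.
fn parse_box_line(line: &str) -> Option<GlyphBox> {
    // Split from the right so that a symbol which is itself a space
    // or a tab is kept intact as the final remainder.
    let mut fields = line.rsplitn(6, ' ');
    let page = fields.next()?.parse().ok()?;
    let top = fields.next()?.parse().ok()?;
    let right = fields.next()?.parse().ok()?;
    let bottom = fields.next()?.parse().ok()?;
    let left = fields.next()?.parse().ok()?;
    let symbol = fields.next()?.to_string();
    Some(GlyphBox { symbol, left, bottom, right, top, page })
}

/// Parse a whole block of box text, silently skipping malformed lines.
fn parse_box_text(text: &str) -> Vec<GlyphBox> {
    text.lines().filter_map(parse_box_line).collect()
}
```

The resulting boxes would still need to be scaled from image pixels to PDF points (using the resolution the page was rasterised at) before a PDF library could place invisible text at those positions. That second half is not shown here.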

I suspect this would be best done as a separate project in a separate crate.

If you think Tesseract's library can do this more directly, let me know and I'll try to look further.

Sorry this isn't a particularly helpful answer. Best of luck,

Chris