robertknight / ocrs

Rust library and CLI tool for OCR (extracting text from images)
Apache License 2.0
1.09k stars 44 forks

Is PDF / DOCX support on the roadmap? #37

Open wdoppenberg opened 6 months ago

wdoppenberg commented 6 months ago

I know this is not trivial since I've been unsuccessful in finding any PDF->image Rust library, but is this something you plan on supporting in the future?

If help is needed please let me know.

robertknight commented 6 months ago

Ocrs could potentially integrate with existing libraries or CLI tools for rendering PDFs somehow. It could also serve as a backend for a project like OCRmyPDF. What use case did you have in mind?

tomtom215 commented 4 months ago

I would love PDF support, as I need to batch load invoice PDFs and extract the text data, which can then be saved as JSON to a DB.

woidda commented 4 months ago

@tomtom215 fwiw, you can try preprocessing your PDFs with pdf2image, which works pretty well.
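
For the batch invoice case, a rough sketch of that preprocessing approach might look like the following. To stay within the Python standard library it calls Poppler's `pdftoppm` directly (the tool pdf2image wraps) instead of pdf2image itself, then runs the `ocrs` CLI on each rendered page. It assumes both `pdftoppm` and `ocrs` are on `PATH` and that `ocrs <image>` prints the recognized text to stdout. The function names, the 200 DPI default, and the JSON record layout are illustrative choices, not part of ocrs.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf: Path, prefix: Path, dpi: int = 200) -> list[str]:
    """Build a Poppler `pdftoppm` command that writes one PNG per page."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_command(image: Path) -> list[str]:
    """Build an `ocrs` command (assumed to print recognized text to stdout)."""
    return ["ocrs", str(image)]


def page_number(image: Path) -> int:
    """Extract the page number from a name like `page-03.png`."""
    # pdftoppm names its outputs "<prefix>-<page>.png" and may zero-pad
    # the page number, so sort numerically rather than by string.
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, dpi: int = 200) -> dict:
    """Render every page of `pdf` to PNG and OCR each page with ocrs."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf, prefix, dpi), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        texts = [
            subprocess.run(
                ocr_command(page), check=True, capture_output=True, text=True
            ).stdout
            for page in pages
        ]
    # Hypothetical record layout: one entry per PDF, one string per page.
    return {"file": pdf.name, "pages": texts}


def ocr_directory(folder: Path) -> str:
    """OCR every PDF in `folder` and return the results as a JSON string."""
    records = [ocr_pdf(pdf) for pdf in sorted(folder.glob("*.pdf"))]
    return json.dumps(records, indent=2)
```

Calling `ocr_directory(Path("invoices"))` would return a JSON string that could then be written to a database. Rendering at a higher DPI generally helps with small print at the cost of slower OCR, so the default here is only a starting point.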