simon-knuth / scanner

An all-in-one scanner app built for the Universal Windows Platform

https://simon-knuth.github.io/scanner/index

Mozilla Public License 2.0

487 stars 28 forks

Add OCR for PDFs #80

Closed larsschellhas closed 2 years ago

larsschellhas commented 2 years ago

Is your feature request related to a problem? Please describe.
Scanned PDFs cannot be searched because no OCR text is added to them.

Describe the solution you'd like
I would like scanned PDFs to include the recognized text, so that the PDF is searchable and text can be copied directly from it.

Describe alternatives you've considered
There are other apps that convert scanned PDFs into PDFs with the text included.

Additional context
This might be a bit beyond the scope that you have planned for your (really awesome!!!) app, but it would still be a real killer feature :)

simon-knuth commented 2 years ago

Getting the text and its positioning isn't a big deal, but the final part seems problematic: If I'm not mistaken, the app would need to add a transparent text layer on top of the scanned text. The letters on the layer would have to be a one-to-one match for text selections/highlights to make sense, which is where I'd expect some major difficulties depending on the font properties.
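To illustrate the positioning part, here is a minimal sketch, in Python purely for illustration (it is not tied to the app's UWP code base), of mapping an OCR word box from image pixels to PDF coordinates. `WordBox`, `to_pdf_rect`, and `horizontal_scale_percent` are hypothetical names, and the scan resolution is assumed to be known. Stretching the invisible text horizontally to the width of the scanned word is one possible way to keep selections aligned regardless of the original font.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space uses units of 1/72 inch


@dataclass
class WordBox:
    """Hypothetical OCR result: a word and its pixel box (origin at top-left)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def px_to_pt(pixels: float, dpi: float) -> float:
    """Convert a pixel length to PDF points for a scan made at the given DPI."""
    return pixels * POINTS_PER_INCH / dpi


def to_pdf_rect(box: WordBox, image_height_px: int, dpi: float):
    """Return (x, y, width, height) in PDF points.

    PDF user space has its origin at the bottom-left of the page,
    so the vertical axis is flipped relative to image coordinates.
    """
    x = px_to_pt(box.left, dpi)
    # Bottom edge of the word, measured upwards from the bottom of the page.
    y = px_to_pt(image_height_px - (box.top + box.height), dpi)
    return (x, y, px_to_pt(box.width, dpi), px_to_pt(box.height, dpi))


def horizontal_scale_percent(target_width_pt: float, natural_width_pt: float) -> float:
    """Horizontal scaling (in percent) that makes text of a given natural
    width span exactly the width of the scanned word."""
    if natural_width_pt <= 0:
        raise ValueError("natural_width_pt must be positive")
    return 100.0 * target_width_pt / natural_width_pt
```

For a word at pixel position (300, 600) with size 600×150 on a 3300-pixel-high page scanned at 300 DPI, `to_pdf_rect` places the box at 72 pt from the left and 612 pt from the bottom, with a size of 144×36 pt.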

In any case, I really want to avoid touching the PDF implementation as long as the app is still based on UWP instead of the Windows App SDK, so I probably won't work on this any time soon. Still, thank you very much for the suggestion!

sschuberth commented 2 years ago

Maybe it's worth looking at how https://github.com/cyanfish/naps2 does OCR?

larsschellhas commented 1 year ago

@simon-knuth I believe there might be a simpler solution for OCR. I've been looking through a couple of Python packages for OCR, and they all use https://github.com/tesseract-ocr/tesseract, which is provided as a C++ library.

Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".

Tesseract supports various image formats including PNG, JPEG and TIFF.

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV and ALTO (the last one - since version 4.1.0).
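As a rough illustration of the PDF output formats, here is a small Python sketch that calls the Tesseract command-line tool. It assumes the `tesseract` executable is installed and on the PATH. The function names and file names are hypothetical, and this is not how the app itself would have to integrate the library.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, language="eng", text_only=False):
    """Build the argument list for a Tesseract call that writes <output_base>.pdf."""
    command = ["tesseract", str(image_path), str(output_base), "-l", language]
    if text_only:
        # textonly_pdf=1 requests a PDF that contains only the invisible text layer.
        command += ["-c", "textonly_pdf=1"]
    # The trailing "pdf" config selects PDF output instead of plain text.
    command.append("pdf")
    return command


def ocr_to_pdf(image_path, output_base, language="eng", text_only=False):
    """Run Tesseract on one image and return the path of the resulting PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_pdf_command(image_path, output_base, language, text_only)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.pdf")
```

Calling `ocr_to_pdf("scan.png", "scan")` would be expected to produce `scan.pdf` with the image and a searchable text layer; with `text_only=True`, the result should contain only the invisible text, which could then be merged with an existing scanned page.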

It would be great if this functionality could be incorporated after all - or at least keep this issue open for now :)