jlegewie / zotfile

Zotero plugin to manage your attachments: automatically rename, move, and attach PDFs (or other files) to Zotero items, sync PDFs from your Zotero library to your (mobile) PDF reader (e.g. an iPad, Android tablet, etc.), and extract PDF annotations.

OCR pdfs through right click menu #85

Open jlegewie opened 11 years ago

jlegewie commented 11 years ago

Add an "OCR PDF File" menu item that uses a free online OCR service or a Python library.

e.g. http://free-online-ocr.com/ https://github.com/Pankrat/pdf-ocr-overlay

jlegewie commented 11 years ago

http://apple.stackexchange.com/questions/76471/make-existing-pdf-searchable-ocr-via-command-line-script https://github.com/Pankrat/pdf-ocr-overlay https://pypi.python.org/pypi/Products.PDFtoOCR/1.1

janbaykara commented 5 years ago

@jlegewie would you accept a pull request that achieves this?

jlegewie commented 5 years ago

Would be a great feature but let's say I am reluctant. First, it depends a little on the implementation. What are your thoughts about that? Second, I basically have no time for zotfile these days and in my experience significant new features create work and bugs down the line that I won't be able to fix. So it might be a better option to implement this as a separate plugin. But again, it would be good to hear about your thoughts on implementation first.

janbaykara commented 5 years ago

The separate plugin might be a smart idea. I guess it'd be simplest to add a right-click option (Scan for readable text or something) that runs this library and overwrites the file.
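A minimal sketch of the "run OCR and overwrite the file" step, assuming the OCRmyPDF command-line tool (named later in this thread) is installed and on the PATH. The function names `build_ocr_command` and `ocr_in_place` are hypothetical, and this is not code from zotfile or any existing plugin. The OCR result goes to a temporary file first, so the original is only replaced if OCR succeeds:

```python
import os
import subprocess
import tempfile
from pathlib import Path


def build_ocr_command(src: Path, dst: Path) -> list[str]:
    # Basic CLI form is `ocrmypdf INPUT OUTPUT`; --skip-text leaves pages
    # that already have a text layer untouched.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_in_place(pdf_path, run=subprocess.run) -> None:
    """OCR a PDF and overwrite it, keeping the original if OCR fails.

    `run` is injectable (hypothetical design choice) so the function can be
    exercised without the real binary.
    """
    src = Path(pdf_path)
    # Create the temp file next to the source so os.replace stays on one
    # filesystem and the swap is atomic.
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=src.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        run(build_ocr_command(src, tmp), check=True)
        os.replace(tmp, src)
    finally:
        # No-op after a successful replace; cleans up after a failure.
        tmp.unlink(missing_ok=True)
```

A right-click handler would then only need to call `ocr_in_place(path)` for each selected attachment.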

jlegewie commented 5 years ago

I didn’t look in detail, but that requires a binary. Zotero and zotfile both have code for downloading and updating binaries (zotfile’s is mostly copied from Zotero), so it would probably be useful to build on that.

trenkert commented 5 years ago

OCRmyPDF is very reliable and uses Tesseract. It would also be great to include it as an automatic option when the PDF metadata lookup fails: "no ocr text found" -> run ocrmypdf automatically and rerun the metadata lookup.
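The fallback flow described here could be sketched roughly as follows. Both `lookup` and `ocr` are hypothetical callables supplied by the caller: a metadata lookup that returns `None` when it finds no usable text, and a function that runs OCR on the file. Neither is an existing Zotero or zotfile API.

```python
from typing import Callable, Optional


def lookup_with_ocr_fallback(
    pdf_path: str,
    lookup: Callable[[str], Optional[dict]],
    ocr: Callable[[str], None],
) -> Optional[dict]:
    """Try the metadata lookup; if it fails, OCR the file and retry once."""
    metadata = lookup(pdf_path)
    if metadata is not None:
        return metadata  # Lookup worked, no OCR needed.
    ocr(pdf_path)  # Assumed to add a text layer to the PDF in place.
    return lookup(pdf_path)  # Single retry; may still return None.
```

Retrying only once keeps the flow from looping on scans that OCR cannot rescue.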

psyguy commented 4 years ago

UB-Mannheim/zotero-ocr is a Zotero plugin that OCRs PDFs using Tesseract.

(I have not used it myself, though.)