Closed proxseas closed 4 years ago
Edit: it looks like embedding the OCR'd text back into the PDF is not in scope for this project. I really think this should be documented clearly somewhere - perhaps in the project's GitHub home page?
Thanks for the feedback. We are open to pull requests, if you'd like to contribute.
@proxseas Please let me know if you find a solution. I just got Paperless up and running and was surprised to see the OCR results aren't integrated back into the PDF.
@nirvdrum @Tooa I started using OCRmyPDF ([link](https://github.com/jbarlow83/OCRmyPDF)) to get around this shortcoming. I think it has slightly less functionality, but it has the key feature that I want - writing text back into the PDF. I highly recommend checking it out.
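In case it helps, a minimal invocation looks roughly like this (the file names are just examples; see `ocrmypdf --help` for the full option list):

```shell
# Rewrite a scanned PDF with a searchable text layer added.
# --skip-text leaves pages that already contain text untouched.
in=scan.pdf
out=scan_ocr.pdf
if command -v ocrmypdf >/dev/null 2>&1; then
    ocrmypdf --skip-text "$in" "$out" || echo "OCR failed - does $in exist?"
fi
```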
@proxseas Thanks for the info. Are you running that as a post-processing step? Or did you stop using paperless entirely?
@nirvdrum I stopped using paperless entirely. I actually run OCRmyPDF manually from the command line on all the files that need it, but I could easily set this up in a more automated way with 'inotifywait' on Linux, or whatever the equivalent is on Windows.

Ultimately, I decided that paperless is significant overkill for me, and not having text in the PDF is just one more annoyance. I figured that if I have a good file/directory structure, then I've won half the battle: I can search for the file or directory name. Then there's searching the actual contents of the PDFs using pdfgrep, or one of the many GUI full-text search applications like Agent Ransack on Windows or DocFetcher on Linux. Between these several approaches, I should be able to find any document very quickly. Finally, there is tagging. It's a low priority for me, but I found a neat application called "TagSpaces", whose free version might suffice.

I suppose what I really like about the approach I ultimately took is that I have just plain files sitting on my hard drive. No magic. No all-encompassing program that takes many incantations to get working (even if I did ultimately figure out how to run paperless). I think it's too easy to overcomplicate this whole thing, and offerings like paperless, Mayan EDMS, or even the super impressive Teedy.IO are simply overkill when the primary consumer of the documents I'm organizing is myself. If you have multiple people, especially less tech-savvy ones, then setting up these fancier applications might be worthwhile (so your end users don't have to type any scary commands). Hope this helped.
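The inotifywait automation mentioned above can be sketched roughly like this (the directory names and OCRmyPDF flags are illustrative, and it assumes inotify-tools and ocrmypdf are installed):

```shell
#!/bin/sh
# Watch a drop folder and OCR each newly written PDF into an output folder.
WATCH_DIR=./inbox
OUT_DIR=./ocr

# Derive the output path for a given input file, e.g.
# ./inbox/statement.pdf -> ./ocr/statement.ocr.pdf
out_path() {
    echo "$OUT_DIR/$(basename "$1" .pdf).ocr.pdf"
}

watch_loop() {
    inotifywait -m -e close_write --format '%w%f' "$WATCH_DIR" |
    while read -r f; do
        case "$f" in
            *.pdf) ocrmypdf --skip-text "$f" "$(out_path "$f")" ;;
        esac
    done
}

# Uncomment to start watching (blocks until interrupted):
# watch_loop
```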
Hi @proxseas, you should try https://github.com/cmccambridge/ocrmypdf-auto - a Docker container that does exactly what you are doing manually. I run it in front of my paperless server, since its OCR process is much faster and more accurate than the one implemented in paperless (the OCR output folder is the paperless input folder). It's also a good way to avoid keeping paperless busy and locking the database while processing files. :)
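For anyone wiring this up, the setup described might look something like the following. The host paths and container mount points are my assumptions - check the ocrmypdf-auto README for the actual volume names:

```shell
# Point ocrmypdf-auto's output at paperless's consume directory, so every
# OCR'd file is picked up by paperless automatically. Paths are examples.
SCANS=/srv/scans               # where new scans arrive
CONSUME=/srv/paperless/consume # the directory paperless watches

docker_cmd="docker run -d -v $SCANS:/input -v $CONSUME:/output cmccambridge/ocrmypdf-auto"
# Start the container with:
# eval "$docker_cmd"
```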
It seems that OCR is actually being performed correctly, but the resulting PDF(s) do not have any text in them. For the time being, I'm simply invoking the consumer part of the script, like so:
The consumer picks up any new files and seems to process them correctly. Here is some recent output:
Now my 'media' dir contains this PDF: 0000002.pdf, which has NO OCR'd text and is 100% identical to the input file.
As a test, I went on to run the exporter, like so:
docker exec -it paperless-consumer ./manage.py document_exporter /export
This successfully exports all of paperless' documents into my locally mapped export dir. These files are:
So it seems like the step that should embed the OCR'd text into the output PDF files is failing quietly? For context, here are the contents of a sample 'manifest.json' file:
Please advise! I've spent a lot of time trying to get this working and I think I'm close, but it's not there just yet.