the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0

No text in the output PDF - only in manifest.json #681

Closed proxseas closed 4 years ago

proxseas commented 4 years ago

It seems that OCR is actually being performed correctly, but the resulting PDF(s) do not have any text in them. For the time being, I'm simply invoking the consumer part of the script, like so:

-v /mnt/c/docker_shared_dirs/paperless_export_dir:/export \
-v /mnt/c/docker_shared_dirs/paperless_data:/usr/src/paperless/data \
-v /mnt/c/docker_shared_dirs/paperless_media:/usr/src/paperless/media \
-v /mnt/c/docker_shared_dirs/paperless_consume_dir:/consume \
-e "USERMAP_UID=1000" -e "USERMAP_GID=1000" thepaperlessproject/paperless document_consumer

The consumer picks up any new files and seems to process them correctly. Here is some recent output:

Consuming /consume/reu.pdf
** Processing: /tmp/paperless/paperless-iuh0uxsq/convert.png
410x354 pixels, 3x16 bits/pixel, RGB
Input IDAT size = 236340 bytes
Input file size = 236481 bytes

Trying:
  zc = 9  zm = 9  zs = 0  f = 0         IDAT size = 175188
  zc = 9  zm = 8  zs = 0  f = 0         IDAT size = 175185

Selecting parameters:
  zc = 9  zm = 8  zs = 0  f = 0         IDAT size = 175185

Output file: /tmp/paperless/paperless-iuh0uxsq/optipng.png

Output IDAT size = 175185 bytes (61155 bytes decrease)
Output file size = 175242 bytes (61239 bytes = 25.90% decrease)

Processing sheet #1: /tmp/paperless/paperless-iuh0uxsq/convert-0000.pnm -> /tmp/paperless/paperless-iuh0uxsq/convert-0000.unpaper.pnm
[pgm_pipe @ 0x56269b97fcc0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x56269b981100] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x56269b981100] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
Unable to detect date for document
Completed
Document 20200615181138: reu consumption finished

Now I have this PDF in my 'media' dir: 0000002.pdf, which contains NO OCR'd text and is 100% identical to the input file.
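
For what it's worth, a quick way to confirm that a PDF really has no text layer (pdftotext and pdffonts come with poppler-utils; they are not part of paperless):

# dump any embedded text to stdout - an image-only scan prints essentially nothing
pdftotext 0000002.pdf -
# list embedded fonts - an image-only PDF typically reports none
pdffonts 0000002.pdf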

As a test, I went on to run the exporter, like so: docker exec -it paperless-consumer ./manage.py document_exporter /export

This successfully exports all of paperless' documents into my locally mapped export dir (the exported PDFs, their thumbnails, and a manifest.json).

So it seems like maybe some step is quietly failing to embed the OCR'd text into the output PDF files? For context, here are the contents of a sample 'manifest.json' file:

[
  {
    "model": "documents.document",
    "pk": 2,
    "fields": {
      "correspondent": null,
      "title": "ni",
      "content": "In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Paul W. Handel\nwho obtained a US patent on OCR in USA in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek\nwas also granted a US patent on his method (U.S. Patent 2,026,329). Tauschek's machine was a\nmechanical device that used templates and a photodetector.\n\nIn 1949 RCA engineers worked on the first primitive computer-type OCR to help blind people\nfor the US Veterans Administration, but instead of converting the printed characters to machine\nlanguage, their device converted it to machine language and then spoke the letters. It proved far\ntoo expensive and was not pursued after testing.!\"!\n\nIn 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United\nStates, addressed the problem of converting printed messages into machine language for\ncomputer processing and built a machine to do this, reported in the Washington Daily News on\n27 April 1951 and in the New York Times on 26 December 1953 after his U.S. Patent 2,663,758\nwas issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which\nwent on to deliver the world's first several OCR systems used in commercial operation.\n\nIn 1955, the first commercial system was installed at the Reader's Digest. The second system\nwas sold to the Standard Oil Company for reading credit card imprints for billing purposes.\nOther systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell\nTelephone Company and a page scanner to the United States Air Force for reading and\ntransmitting by teletype typewritten messages. IBM and others were later licensed on Shepard's\nOCR patents.\n\nIn about 1965, Reader's Digest and RCA collaborated to build an OCR Document reader\ndesigned to digitise the serial numbers on Reader's Digest coupons returned from advertisements.\nThe fonts used on the documents were printed by an RCA Drum printer using the OCR-A font.\nThe reader was connected directly to an RCA 301 computer (one of the first solid state\ncomputers). This reader was followed by a specialised document reader installed at TWA where\nthe reader processed Airline Ticket stock. The readers processed documents at a rate of 1,500\ndocuments per minute, and checked each document, rejecting those it was not able to process\ncorrectly. The product became part of the RCA product line as a reader designed to process\n\"Turn around Documents\" such as those utility and insurance bills returned with payments.\n\nThe United States Postal Service has been using OCR machines to sort mail since 1965 based on\ntechnology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in\nEurope was by the British General Post Office (GPO). In 1965 it began planning an entire\nbanking system, the National Giro, using OCR technology, a process that revolutionized bill\npayment systems in the UK. Canada Post has been using OCR systems since 1971 [\u00a2/@ion needed]\nOCR systems read the name and address of the addressee at the first mechanised sorting center,\nand print a routing bar code on the envelope based on the postal code. To avoid confusion with\nthe human-readable address field which can be located anywhere on the letter, special ink\n(orange in visible light) is used that is clearly visible under ultraviolet light. Envelopes may then\nbe processed with equipment based on simple barcode readers.\n\nIn 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. 
and led\ndevelopment of the first omni-font optical character recognition system \u2014 a computer program\ncapable of recognizing text printed in any normal font. He decided that the best application of\nthis technology would be to create a reading machine for the blind, which would allow blind\npeople to have a computer read text to them out loud. This device required the invention of two\nenabling technologies \u2014 the CCD flatbed scanner and the text-to-speech synthesizer. On\nJanuary 13, 1976 the successful finished product was unveiled during a widely-reported news",
      "file_type": "pdf",
      "checksum": "356c670789fa185ed264cd273d53bced",
      "created": "1951-04-27T00:00:00Z",
      "modified": "2020-06-15T05:21:40.241Z",
      "storage_type": "unencrypted",
      "added": "2020-06-15T05:21:40.171Z",
      "filename": "0000002.pdf",
      "tags": []
    },
    "__exported_file_name__": "19510427000000-ni.pdf",
    "__exported_thumbnail_name__": "19510427000000-ni.pdf-thumbnail.png"
  },
]

Please advise! I've spent a lot of time trying to get this working and I think I'm close, but it's not there just yet.

proxseas commented 4 years ago

Edit: it looks like embedding the OCR'd text back into the PDF is not in scope for this project. I really think this should be documented clearly somewhere - perhaps on the project's GitHub home page?

Tooa commented 4 years ago

> Edit: it looks like embedding the OCR'd text back into the PDF is not in scope for this project. I really think this should be documented clearly somewhere - perhaps on the project's GitHub home page?

Thanks for the feedback. We are open to pull requests, if you'd like to contribute.

nirvdrum commented 4 years ago

@proxseas Please let me know if you find a solution. I just got Paperless up and running and was surprised to see the OCR results aren't integrated back into the PDF.

proxseas commented 4 years ago

@nirvdrum @Tooa I started using OCRmyPDF ([link](https://github.com/jbarlow83/OCRmyPDF to get around this shortcoming. I think it has slightly less functionality, but it has this key feature that I want - writing text back to the PDF. I highly recommend checking it out.
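
For reference, a minimal OCRmyPDF invocation looks roughly like this (the file names are placeholders; --language and --deskew are optional, see ocrmypdf --help for the rest):

# OCR a scanned PDF and write a new copy with a searchable text layer
ocrmypdf --language eng --deskew scanned-input.pdf searchable-output.pdf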

nirvdrum commented 4 years ago

@proxseas Thanks for the info. Are you running that as a post-processing step? Or did you stop using paperless entirely?

proxseas commented 4 years ago

@nirvdrum I stopped using paperless entirely. I currently run OCRmyPDF manually from the command line on all the files that need it, but I could easily automate this with 'inotifywait' on Linux or whatever the equivalent is on Windows.

Ultimately, I decided that paperless is significant overkill for me, and not having text in the PDF is just one more annoyance. I figured that if I have a good file/directory structure, I've won half the battle: I can search by file or directory name. Then there is searching the actual contents of the PDFs with pdfgrep, or with one of the many GUI full-text search applications like Agent Ransack on Windows or DocFetcher on Linux. Between these approaches, I should be able to find any document very quickly. Finally, there is tagging. It's a low priority for me, but I found a neat application called "TagSpaces" whose free version might suffice.

What I really like about the approach I ultimately took is that I have plain files sitting on my hard drive. No magic. No all-encompassing program that takes many incantations to get working (even if I did ultimately figure out how to run paperless). I think it's too easy to overcomplicate this whole thing, and offerings like paperless, Mayan EDMS, or even the very impressive Teedy.io are simply overkill when the primary consumer of the documents I'm organizing is myself. If multiple people are involved, especially less tech-savvy ones, then setting up these fancier applications might be worthwhile (so your end users don't have to type any scary commands). Hope this helped.
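
A rough, untested sketch of the inotifywait + OCRmyPDF automation described above (directory names are placeholders):

# watch an inbox folder and OCR every new PDF into an archive folder;
# --skip-text leaves pages that already contain a text layer untouched
inotifywait -m -e close_write,moved_to --format '%w%f' ~/scans/inbox \
  | while read -r file; do
      case "$file" in
        *.pdf) ocrmypdf --skip-text "$file" ~/scans/archive/"$(basename "$file")" ;;
      esac
    done

# afterwards, command-line full-text search works with pdfgrep
pdfgrep -i "invoice" ~/scans/archive/*.pdf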

kohfuchs commented 4 years ago

Hi @proxseas, you should try https://github.com/cmccambridge/ocrmypdf-auto - a Docker container that does exactly what you are doing manually. I have this in front of my paperless server, since its OCR process is much faster and more accurate than the one implemented in paperless (the OCR output folder is the paperless input folder). It's also a good way to avoid keeping paperless so busy and locking the database while processing the file. :)
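
A rough sketch of that chaining, in the spirit of the docker commands above (untested; the /input and /output mount points, the host paths, and the image name are assumptions here - check the ocrmypdf-auto README for the actual ones):

# ocrmypdf-auto watches one folder and drops OCR'd PDFs into another;
# point the output at the same directory the paperless consumer watches
# (mount points and image name below are assumptions - see the ocrmypdf-auto README)
docker run -d --name ocrmypdf-auto \
  -v /mnt/c/docker_shared_dirs/scans_in:/input \
  -v /mnt/c/docker_shared_dirs/paperless_consume_dir:/output \
  cmccambridge/ocrmypdf-auto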