jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars 355 forks source link

[BUG] Large Documents Do Not Get Consumed #1469

Open evilsidekick293 opened 2 years ago

evilsidekick293 commented 2 years ago

Documents do not get consumed/treated above a certain file size Importing a lot of scanned documents (400dpi) into paperless works almost always fine. The largest one was around 51MB in size. That worked. But two larger files (91MB and 127MB) cannot be imported. Neither by copying them into the consume folder on my NAS (Synology DS720+ with paperless as docker container), nor by dragging it onto the web-gui field.

To Reproduce Steps to reproduce the behavior:

  1. Go to web-gui
  2. Drag and drop a large file (around 90MB should work) onto the import field on the right. It also does not matter if there are already files being worked on or just this one large file.
  3. Wait a lot
  4. See that nothing appears as new in paperless

Expected behavior Well having it being imported would be nice.

Webserver logs The paperless logs are somewhat useless, I guess. This is, what I got:

[2021-12-05 13:14:10,043] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /data/consume
[2021-12-05 13:14:22,228] [INFO] [paperless.management.consumer] Adding /data/consume/some_file.pdf to the task queue.
[2021-12-05 13:14:22,591] [INFO] [paperless.consumer] Consuming some_file.pdf
[2021-12-05 13:14:22,595] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-12-05 13:14:22,605] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-12-05 13:14:22,609] [DEBUG] [paperless.consumer] Parsing some_file.pdf...
[2021-12-05 13:14:33,771] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/some_file.pdf
[2021-12-05 13:14:33,956] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/some_file.pdf', 'output_file': '/tmp/paperless/paperless-uzp2xhs6/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-uzp2xhs6/sidecar.txt'}

Relevant information

evilsidekick293 commented 2 years ago

Running on my Ubuntu this:

ocrmypdf --output-file pdfa --sidecar some_file_sidecar.txt -s -j 2 -c -d -r --rotate-pages-threshold 12.0 some_file.pdf some_file_ocr.pdf

This yields a usable result after some 30min.

evilsidekick293 commented 2 years ago

Well apparently the worker gets shutdown for some timeout reason (see this container log):

[cont-init.d] 70-aliases: exited 1.
[cont-init.d] 90-custom-folders: executing... 
[cont-init.d] 90-custom-folders: exited 0.
[cont-init.d] 99-custom-scripts: executing... 
[custom-init] no custom files found exiting...
[cont-init.d] 99-custom-scripts: exited 0.
[cont-init.d] done.
[services.d] starting services
Waiting for redis to become available...
Waiting for redis to become available...
Waiting for redis to become available...
[services.d] done.
350:C 05 Dec 2021 15:19:42.648 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
350:C 05 Dec 2021 15:19:42.648 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=350, just started
350:C 05 Dec 2021 15:19:42.648 # Configuration loaded
350:M 05 Dec 2021 15:19:42.650 * Running mode=standalone, port=6379.
350:M 05 Dec 2021 15:19:42.650 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
350:M 05 Dec 2021 15:19:42.650 # Server initialized
350:M 05 Dec 2021 15:19:42.650 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
350:M 05 Dec 2021 15:19:42.650 * Ready to accept connections
[2021-12-05 14:19:53 +0000] [349] [INFO] Starting gunicorn 20.1.0
[2021-12-05 14:19:53 +0000] [349] [INFO] Listening at: http://0.0.0.0:8000 (349)
[2021-12-05 14:19:53 +0000] [349] [INFO] Using worker: paperless.workers.ConfigurableWorker
[2021-12-05 14:19:53 +0000] [349] [INFO] Server is ready. Spawning workers
[2021-12-05 14:19:53,838] [INFO] [paperless.management.consumer] Adding /data/consume/huk_haftpflicht.pdf to the task queue.
14:19:53 [Q] INFO Enqueued 1
[2021-12-05 14:19:53,846] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /data/consume
14:19:53 [Q] INFO Q Cluster mike-solar-ten-thirteen starting.
14:19:53 [Q] INFO Process-1:1 ready for work at 398
14:19:53 [Q] INFO Process-1:2 ready for work at 399
14:19:53 [Q] INFO Process-1:3 monitoring at 400
14:19:53 [Q] INFO Process-1 guarding cluster mike-solar-ten-thirteen
14:19:53 [Q] INFO Process-1:4 pushing tasks at 401
14:19:53 [Q] INFO Q Cluster mike-solar-ten-thirteen running.
14:19:53 [Q] INFO Process-1:1 processing [huk_haftpflicht.pdf]
[2021-12-05 14:19:54,260] [INFO] [paperless.consumer] Consuming huk_haftpflicht.pdf
[2021-12-05 14:20:44,400] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:20:47,107] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:21:53 +0000] [349] [CRITICAL] WORKER TIMEOUT (pid:377)
[2021-12-05 14:21:53 +0000] [349] [WARNING] Worker with pid 377 was terminated due to signal 6
[2021-12-05 14:23:10,387] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:24:00,299] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:24:05,026] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:24:55,363] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:25:03,136] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:25:51,568] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
14:25:54 [Q] INFO Process-1:2 processing [nebraska-lamp-lion-fruit]
14:25:54 [Q] INFO Enqueued 1
14:25:54 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
14:25:54 [Q] INFO Process-1:2 stopped doing work
14:25:54 [Q] INFO Processed [nebraska-lamp-lion-fruit]
14:25:55 [Q] INFO recycled worker Process-1:2
14:25:55 [Q] INFO Process-1:5 ready for work at 670
[2021-12-05 14:25:56,406] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:26:47,694] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:26:53,140] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:27:44,303] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:27:52,397] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:28:43,928] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:29:19,721] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:29:36,839] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2021-12-05 14:35:34,485] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
14:35:56 [Q] INFO Enqueued 1
14:35:56 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
14:35:56 [Q] INFO Process-1:5 processing [table-oxygen-colorado-juliet]
14:35:56 [Q] INFO Process-1:5 stopped doing work
14:35:56 [Q] INFO Processed [table-oxygen-colorado-juliet]
14:35:56 [Q] INFO recycled worker Process-1:5
14:35:56 [Q] INFO Process-1:6 ready for work at 1031
[2021-12-05 14:44:06,742] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
14:45:58 [Q] INFO Enqueued 1
14:45:58 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
evilsidekick293 commented 2 years ago

Importing after running the ocr command (see above) is no issue at all.

evilsidekick293 commented 2 years ago

After some digging around I think I found the issue: The worker timeout is set to 30min by default and there is no way of changing that. Running paperless on a 2core NAS with OCR on quite large files, I think a configurable worker timeout would be appropriate. I created a pull-request making this configurable. I tested it with a new docker build.

siancu commented 2 years ago

Thanks for this PR, I've merged it in my fork as well. I also have a Synology NAS and this might come in handy soon as I have to ingest some documents which are quite big.

Marty commented 2 years ago

I'm running into the same problem. My hardware is quite weak, so a 37mb file already is a problem.

fahrer17 commented 2 years ago

So do I, I have a lot of documents which my Pentium powered NAS can't import, even small ones.

Edit: I run my paperless-ng (what a jewel imho! 🙂) with Docker in Openmediavault. Is there a way I can provide useful log files?

darkmattercoder commented 2 years ago

Same issue here. I scanned a document with 340 printed pages, leading to about 136mb in size. The OCR worker aparrently quit in the middle of the ocr job. But I got no CRITICAL timeout message.

The log in the web interface stops at the ocr command:

[2022-01-16 15:47:17,632] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-g8np8mo1', 'output_file': '/tmp/paperless/paperless-a62i91nw/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-a62i91nw/sidecar.txt'}

and in my container log I only see: paperless-ng | 16:17:10 [Q] WARNING reincarnated worker Process-1:368 after timeout

alexschroth commented 2 years ago

I had the same error with very similar log output. In my case, insufficient memory was the source of the problem. The Linux kernel's OOM killer kept shooting down the OCR processes, eventually causing the parent process to timeout.

Increasing the memory of the virtual machine running Docker from 2 to 4 GB solved the problem for me. This meant that importing PDFs up to about 60 MB was no longer a problem.

Perhaps the 6 GB of memory mentioned above is not enough. Is it possible for you to run the container (just for testing) on a system that has more RAM available? Alternatively, you could also try to increase the swapfile / swappartition.

For debugging your problem, information about the RAM usage during the analysis of the PDF might be helpful. Ideally, the large PDF to be analyzed should be the only one in the queue.

fahrer17 commented 2 years ago

I created a VM on Virtual Machine Manager running OMV and paperless-ng in Docker with:

CPU: 8x Single Core AMD EPYC-Milan (-SMP-) speed: 3819 MHz 
Kernel: 5.10.0-0.bpo.9-amd64 x86_64 Up: 7m Mem: 1996.1/7954.8 MiB (25.1%)

and tested 3 or 4 documents which failed on my main NAS. All of them were imported within minutes.

Edit: I can provide the documents if this is of any help since these are no personal documents but manuals.