jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Consumer doesn't recover from crash #853

Open skuzzle opened 3 years ago

skuzzle commented 3 years ago

Describe the bug: It seems that the consumer doesn't recover from a crash while processing a PDF (see the logs below). Interestingly, after a restart of the application, it consumes the documents without a problem.

Maybe the initial crash is caused by my current document replication setup:

  1. My scanner saves the PDF to a local network SMB share
  2. A cron job on a local server rsyncs the PDFs every 5 minutes to a remote server on which paperless runs

Maybe rsync picked up an incomplete PDF because the scanner was still writing it. Although I should improve this process, I'd expect paperless to recover from a consumption error and maybe even put the offending document aside so that it doesn't keep trying to consume it forever.
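As a rough illustration of the two ideas in this comment (skip files that look incomplete, and set aside files that fail), here is a minimal Python sketch. It is not how paperless works: the function names and the `failed` directory are hypothetical. The `%%EOF` check is only a heuristic, based on the fact that a finished PDF starts with `%PDF-` and carries an `%%EOF` marker near its end.

```python
import shutil
from pathlib import Path


def looks_complete_pdf(path: Path, tail_bytes: int = 1024) -> bool:
    """Heuristic: True if the file starts with %PDF- and has %%EOF near the end."""
    size = path.stat().st_size
    with path.open("rb") as fh:
        head = fh.read(5)
        # Only inspect the last `tail_bytes` bytes for the end-of-file marker.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail


def set_aside(path: Path, failed_dir: Path) -> Path:
    """Move a problematic file into `failed_dir` (hypothetical quarantine folder)."""
    failed_dir.mkdir(parents=True, exist_ok=True)
    target = failed_dir / path.name
    shutil.move(str(path), str(target))
    return target
```

A sync script could call `looks_complete_pdf` before copying a file and retry on the next run if it returns `False`. A scanner that writes a valid PDF after every page would still pass this check, so it does not replace waiting for the file to stop changing.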

Webserver logs

[2021-03-24 07:25:44,766] [ERROR] [paperless.consumer] Error while consuming document scan080735.pdf: LeptonicaError: [2021-03-24 07:25:21,966] [ERROR] [ocrmypdf._exec.tesseract] [tesseract] Error during processing.
[2021-03-24 07:25:45,204] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/scan124856.pdf: File not found.
[2021-03-24 07:25:45,659] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/scan080530.pdf: File not found.
[2021-03-24 07:25:46,248] [INFO] [paperless.consumer] Consuming scan143749.pdf
[2021-03-24 07:25:46,252] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-03-24 07:25:46,400] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-03-24 07:25:46,411] [DEBUG] [paperless.consumer] Parsing scan143749.pdf...
[2021-03-24 07:25:46,475] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/scan143749.pdf
[2021-03-24 07:25:46,721] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/scan143749.pdf', 'output_file': '/tmp/paperless/paperless-cbnv8d2n/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-cbnv8d2n/sidecar.txt'}
[2021-03-24 07:25:47,647] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-cbnv8d2n
[2021-03-24 07:25:47,658] [ERROR] [paperless.consumer] Error while consuming document scan143749.pdf: ValueError: Number of processes must be at least 1

After that message, I see that a couple of new documents are added to the task queue, but they are never processed until I restart the application.

[2021-03-28 13:12:39,090] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/scan150849.pdf to remain unmodified
[2021-03-28 13:12:39,072] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/scan150915.pdf to remain unmodified
[2021-03-28 13:12:44,519] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/scan150915.pdf to the task queue.
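For context, the `ValueError` in the first log excerpt carries the message that Python's `multiprocessing` pool constructor raises when it is asked for fewer than one worker. The sketch below only reproduces that message with the standard library; `pool_error` is a hypothetical helper name. Why a worker count of zero would be requested here is not established in this thread. A truncated PDF with no readable pages is one guess, nothing more.

```python
from multiprocessing.pool import ThreadPool


def pool_error(workers: int) -> str:
    """Return the ValueError message from creating a pool, or '' on success."""
    try:
        with ThreadPool(workers) as pool:
            # Run a trivial task so a successfully created pool does some work.
            pool.map(abs, [-1])
        return ""
    except ValueError as exc:
        return str(exc)


print(pool_error(0))  # prints "Number of processes must be at least 1"
```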

Relevant information

jonaswinkler commented 3 years ago

If paperless stops processing files entirely after this error, this might actually be an issue with the task queue. Sadly I've failed to print useful debugging information at this point. This is something I'll change with the next version. Without that, I really can't say all that much.

So I'd propose to wait for the next release (just need some more testing and waiting on one of the open PRs), and post the error logs if it occurs again on 1.4.0.

skuzzle commented 3 years ago

This is the log entry when consuming such a bogus file. After that, the consumer won't pick up further documents until a service restart. Edit: it seems that the consumer actually does continue to do its work, at least in some cases. I'll have to observe this further.

[2021-04-16 06:19:54,499] [ERROR] [paperless.consumer] Error while consuming document scan175129.pdf: ValueError: Number of processes must be at least 1

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 232, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 373, in run_pipeline
    exec_concurrent(context)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 285, in exec_concurrent
    task_finished=update_page,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_concurrent.py", line 108, in exec_progress_pool
    initargs=(log_queue, task_initializer, logging.getLogger("").level),
  File "/usr/local/lib/python3.7/multiprocessing/dummy/__init__.py", line 124, in Pool
    return ThreadPool(processes, initializer, initargs)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 802, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 169, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 281, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: ValueError: Number of processes must be at least 1

mrrodge2020 commented 2 years ago

I'm having a similar issue, but other files do get processed. Mine seems to be related to the scanner: if the file is in use (scanner still writing data), I get the following in the logs:

[2021-10-14 13:14:17,445] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/scan_20211014120914.pdf to remain unmodified
[2021-10-14 13:14:22,457] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/scan_20211014120914.pdf to the task queue.

But then the documents never get picked up. Different files added after this are processed fine, just not the ones that were still being written to when they were discovered.

I'm on 1.5.0.

svenoone commented 2 years ago

I can confirm the same issue with my scanner (Canon MB5150 / MB5155 / MB5450). When the scanner writes the document, it writes it page by page.

If I use the filesystem polling of paperless, it immediately consumes the first page, because the scanner needs time to scan and write the additional pages. If I use the scheduled consumer every 30 seconds, it fails with the message [2022-02-18 10:52:12,546] [ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0002.pdf to remain unmodified.

If more details are needed, I can help; just ping me.
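The "remain unmodified" wait mentioned in the log messages above can be pictured with a small polling loop. This is a sketch under assumptions, not the paperless implementation: the function name, the parameter names and the default values are all hypothetical. It watches a file's size and modification time and reports success only after they have stayed the same for a quiet period.

```python
import os
import time


def wait_until_unmodified(path, quiet_seconds=5.0, timeout=60.0, poll=0.5):
    """Return True once size and mtime stay unchanged for `quiet_seconds`.

    Returns False if the file disappears or `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    last_seen = None
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return False
        current = (st.st_size, st.st_mtime_ns)
        now = time.monotonic()
        if current != last_seen:
            # The file changed (or this is the first look): restart the quiet timer.
            last_seen = current
            stable_since = now
        elif now - stable_since >= quiet_seconds:
            return True
        time.sleep(poll)
    return False
```

With a scanner that writes page by page, the quiet period has to be longer than the pause between two pages, otherwise the first page is picked up on its own. The timeout has to be longer than the whole scan, otherwise the wait gives up the way the error message above describes.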

guth007 commented 2 years ago

I have the same problem. As I don't have time to track down the problem in paperless, I have a little workaround which may help: