skuzzle opened this issue 3 years ago
If paperless stops processing files entirely after this error, this might actually be an issue with the task queue. Sadly I've failed to print useful debugging information at this point. This is something I'll change with the next version. Without that, I really can't say all that much.
So I'd propose waiting for the next release (it just needs some more testing, and I'm waiting on one of the open PRs) and posting the error logs if it occurs again on 1.4.0.
This is the log entry when consuming such a bogus file. After that, the consumer won't pick up further documents until a service restart. Edit: it seems that the consumer actually does continue to do its work, at least in some cases. I'll have to observe this further.
[2021-04-16 06:19:54,499] [ERROR] [paperless.consumer] Error while consuming document scan175129.pdf: ValueError: Number of processes must be at least 1
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 232, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 373, in run_pipeline
    exec_concurrent(context)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 285, in exec_concurrent
    task_finished=update_page,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_concurrent.py", line 108, in exec_progress_pool
    initargs=(log_queue, task_initializer, logging.getLogger("").level),
  File "/usr/local/lib/python3.7/multiprocessing/dummy/__init__.py", line 124, in Pool
    return ThreadPool(processes, initializer, initargs)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 802, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 169, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 281, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: ValueError: Number of processes must be at least 1
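For what it's worth, the innermost error is easy to reproduce in isolation; it is exactly what multiprocessing raises when a pool is created with zero workers. That ocrmypdf ends up requesting zero workers here (perhaps because the truncated PDF reports no pages to process) is my assumption from the traceback, not something the logs state:

```python
from multiprocessing.pool import ThreadPool

# Minimal sketch of what the bottom of the traceback amounts to, assuming
# ocrmypdf asked for a thread pool with zero workers:
pool = ThreadPool(processes=0)  # raises ValueError: Number of processes must be at least 1
```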
I'm having a similar issue, but other files do process. Mine seems to be related to the scanner - if the file is in use (scanner still writing data) I get the following in the logs:
[2021-10-14 13:14:17,445] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/scan_20211014120914.pdf to remain unmodified
[2021-10-14 13:14:22,457] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/scan_20211014120914.pdf to the task queue.
But then those documents never get picked up. Files added afterwards process fine, just not the ones that were still being written to when they were discovered.
I'm on 1.5.0.
I can confirm the same issue with my scanner, a Canon MB5150 / MB5155 / MB5450. When the scanner writes the document, it writes it page by page.
If I use paperless's filesystem polling, it immediately consumes the first page, because the scanner still needs time to scan and write the additional pages.
If I use the scheduled consumer every 30 seconds, it fails with the message
[2022-02-18 10:52:12,546] [ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0002.pdf to remain unmodified.
If more details are needed I can support you, just ping me.
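One possible mitigation (not a built-in paperless feature, just a sketch): have the scanner write into a staging directory and only move each PDF into the consume directory once its size has stopped changing for a while. The paths, the stability window, and the helper itself are my own assumptions:

```python
#!/usr/bin/env python3
"""Hypothetical helper: hold a scan in a staging directory until the scanner
has finished writing all pages, then move it into the paperless consume dir."""
import shutil
import time
from pathlib import Path

STAGING = Path("/staging")   # scanner writes here (placeholder path)
CONSUME = Path("/consume")   # paperless watches here (placeholder path)
STABLE_FOR = 30              # seconds without size changes before we move the file

def wait_until_stable(path: Path, interval: float = 5.0) -> None:
    """Block until the file size has not changed for STABLE_FOR seconds."""
    last_size, unchanged_since = -1, time.monotonic()
    while True:
        size = path.stat().st_size
        if size != last_size:
            last_size, unchanged_since = size, time.monotonic()
        elif time.monotonic() - unchanged_since >= STABLE_FOR:
            return
        time.sleep(interval)

for pdf in STAGING.glob("*.pdf"):
    wait_until_stable(pdf)
    # rename is atomic when staging and consume live on the same filesystem
    shutil.move(str(pdf), str(CONSUME / pdf.name))
```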
I have the same problem. As I don't have time to track down the problem in paperless, I have a little workaround which may help:
Describe the bug
It seems that the consumer doesn't recover from a crash during processing of a PDF (see the logs below). Interestingly, after a restart of the application, it consumes the documents without a problem.
Maybe the initial crash is caused by my current document replication setup:
Maybe rsync picked up an incomplete PDF because it was still being created by the scanner. Though I should improve this process, I'd expect paperless to recover from a consumption error and maybe even put the document in question aside so that it doesn't keep trying to consume it forever.
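One way to make the rsync hand-off safer, sketched under the assumption that the crashes come from truncated PDFs: check each synced file for a PDF end-of-file marker before moving it into the consume directory, and park anything incomplete instead of letting the consumer retry it forever. All directory names below are placeholders:

```python
#!/usr/bin/env python3
"""Hypothetical pre-flight check for synced scans: move complete-looking PDFs
into the consume dir and set incomplete ones aside."""
import shutil
from pathlib import Path

INBOX = Path("/rsync-target")     # where rsync drops the scans (placeholder)
CONSUME = Path("/consume")        # paperless consume directory (placeholder)
QUARANTINE = Path("/quarantine")  # truncated files are parked here (placeholder)

def looks_complete(pdf: Path) -> bool:
    """Heuristic: a finished PDF ends with an %%EOF marker near the end of the file."""
    tail = pdf.read_bytes()[-1024:]
    return b"%%EOF" in tail

for pdf in INBOX.glob("*.pdf"):
    target = CONSUME if looks_complete(pdf) else QUARANTINE
    shutil.move(str(pdf), str(target / pdf.name))
```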
Webserver logs
After that message I see that a couple of new documents are added to the task queue but they are never processed until I restart the application.
Relevant information