paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents
https://docs.paperless-ngx.com
GNU General Public License v3.0

[BUG] Task documents.tasks.consume_file raised unexpected OSError #3787

DanubeRS closed this issue 1 year ago

DanubeRS commented 1 year ago

Description

Paperless-ngx is unable to consume PDFs that contain barcode separator pages. The consume task reports this error:

paperless-webserver-1  | [2023-07-11 21:26:54,501] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[5643f69f-9cde-4138-934f-083722125e68] raised unexpected: OSError(5, 'Input/output error')
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
paperless-webserver-1  |     R = retval = fun(*args, **kwargs)
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 760, in __protected_call__
paperless-webserver-1  |     return self.run(*args, **kwargs)
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/tasks.py", line 99, in consume_file
paperless-webserver-1  |     if settings.CONSUMER_ENABLE_BARCODES and reader.separate(
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/barcodes.py", line 345, in separate
paperless-webserver-1  |     doc_paths = self.separate_pages(separator_pages)
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/barcodes.py", line 309, in separate_pages
paperless-webserver-1  |     shutil.copystat(self.file, savepath)
paperless-webserver-1  |   File "/usr/local/lib/python3.9/shutil.py", line 388, in copystat
paperless-webserver-1  |     _copyxattr(src, dst, follow_symlinks=follow)
paperless-webserver-1  |   File "/usr/local/lib/python3.9/shutil.py", line 330, in _copyxattr
paperless-webserver-1  |     names = os.listxattr(src, follow_symlinks=follow_symlinks)
paperless-webserver-1  | OSError: [Errno 5] Input/output error: '/usr/src/paperless/consume/BRNB422003F2683_000455.pdf'
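
The last frame of the traceback is `os.listxattr`, so the failing call can be probed on its own, outside of Paperless. The following is a minimal sketch, assuming a Linux environment with Python 3; `probe_xattr_listing` is a hypothetical helper name, not part of paperless-ngx. Run against a file on the affected share, it reports whether listing extended attributes succeeds or which errno comes back.

```python
import errno
import os
import tempfile


def probe_xattr_listing(path):
    """Return (ok, detail) for os.listxattr(path), the call that fails in the traceback."""
    if not hasattr(os, "listxattr"):
        # os.listxattr only exists on Linux builds of Python.
        return False, "os.listxattr is not available on this platform"
    try:
        names = os.listxattr(path, follow_symlinks=True)
    except OSError as exc:
        # Translate the numeric errno (for example 5) into its symbolic name (EIO).
        code = errno.errorcode.get(exc.errno, str(exc.errno))
        return False, f"{code}: {exc.strerror}"
    return True, f"{len(names)} extended attribute(s)"


if __name__ == "__main__":
    # Demo on a local temporary file; replace the path with a file in the
    # consume directory to test the network share instead.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as handle:
        print(probe_xattr_listing(handle.name))
```

If the probe fails with `EIO` on the share but succeeds on a local disk, that points at the mount rather than at the PDF itself.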


Documents are scanned on a Brother MFC and sent to a CIFS share. Documents without a barcode separator page are consumed fine; only those containing a separator page fail. The document does not appear corrupted when opened in a PDF viewer.

The document is sensitive, but I can provide a sample if required.

Steps to reproduce

  1. Configure barcode splitting.
  2. Scan a document to the CIFS share from the MFC.
  3. The document is detected, and Paperless attempts to read it.
  4. Processing fails with an I/O error, and the document remains in the share.

Webserver logs

[2023-07-11 03:45:18,358] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/consume/BRNB422003F2683_000459.pdf to the task queue.

[2023-07-11 03:45:18,379] [INFO] [paperless.management.consumer] Polling directory for changes: /usr/src/paperless/consume

[2023-07-11 03:45:18,687] [DEBUG] [paperless.barcodes] Scanning for barcodes using PYZBAR

[2023-07-11 03:45:20,255] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289801

[2023-07-11 03:45:20,595] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:20,596] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:20,596] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:20,596] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:20,596] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:20,596] [DEBUG] [paperless.barcodes] Barcode of type CODE128 found: ADAR-NEXTDOC

[2023-07-11 03:45:20,891] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289803

[2023-07-11 03:45:21,145] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289804

[2023-07-11 03:45:21,171] [DEBUG] [paperless.barcodes] Starting new document at idx 1

[2023-07-11 03:45:21,172] [DEBUG] [paperless.barcodes] Split into 2 new documents

[2023-07-11 03:45:21,173] [DEBUG] [paperless.barcodes] pdf no:0 has 1 pages

[2023-07-11 03:45:21,441] [DEBUG] [paperless.barcodes] Scanning for barcodes using PYZBAR

[2023-07-11 03:45:23,078] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289801

[2023-07-11 03:45:23,410] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:23,410] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:23,410] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:23,410] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:23,410] [DEBUG] [paperless.barcodes] Barcode of type QRCODE found: ADAR-NEXTDOC

[2023-07-11 03:45:23,411] [DEBUG] [paperless.barcodes] Barcode of type CODE128 found: ADAR-NEXTDOC

[2023-07-11 03:45:23,712] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289803

[2023-07-11 03:45:23,973] [DEBUG] [paperless.barcodes] Barcode of type I25 found: 070590289804

[2023-07-11 03:45:23,998] [DEBUG] [paperless.barcodes] Starting new document at idx 1

[2023-07-11 03:45:23,998] [DEBUG] [paperless.barcodes] Split into 2 new documents

[2023-07-11 03:45:24,007] [DEBUG] [paperless.barcodes] pdf no:0 has 1 pages

[2023-07-11 04:05:00,965] [DEBUG] [paperless.classifier] Gathering data from database...

[2023-07-11 04:05:01,545] [DEBUG] [paperless.classifier] 202 documents, 1 tag(s), 76 correspondent(s), 26 document type(s). 0 storage path(es)

[2023-07-11 04:05:01,545] [DEBUG] [paperless.classifier] Vectorizing data...

[2023-07-11 04:05:03,535] [DEBUG] [paperless.classifier] Training tags classifier...

[2023-07-11 04:05:04,419] [DEBUG] [paperless.classifier] Training correspondent classifier...

[2023-07-11 04:05:06,367] [DEBUG] [paperless.classifier] Training document type classifier...

[2023-07-11 04:05:07,744] [DEBUG] [paperless.classifier] There are no storage paths. Not training storage path classifier.

Browser logs

No response

Paperless-ngx version

1.16.5

Host OS

Ubuntu 20.04

Installation method

Docker - official image

Browser

No response

Configuration changes

No response

Other

No response

shamoon commented 1 year ago

I'd bet this is related to your CIFS setup. That's a disk I/O error, so it could also be permissions. Presumably it's not your hardware.

Not sure this is really a paperless bug; AFAIK the feature works fine in general.

DanubeRS commented 1 year ago

Does the split perform any write to the consume directory or modify the original file in place @shamoon?

All files in the directory are 755, so I think CIFS is the more likely culprit here. I found it curious that normal file ingestion works fine, but the error occurs consistently whenever a split is involved.
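
To confirm which filesystem actually backs the consume directory, the mount table can be checked. This is a rough sketch, assuming Linux and the usual `/proc/mounts` layout (device, mount point, filesystem type, options); `filesystem_type_for` and the `SAMPLE` text are made up for illustration, and what is visible inside a container may differ from the host.

```python
import os

# Made-up sample in /proc/mounts format, for illustration only.
SAMPLE = """\
overlay / overlay rw 0 0
//nas/scans /usr/src/paperless/consume cifs rw 0 0
"""


def filesystem_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    target = os.path.abspath(path) + "/"
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Prefer the deepest matching mount; on ties, the later entry wins.
        if target.startswith(prefix) and (best is None or len(mount_point) >= len(best[0])):
            best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    print(filesystem_type_for("/usr/src/paperless/consume/scan.pdf", SAMPLE))
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts", encoding="utf-8") as mounts:
            print(filesystem_type_for("/usr/src/paperless/consume", mounts.read()))
```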

shamoon commented 1 year ago

I believe temp files are created when PDFs are split on barcodes. I will note that https://github.com/paperless-ngx/paperless-ngx/pull/3551 updated this process to try to preserve more metadata; it's copystat that's complaining.
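
For context on why `copystat` can fail where a plain copy does not: besides permission bits and timestamps, `shutil.copystat` also tries to copy extended attributes on Linux, which is the `os.listxattr` call in the traceback. The sketch below shows one way such a call could be made tolerant of a filesystem that errors on xattrs. It is an illustration only, not the change made in the linked pull request and not paperless-ngx code; `copy_basic_stats` and `copystat_with_fallback` are hypothetical names.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_basic_stats(src, dst):
    """Copy permission bits and timestamps only; extended attributes are skipped."""
    shutil.copymode(src, dst)
    info = os.stat(src)
    os.utime(dst, ns=(info.st_atime_ns, info.st_mtime_ns))


def copystat_with_fallback(src, dst):
    """Try shutil.copystat; on OSError fall back to the basic copy above."""
    try:
        shutil.copystat(src, dst)
        return "full"
    except OSError:
        copy_basic_stats(src, dst)
        return "basic"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "scan.pdf"
        target = Path(tmp) / "split_0.pdf"
        source.write_bytes(b"original")
        target.write_bytes(b"split page")
        print(copystat_with_fallback(source, target))
```

The trade-off is that the fallback silently drops extended attributes, which may or may not be acceptable depending on what metadata needs to survive the split.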

DanubeRS commented 1 year ago

I updated my /etc/fstab to mount the share via NFS instead of CIFS, and the split now functions as intended.

Thanks for the feedback.
