jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Some documents fail to be consumed due to "Device or resource busy" #1406

Open denilsonsa opened 2 years ago

denilsonsa commented 2 years ago

Describe the bug

About 3.5% of the documents fail to be consumed due to OSError: [Errno 16] Device or resource busy

My system - short version

Paperless-ng 1.5.0, running in Docker on Linux arm64, with the consumption directory pointing to a mounted SMB/CIFS share.

My system - long version

I have a shared folder Scans, and inside it I have a folder Scans/papeless/consume. The scanner can write the scanned PDFs directly to that folder. That folder is also mounted on the host Linux system running on the Raspberry Pi (through /etc/fstab with fstype cifs and options nobrl). All the paperless-related files are inside Scans/papeless/*.

Since the consumption directory is a shared network mount, I can't use filesystem notifications. Instead, I have PAPERLESS_CONSUMER_POLLING=30 set in docker-compose.env.

To Reproduce

  1. Scan a lot of documents into the consumption directory.
  2. Randomly, get unlucky (as I said, it happens rarely).
  3. Wait many minutes.
  4. Observe most of the documents were processed just fine, but one document was left in the consumption folder.

Expected behavior

I expected all documents to be consumed correctly.

If that wasn't possible, I expected some error message on the paperless-ng dashboard. Such an error should have a button to "retry" consuming the same document.

Even better, I expected paperless-ng to retry at least a couple of times on its own, without human intervention. (Or at least retry for a certain subset of errors.)
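To make the retry idea concrete, here is a minimal sketch of retrying only for a subset of errors. It is not paperless-ng's code: `retry_transient`, `TRANSIENT_ERRNOS`, and the default retry count and delay are hypothetical names and values picked for illustration, and which errno values deserve a retry is an assumption.

```python
import errno
import time

# Assumption: these are the error codes treated as "worth retrying".
TRANSIENT_ERRNOS = {errno.EBUSY, errno.EAGAIN}


def retry_transient(func, retries=3, delay=5.0, sleep=time.sleep):
    """Call func(); retry when it raises an OSError with a transient errno."""
    for attempt in range(retries + 1):
        try:
            return func()
        except OSError as exc:
            # Give up immediately on non-transient errors,
            # or once the retry budget is used up.
            if exc.errno not in TRANSIENT_ERRNOS or attempt == retries:
                raise
            sleep(delay)
```

A consumer could wrap the step that touches the file, for example `retry_transient(lambda: open(path, "rb").read())`, so that an occasional "Device or resource busy" is retried while other failures still surface right away.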

Webserver logs

These are the only two lines regarding the document that failed:

[2021-10-22 16:10:39,620] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/0122_211022161107_001.pdf to remain unmodified

[2021-10-22 16:10:44,638] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/0122_211022161107_001.pdf to the task queue.

All the other lines are from other documents that were processed just fine.

Django Q - Failed tasks

Educated guess

I assume the scanner takes a few seconds to write the full file to the consumption directory. Because paperless-ng polls the directory, it may detect the file before the scanner has finished writing it. I know paperless-ng has built-in logic to wait until the file stops changing before it starts processing; that's good, but maybe this logic is a bit flawed. The file may show no changes for a while yet still be locked by the scanner, which is still writing to it. Then, when the task tries to process the file, it fails because the resource is busy.
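If that guess is right, an "unmodified" check alone is not enough, and the file should also be openable before it is queued. The sketch below illustrates that idea under stated assumptions: `file_is_ready` is a hypothetical helper, not how paperless-ng actually decides a file is stable, and it assumes the busy error would show up when the file is opened or read.

```python
import errno
import os
import time


def file_is_ready(path, interval=5.0, sleep=time.sleep, opener=open):
    """Return True if path is unchanged over `interval` and can be read."""
    before = os.stat(path)
    sleep(interval)
    after = os.stat(path)
    if (before.st_size, before.st_mtime) != (after.st_size, after.st_mtime):
        return False  # size or mtime changed: still being written
    try:
        with opener(path, "rb") as handle:
            handle.read(1)
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return False  # unchanged, but something still holds the file
        raise
    return True
```

A polling loop would call this once per cycle and only add the file to the task queue when it returns True, leaving busy files for the next poll.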

Proposed solutions

Just some ideas worth investigating, in the order of my preference.

denilsonsa commented 2 years ago

Django Q documentation says:

You can resubmit a failed task back to the queue using the admins action menu.

So, to retry a failed task:

  1. Click on "Admin" at the paperless dashboard sidebar.
  2. Click on "Failed tasks", under "Django Q".
    • Alternatively, go directly to /admin/django_q/failure/
  3. Select the tasks you want to retry (by clicking on their checkboxes).
  4. Select "Resubmit selected tasks to queue" at the "Action" drop-down.
  5. Click "Go".

Screenshot of the Django Q admin interface

It would be great to have this added to the Paperless-ng documentation. (Even better if such tasks wouldn't fail in the first place; see my proposed solutions.)

(Sidenote: now I wonder if I should just delete all the old failed tasks from the admin interface. Would that cause any trouble?)