[Other] Delay before consume kicks in

jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents

https://paperless-ng.readthedocs.io/en/latest/

GNU General Public License v3.0

5.37k stars 356 forks

[Other] Delay before consume kicks in #1286

Open muppie opened 3 years ago

muppie commented 3 years ago

Hi, is there a way to add a delay before the consume folder kicks in? I upload all documents via Nextcloud and then consume them into Paperless-ng, but sometimes the file is only partly uploaded when Paperless-ng finds it.

If it is not possible yet, I would like to add this as a feature :)

Gallardo26 commented 2 years ago

Yes, I face a similar issue with my scanner/printer. I'm trying to scan multiple pages, and the printer writes and appends each page to the same PDF file. However, before I can scan and append the second page, Paperless has already consumed the PDF, so my printer writes a new PDF.

Suggested features: 1) An option to set the minutes, hours, day, or time of day at which the consume task is triggered. 2) A button (on the dashboard?) to manually trigger the consume task.

kleinweby commented 2 years ago

For the polling implementation, there already exists a mechanism that checks if the "file has settled": _consume_wait_unmodified

As a quick workaround I've switched to the polling mode for now.

kleinweby commented 2 years ago

Oh, my problem is a bit different, although still fixed by using the poller. I expose the consume folder via Samba and use Finder (macOS) to copy in the files. For some reason this first creates an empty file, which triggers the inotify part, before the file is actually filled.

Not sure if my problem is actually something to be fixed or worked around in paperless...

Gallardo26 commented 2 years ago

> For the polling implementation, there already exists a mechanism that checks if the "file has settled": _consume_wait_unmodified
>
> As a quick workaround I've switched to the polling mode for now.

May I know how you switched to the polling mode?

kleinweby commented 2 years ago

By using PAPERLESS_CONSUMER_POLLING

nbently commented 2 years ago

I would try playing around with PAPERLESS_CONSUMER_POLLING_RETRY_COUNT and PAPERLESS_CONSUMER_POLLING_DELAY in addition to PAPERLESS_CONSUMER_POLLING. I found the delay env variable when sifting through the code, and it doesn't look like it's documented anywhere, but it did seem to delay the attempt to ingest a file by x seconds after finding a new one in the directory.

strayer commented 2 years ago

I had trouble with this too. My scanner (Canon MB5450) creates an empty file first and then scans the page, appends it to the file, rinse and repeat for each page.

Without polling the import is completely borked; with polling it accepts the empty file as "unchanged" too early, before the scanner manages to save the first page. I settled on these parameters:

  PAPERLESS_CONSUMER_POLLING: "5"
  PAPERLESS_CONSUMER_POLLING_DELAY: "30"

While this makes importing anything but instant, importing documents is 100% stable for me now. As far as I can see, PAPERLESS_CONSUMER_POLLING_DELAY specifies how long the importer waits after each change to the file's modification timestamp (mtime). In my case, if the mtime hasn't changed after 30 seconds, paperless assumes the file is finished. If it has changed, it waits another 30 seconds and repeats this process.
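For anyone trying to picture the behaviour described in this thread, here is a rough Python sketch of a "wait until the mtime stops changing" loop. It is not paperless-ng's actual code: the function name `wait_until_settled` and its parameters are made up for illustration, with `delay` and `retry_count` only loosely standing in for PAPERLESS_CONSUMER_POLLING_DELAY and PAPERLESS_CONSUMER_POLLING_RETRY_COUNT.

```python
import os
import time


def wait_until_settled(path, delay=30.0, retry_count=5, sleep=time.sleep):
    """Return True once the file's mtime is unchanged across one delay period.

    Hypothetical helper for illustration only, not the paperless-ng
    implementation. `sleep` is injectable so the loop can be exercised
    without real waiting.
    """
    last_mtime = None
    for _ in range(retry_count):
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            # The file vanished while we were waiting, so there is nothing to consume.
            return False
        if mtime == last_mtime:
            # No modification since the previous check: treat the file as complete.
            return True
        # First check, or the file was modified again: remember the new mtime and wait.
        last_mtime = mtime
        sleep(delay)
    # The file kept changing for every attempt: give up.
    return False
```

With a 30-second delay, a scanner that appends a page every 10 seconds or so keeps resetting the wait, so the file would only be picked up one full delay period after the last page is written. The total waiting time is bounded by roughly `delay * retry_count`, so a long scan may call for a higher retry count as well.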