jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] 1.5.0 Task Queue Not Working for Consumption #1390

Open mrrodge2020 opened 2 years ago

mrrodge2020 commented 2 years ago

Describe the bug
Files that are still being written by the scanner when paperless discovers them are logged as added to the task queue but never get processed. Documents discovered afterwards are processed normally, and the stalled documents only get processed after a reboot.

To Reproduce
Scan a document to an SMB share used for consumption. Ensure paperless discovers it before the scanner has finished writing the file.

Expected behavior
The task queue should come back to the file later, allowing time for the scanner to release it.

Screenshots
N/A

Webserver logs

[2021-10-14 13:14:17,445] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/scan_20211014120914.pdf to remain unmodified

[2021-10-14 13:14:22,457] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/scan_20211014120914.pdf to the task queue.
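That second log line is the last thing that happens for the stalled file. For reference, a minimal sketch of the kind of "wait until the file stops changing" check those log lines describe might look like the following (illustrative names and timings only, not paperless's actual code):

import os
import time

def wait_until_unmodified(path, settle_seconds=5, timeout=300):
    # Poll size and mtime until they stop changing between checks,
    # roughly what the "Waiting for file ... to remain unmodified" line suggests.
    deadline = time.monotonic() + timeout
    last = os.stat(path)
    while time.monotonic() < deadline:
        time.sleep(settle_seconds)
        current = os.stat(path)
        if (current.st_size, current.st_mtime) == (last.st_size, last.st_mtime):
            return True   # file looks finished; safe to hand to the task queue
        last = current
    return False          # still being written; the caller should retry later

The problem reported here is that the hand-off to the task queue is logged, but nothing ever picks the task up afterwards until the container is restarted.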

Relevant information

docker-compose.yml:

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - 8123:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - /mnt/paperless/data:/usr/src/paperless/data
      - /mnt/paperless/media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - /mnt/paperless/consumption:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_CONSUMER_POLLING: 30

  gotenberg:
    image: thecodingmachine/gotenberg
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1

  tika:
    image: apache/tika
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:

danschdatsci commented 2 years ago

I'm running the same version (1.5.0), also in Docker. I'm running vsftpd rather than SMB, but this may apply to SMB too. Make sure the consume directory and the files inside it have the proper permissions. I was seeing the exact same log lines and noticed that the ftp service was writing files into the consume directory with only "rw" permissions, owned by the "ftp" account I had set up.

I was able to fix this by running the "setfacl" command suggested here: https://askubuntu.com/questions/969056/make-files-uploaded-by-vsftpd-automatically-inherit-owner-from-parent-directory

I ran: sudo setfacl -R -d -m u:my_root_docker_user:rwx /path/to/consume/dir

Replace "my_root_docker_user" with the username of your root account.

If that works for you, it may be a worthy addition to the FTP documentation for paperless.

Side note: I have no idea what the security implications of adding "x" are in a public/shared environment. I'm only running paperless and the ftp service on a private LAN, so I'm OK with it for my situation.
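For anyone who wants to rule permissions in or out quickly, a rough diagnostic run inside the webserver container (the consume path here is taken from the compose file above; this is not part of paperless) could be:

import os

consume_dir = "/usr/src/paperless/consume"

# Deleting a consumed file needs write+execute permission on the directory;
# reading it needs read permission on the file itself.
can_delete = os.access(consume_dir, os.W_OK | os.X_OK)
print(f"can delete from {consume_dir}: {can_delete}")

for name in os.listdir(consume_dir):
    path = os.path.join(consume_dir, name)
    print(f"{name}: readable={os.access(path, os.R_OK)}")

If any of those come back False for freshly scanned files, a default ACL like the setfacl command above is one way to fix it.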

mrrodge2020 commented 2 years ago

Thanks - most of it went straight over my head though. I'm running a Windows SMB share with full control permissions. The Docker host runs Debian and the share is mapped to a folder with 1000:1000 ownership. Nothing seems to have any access trouble; it's just that the queue never starts up again after an action gets postponed.

danschdatsci commented 2 years ago

Ah. My Docker host and ftp server are on the same Ubuntu box. That's why a lot of it probably didn't make sense.

mrrodge2020 commented 2 years ago

Ah right, haha. I genuinely don't think this is a permission issue, as other docs process OK - it's that the scanner is still writing the file when paperless picks it up, so it gets added to the queue to check back later, which then never happens.

smseidl commented 2 years ago

I feel like I'm having a similar problem, but I'm not seeing Paperless even identify that the files are there (i.e. no "Waiting for file" log entry). I have polling set to 30, so I don't know why they're not getting picked up. I'm running Paperless in Docker on a Raspberry Pi 4 with an SMB share mounted as the consume folder.
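One way to separate a mount problem from a paperless problem (a rough sketch, assuming the default container-side consume path used above) is to poll the consume directory from inside the container and see whether new scans show up at all:

import os
import time

consume_dir = "/usr/src/paperless/consume"
seen = set()

while True:
    current = set(os.listdir(consume_dir))
    for name in sorted(current - seen):
        # If new scans never show up here, the SMB mount (not paperless) is the problem.
        print("new file visible:", name)
    seen = current
    time.sleep(30)  # same 30 second interval as PAPERLESS_CONSUMER_POLLING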

CapoD commented 2 years ago

> Ah right, haha. I genuinely don't think this is a permission issue, as other docs process OK - it's that the scanner is still writing the file when paperless picks it up, so it gets added to the queue to check back later, which then never happens.

This is quite a useful hint; I believe it might explain a problem I've observed where Paperless fails to process large PDF files (~30 pages).