jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[Other] Paperless-ng won't scan consume folder after initial setup #1602

Open noonesaid opened 2 years ago

noonesaid commented 2 years ago

I just installed paperless-ng using Portainer on my Ubuntu server. I have an SMB share mounted to a folder that is set as the paperless-ng consume directory. My scanner automatically sends scanned files to this SMB folder.

All the previous PDFs in the consume directory have been scanned. I went to scan in more files but noticed they were never consumed by paperless-ng. I checked the admin panel and there are only 4 tasks: check all email accounts, train the classifier, optimize the index, and perform sanity check.

When I added a documents.tasks.consume_file task (I didn't change any parameters in the new task other than the function), I got this error:

consume_file() missing 1 required positional argument: 'path' : Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
res = f(*task["args"], **task["kwargs"])
TypeError: consume_file() missing 1 required positional argument: 'path'

Does anyone know how to solve this? And is paperless-ng supposed to be scanning the folder automatically with the default settings?
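The TypeError above is what Python raises whenever a function with a required parameter is called without arguments. Here is a minimal sketch of that situation, assuming a stand-in `consume_file` (hypothetical, not the real paperless-ng function) and a task dictionary shaped like the one in the traceback line `f(*task["args"], **task["kwargs"])`:

```python
def consume_file(path):
    """Hypothetical stand-in for documents.tasks.consume_file; only the signature matters."""
    return f"consumed {path}"


def run_task(func, task):
    """Call func the way the traceback shows: f(*task["args"], **task["kwargs"])."""
    return func(*task["args"], **task["kwargs"])


# A task created with only the function name stores no arguments.
empty_task = {"args": (), "kwargs": {}}

try:
    run_task(consume_file, empty_task)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# The same call succeeds once a path is supplied (this path is made up).
ok_result = run_task(consume_file, {"args": ("/tmp/example.pdf",), "kwargs": {}})
print(ok_result)
```

The error suggests that this function expects to be handed one file path per call, so a scheduled task with no arguments cannot stand in for a folder watcher.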

woessmich commented 2 years ago

I am not 100% sure this is always the case, but I also have no explicit task listed for checking the consume directory. Paperless scans it by default, but may have trouble depending on the filesystem. Detection is either done automatically using inotify or set up manually with polling. Automatic mode did not work reliably for me, so I switched on polling every 60 seconds by adding PAPERLESS_CONSUMER_POLLING: 60 to the environment variables in Docker. See here: https://paperless-ng.readthedocs.io/en/latest/configuration.html#configuration-polling
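To illustrate what polling means here: instead of waiting for filesystem events, the directory is listed at a fixed interval and anything not seen before is treated as new. The sketch below shows only that idea. It is not paperless-ng's consumer code, and `find_new_files` and the temporary directory are made up for the example:

```python
import os
import tempfile


def find_new_files(directory, seen):
    """Return files in directory that are not yet in seen, and record them as seen."""
    new_files = []
    for entry in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, entry)
        if os.path.isfile(full_path) and full_path not in seen:
            seen.add(full_path)
            new_files.append(full_path)
    return new_files


# A temporary directory stands in for the consume folder.
with tempfile.TemporaryDirectory() as consume_dir:
    seen = set()
    first_scan = find_new_files(consume_dir, seen)  # nothing there yet

    with open(os.path.join(consume_dir, "scan1.pdf"), "wb") as handle:
        handle.write(b"%PDF-1.4 placeholder")

    second_scan = [os.path.basename(p) for p in find_new_files(consume_dir, seen)]
    third_scan = find_new_files(consume_dir, seen)  # already seen, not reported again

print(first_scan, second_scan, third_scan)
```

A real poller would call such a function in a loop with a sleep of the configured interval (60 seconds in the setting above). Because it only lists the directory, it does not depend on the filesystem delivering change notifications.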

noonesaid commented 2 years ago

Thank you woessmich. I have added that environment variable and redeployed the stack (I'm using Portainer), but it still does not automatically consume the files in the folder. I don't even know how to get it to manually consume what's in the folder. I can only drag and drop files to upload them to paperless-ng.

a17t commented 2 years ago

I had a similar scenario with the original paperless. If I remember correctly, for the consume directory paperless relies on inotify to recognize file changes.

SMB/CIFS, like most other network filesystems, either does not generate those events correctly or does not generate them at all.

woessmich commented 2 years ago

@noonesaid This is what my docker-compose.yml looks like. I am running paperless-ng in Docker on a Synology, using Portainer to deploy the stack. Note: I also use custom file naming, plus Gotenberg and Tika for Office documents, but none of that is required.

# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
#   as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8010.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
#
# To install and update paperless with this file, do the following:
#
# - Open portainer Stacks list and click 'Add stack'
# - Paste the contents of this file and assign a name, e.g. 'Paperless'
# - Click 'Deploy the stack' and wait for it to be deployed
# - Open the list of containers, select paperless_webserver_1
# - Click 'Console' and then 'Connect' to open the command line inside the container
# - Run 'python3 manage.py createsuperuser' to create a user
# - Exit the console
#
# For more extensive installation and update instructions, refer to the
# documentation.

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - /volume1/docker/paperless-ng/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8010:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - /volume1/docker/paperless-ng/data:/usr/src/paperless/data
      - /volume1/docker/paperless-ng/media:/usr/src/paperless/media
      - /volume1/docker/paperless-ng/export:/usr/src/paperless/export
      - /volume1/scratch/INCOMING:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
# The UID and GID of the user used to run paperless in the container. Set this
# to your UID and GID on the host so that you have write access to the
# consumption directory.
      USERMAP_UID: 1000
      USERMAP_GID: 100
# Additional languages to install for text recognition, separated by a
# whitespace. Note that this is
# different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines the
# language used for OCR.
# The container installs English, German, Italian, Spanish and French by
# default.
# See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
# for available languages.
      #PAPERLESS_OCR_LANGUAGES: tur ces
# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
      #PAPERLESS_SECRET_KEY: change-me
# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
      PAPERLESS_TIME_ZONE: Europe/Berlin
# The default language to use for OCR. Set this to the language most of your
# documents are written in.
      PAPERLESS_OCR_LANGUAGES: "eng deu"
      PAPERLESS_OCR_LANGUAGE: "deu" # most documents have this language

      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_FILENAME_FORMAT: "{created_year}/{correspondent}/{document_type}_{title}_{created}"
      PAPERLESS_CONSUMER_POLLING: 60
      PAPERLESS_CONSUMER_DELETE_DUPLICATES: 1
      PAPERLESS_CONSUMER_RECURSIVE: 1
      PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: 1
      PAPERLESS_OCR_MODE: skip
      PAPERLESS_CONSUMER_IGNORE_PATTERNS: '[".DS_STORE/*", "._*", ".stfolder/*","@eaDir/*"]'

  gotenberg:
    image: thecodingmachine/gotenberg:6
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1
      DEFAULT_WAIT_TIMEOUT: 30

  tika:
    image: apache/tika:1.27
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:

noonesaid commented 2 years ago

> I had a similar scenario with the original paperless. If I remember correctly, for the consume directory paperless relies on inotify to recognize file changes.
>
> SMB/CIFS, like most other network filesystems, either does not generate those events correctly or does not generate them at all.

Oh I see! Very interesting... I will try to change to NFS. I actually tried NFS first but couldn't get the permissions correct for some reason and paperless couldn't get write access. I'll play around with it again.

hav0ck commented 2 years ago

> I will try to change to NFS

@noonesaid Did you get a chance to try NFS? I am experiencing the same issue and was wondering if switching to NFS for the consume directory worked before giving this a try.

skorvek commented 2 years ago

The compose file listed above includes "PAPERLESS_CONSUMER_POLLING: 60", which means that paperless is not using inotify to trigger consumption but is instead polling the directory for changes. My system runs on CIFS with polling, because with inotify I got multiple consumption failures as it kept trying to grab partially uploaded files.
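One common way to avoid grabbing partially uploaded files is to check that a file has stopped changing before processing it. The sketch below shows that general technique only. Whether paperless-ng does anything similar internally is not stated in this thread, and `is_stable` and the simulated upload are hypothetical:

```python
import os
import tempfile
import time


def is_stable(path, wait_seconds=0.1, sleep=time.sleep):
    """Return True if the file's size and mtime are unchanged after waiting."""
    before = os.stat(path)
    sleep(wait_seconds)
    after = os.stat(path)
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


with tempfile.TemporaryDirectory() as consume_dir:
    scan_path = os.path.join(consume_dir, "scan.pdf")
    with open(scan_path, "wb") as handle:
        handle.write(b"first chunk")

    def upload_more(_seconds):
        # Simulates the scanner still writing to the file while we wait.
        with open(scan_path, "ab") as handle:
            handle.write(b"second chunk")

    # The file grows during the wait, so it is not safe to consume yet.
    stable_during_upload = is_stable(scan_path, sleep=upload_more)

    # Nothing writes to the file this time, so it is reported as stable.
    stable_after_upload = is_stable(scan_path, wait_seconds=0.05)

print(stable_during_upload, stable_after_upload)
```

The `sleep` parameter exists only so the example can simulate an in-progress upload deterministically. In practice the wait would be a real delay, chosen longer than the gaps between the scanner's writes.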