jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

duplicate key value violates unique constraint #1462

Open ingorichter opened 2 years ago

ingorichter commented 2 years ago

Describe the bug

I've installed paperless in a docker container on my Synology NAS. I mounted the consume folder on my Mac. Once I drop a file into the consume folder, paperless starts processing the dropped file. Once the process finishes I get the error message mentioned in the title.

This doesn't happen when I upload the file via the web app.

To Reproduce

Steps to reproduce the behavior:

  1. Mount consume folder
  2. Drop file into the consume folder
  3. paperless starts processing the file
  4. notifications appear mentioning the error

Expected behavior

No error about the constraint violation

Webserver logs

[2021-12-01 21:14:35,593] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf to the task queue.

[2021-12-01 21:14:35,700] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf to the task queue.

[2021-12-01 21:14:35,719] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf to the task queue.

[2021-12-01 21:14:36,279] [INFO] [paperless.consumer] Consuming OpenOfficeProgrammierung-1.pdf

[2021-12-01 21:14:36,280] [INFO] [paperless.consumer] Consuming OpenOfficeProgrammierung-1.pdf

[2021-12-01 21:14:36,347] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2021-12-01 21:14:36,347] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2021-12-01 21:14:36,566] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2021-12-01 21:14:36,567] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2021-12-01 21:14:36,673] [DEBUG] [paperless.consumer] Parsing OpenOfficeProgrammierung-1.pdf...

[2021-12-01 21:14:36,678] [DEBUG] [paperless.consumer] Parsing OpenOfficeProgrammierung-1.pdf...

[2021-12-01 21:14:39,122] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf

[2021-12-01 21:14:39,134] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf

[2021-12-01 21:14:40,280] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf', 'output_file': '/tmp/paperless/paperless-n2s1fael/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-n2s1fael/sidecar.txt'}

[2021-12-01 21:14:40,281] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf', 'output_file': '/tmp/paperless/paperless-r2hzjrap/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-r2hzjrap/sidecar.txt'}

[2021-12-01 21:16:37,738] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file

[2021-12-01 21:16:37,738] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file

[2021-12-01 21:16:37,740] [DEBUG] [paperless.consumer] Generating thumbnail for OpenOfficeProgrammierung-1.pdf...

[2021-12-01 21:16:37,741] [DEBUG] [paperless.consumer] Generating thumbnail for OpenOfficeProgrammierung-1.pdf...

[2021-12-01 21:16:37,753] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-n2s1fael/archive.pdf[0] /tmp/paperless/paperless-n2s1fael/convert.png

[2021-12-01 21:16:37,754] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-r2hzjrap/archive.pdf[0] /tmp/paperless/paperless-r2hzjrap/convert.png

[2021-12-01 21:16:46,571] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-r2hzjrap/convert.png -out /tmp/paperless/paperless-r2hzjrap/thumb_optipng.png

[2021-12-01 21:16:46,699] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-n2s1fael/convert.png -out /tmp/paperless/paperless-n2s1fael/thumb_optipng.png

[2021-12-01 21:16:52,099] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2021-12-01 21:16:52,110] [DEBUG] [paperless.consumer] Saving record to database

[2021-12-01 21:16:52,219] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2021-12-01 21:16:52,230] [DEBUG] [paperless.consumer] Saving record to database

[2021-12-01 21:16:52,469] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf

[2021-12-01 21:16:52,922] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-r2hzjrap

[2021-12-01 21:16:52,925] [INFO] [paperless.consumer] Document 2010-03-16 OpenOfficeProgrammierung-1 consumption finished

[2021-12-01 21:16:52,935] [ERROR] [paperless.consumer] The following error occured while consuming OpenOfficeProgrammierung-1.pdf: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c240e6757436f0c1edbc354743ea2d57) already exists.

Traceback (most recent call last):

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c240e6757436f0c1edbc354743ea2d57) already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 287, in try_consume_file

    document = self._store(

  File "/usr/src/paperless/src/documents/consumer.py", line 382, in _store

    document = Document.objects.create(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 453, in create

    obj.save(force_insert=True, using=self.db)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 726, in save

    self.save_base(using=using, force_insert=force_insert,

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 763, in save_base

    updated = self._save_table(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 868, in _save_table

    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 906, in _do_insert

    return manager._insert(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert

    return query.get_compiler(using=using).execute_sql(returning_fields)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql

    cursor.execute(sql, params)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute

    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers

    return executor(sql, params, many, context)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__

    raise dj_exc_value.with_traceback(traceback) from exc_value

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

django.db.utils.IntegrityError: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c240e6757436f0c1edbc354743ea2d57) already exists.

[2021-12-01 21:16:53,028] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-n2s1fael

[2021-12-01 21:16:53,555] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/OpenOfficeProgrammierung-1.pdf: File not found.

[2021-12-02 00:24:24,463] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

user34756361233 commented 2 years ago

My hardware is the same, running Docker on a Synology DS918 / 12 GB / DSM 7.0.1-42218. I have mapped all my paths onto the Synology directory structure. I see the same error messages; sometimes the file is added, sometimes it's sent into oblivion...

my docker-compose file:

# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
#   as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8010.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
#
# To install and update paperless with this file, do the following:
#
# - Open portainer Stacks list and click 'Add stack'
# - Paste the contents of this file and assign a name, e.g. 'Paperless'
# - Click 'Deploy the stack' and wait for it to be deployed
# - Open the list of containers, select paperless_webserver_1
# - Click 'Console' and then 'Connect' to open the command line inside the container
# - Run 'python3 manage.py createsuperuser' to create a user
# - Exit the console
#
# For more extensive installation and update instructions, refer to the
# documentation.

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - /volume1/docker/paperless/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8010:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - /volume1/docker/paperless/data:/usr/src/paperless/data
      - /volume1/docker/paperless/media:/usr/src/paperless/media
      - /volume1/docker/paperless/export:/usr/src/paperless/export
      - /volume1/docker/paperless/consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_FILENAME_FORMAT: "{created_year}/{correspondent}/{title}"
# The UID and GID of the user used to run paperless in the container. Set this
# to your UID and GID on the host so that you have write access to the
# consumption directory.
      USERMAP_UID: 1000
      USERMAP_GID: 100
# Additional languages to install for text recognition, separated by a
# whitespace. Note that this is
# different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines the
# language used for OCR.
# The container installs English, German, Italian, Spanish and French by
# default.
# See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
# for available languages.
      PAPERLESS_OCR_LANGUAGES: nld
# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
      #PAPERLESS_SECRET_KEY: change-me
# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
      PAPERLESS_TIME_ZONE: Europe/Amsterdam
# The default language to use for OCR. Set this to the language most of your
# documents are written in.
      PAPERLESS_OCR_LANGUAGE: nld+eng

volumes:
  data:
  media:
  pgdata:

my log:

[2022-01-08 23:43:28,177] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,191] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,197] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,230] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,238] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,246] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,252] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,287] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,390] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,399] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.

[2022-01-08 23:43:28,454] [INFO] [paperless.consumer] Consuming label.pdf

[2022-01-08 23:43:28,454] [INFO] [paperless.consumer] Consuming label.pdf

[2022-01-08 23:43:28,705] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-08 23:43:28,706] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-08 23:43:31,470] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-08 23:43:31,470] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-08 23:43:31,480] [DEBUG] [paperless.consumer] Parsing label.pdf...

[2022-01-08 23:43:31,481] [DEBUG] [paperless.consumer] Parsing label.pdf...

[2022-01-08 23:43:34,847] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/label.pdf

[2022-01-08 23:43:34,867] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/label.pdf

[2022-01-08 23:43:38,644] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/label.pdf', 'output_file': '/tmp/paperless/paperless-wjkvsuzt/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'nld+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-wjkvsuzt/sidecar.txt'}

[2022-01-08 23:43:38,644] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/label.pdf', 'output_file': '/tmp/paperless/paperless-23r_htkv/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'nld+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-23r_htkv/sidecar.txt'}

[2022-01-08 23:43:52,565] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.

[2022-01-08 23:43:52,566] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.

[2022-01-08 23:43:53,456] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-wjkvsuzt/archive.pdf

[2022-01-08 23:43:53,457] [DEBUG] [paperless.consumer] Generating thumbnail for label.pdf...

[2022-01-08 23:43:53,470] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-wjkvsuzt/archive.pdf[0] /tmp/paperless/paperless-wjkvsuzt/convert.png

[2022-01-08 23:43:53,480] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-23r_htkv/archive.pdf

[2022-01-08 23:43:53,481] [DEBUG] [paperless.consumer] Generating thumbnail for label.pdf...

[2022-01-08 23:43:53,492] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-23r_htkv/archive.pdf[0] /tmp/paperless/paperless-23r_htkv/convert.png

[2022-01-08 23:43:57,129] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-23r_htkv/convert.png -out /tmp/paperless/paperless-23r_htkv/thumb_optipng.png

[2022-01-08 23:43:57,157] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-wjkvsuzt/convert.png -out /tmp/paperless/paperless-wjkvsuzt/thumb_optipng.png

[2022-01-08 23:44:04,074] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2022-01-08 23:44:04,087] [DEBUG] [paperless.consumer] Saving record to database

[2022-01-08 23:44:04,182] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2022-01-08 23:44:04,192] [DEBUG] [paperless.consumer] Saving record to database

[2022-01-08 23:44:06,336] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/src/../consume/label.pdf

[2022-01-08 23:44:07,367] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-23r_htkv

[2022-01-08 23:44:07,371] [INFO] [paperless.consumer] Document 2021-08-30 label consumption finished

[2022-01-08 23:44:07,747] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:08,156] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:07,378] [ERROR] [paperless.consumer] The following error occured while consuming label.pdf: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.

Traceback (most recent call last):

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 287, in try_consume_file

    document = self._store(

  File "/usr/src/paperless/src/documents/consumer.py", line 382, in _store

    document = Document.objects.create(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 453, in create

    obj.save(force_insert=True, using=self.db)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 726, in save

    self.save_base(using=using, force_insert=force_insert,

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 763, in save_base

    updated = self._save_table(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 868, in _save_table

    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 906, in _do_insert

    return manager._insert(

  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert

    return query.get_compiler(using=using).execute_sql(returning_fields)

  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql

    cursor.execute(sql, params)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute

    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers

    return executor(sql, params, many, context)

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__

    raise dj_exc_value.with_traceback(traceback) from exc_value

  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

django.db.utils.IntegrityError: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"

DETAIL:  Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.

[2022-01-08 23:44:08,261] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-wjkvsuzt

[2022-01-08 23:44:08,662] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:09,163] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:09,686] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:10,191] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:10,691] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.

[2022-01-08 23:44:11,257] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
DarrienG commented 2 years ago

I found this happened to me when I uploaded the same document twice. It uses a checksum for seeing if the files are the same or not. Filename does not matter. This does happen to me when using the web app.

I believe this is intentional and a feature.
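
To illustrate why the filename does not matter, here is a minimal sketch of checksum-based duplicate detection. It assumes MD5 (the checksums in the logs above are 32 hex characters, which is consistent with MD5), and it uses a hypothetical in-memory set in place of the database's unique `checksum` column. It is not paperless's actual code.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Return the MD5 hex digest of a file's contents (assumed algorithm)."""
    return hashlib.md5(data).hexdigest()


# Hypothetical in-memory stand-in for the unique checksum column.
seen_checksums = set()


def is_duplicate(data: bytes) -> bool:
    """Report whether identical contents were already registered."""
    checksum = file_checksum(data)
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False
```

Two files with different names but identical bytes produce the same digest, so the second one is reported as a duplicate.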

user34756361233 commented 2 years ago

> I found this happened to me when I uploaded the same document twice. It uses a checksum for seeing if the files are the same or not. Filename does not matter. This does happen to me when using the web app.
>
> I believe this is intentional and a feature.

I could understand it if that were the case, but it wasn't. There were about 100 unique files put in the consume folder, and each gave multiple errors. It looked to me as if different processes were taking in the same file and concluding it was already there or already gone… Doing this via the web interface gave the same behaviour.
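
The behaviour described here, two tasks handling the same new file, can be reproduced in miniature. The sketch below assumes that each task checks for an existing checksum up front and only inserts after the slow OCR step; that ordering is an assumption made for illustration, and the table and function names are hypothetical. It uses SQLite in memory rather than PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, checksum TEXT UNIQUE)"
)


def already_stored(checksum: str) -> bool:
    """Early duplicate check, run before the slow processing step."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE checksum = ?", (checksum,)
    ).fetchone()
    return row is not None


def store(checksum: str) -> str:
    """Insert the record; the UNIQUE constraint is the last line of defence."""
    try:
        conn.execute("INSERT INTO documents (checksum) VALUES (?)", (checksum,))
        return "stored"
    except sqlite3.IntegrityError:
        return "duplicate key"


checksum = "c9e2f5e02d99e5301c49d5602f552a27"
# Both tasks run their duplicate check before either has inserted anything.
task_a_ok = not already_stored(checksum)
task_b_ok = not already_stored(checksum)
# ... both tasks would spend time on OCR here ...
result_a = store(checksum) if task_a_ok else "skipped"
result_b = store(checksum) if task_b_ok else "skipped"
```

Both early checks pass because the file really is new, so the second insert is the one that trips the unique constraint.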

jas0420 commented 2 years ago

I have something similar happening... I am about 10 days into my Paperless journey (amazing software!) and have, so far, had my workflow set up two different ways. One is producing this unique constraint violation (occasionally, not every document) the other (as best I observed at the time) was not. These are all docs that are brand new to my Paperless environment, so definitely not rescanning something it has already seen.

Original setup:

  - Canon R40 scanner with OCR PDF/A enabled scanning at 300dpi
  - Fresh install of Paperless NG running in a Docker container on a Synology DS1813+
  - R40 scanner connected to a Windows PC
  - Windows PC has drive S: mapped to Synology shared folder "docker"
  - R40 scanner's output folder is to S:\paperless\data\consume
  - PAPERLESS_OCR_MODE=skip_noarchive

To the best of my knowledge, I was not seeing any unique constraint violations with the above setup. My OCR results were lackluster though and I was having to rekey a lot of dates in Paperless that I thought should have been readable. I tried letting Paperless do the OCR on the NAS and it was taking a couple of minutes for a document to process.

I had done about 450 or so bills / docs so far but still have a gazillion to go, so I tried injecting an external Tesseract process. I gave up trying to do it in a Docker container on the Windows PC that the scanner is attached to and eventually got it going on an Ubuntu box I have. For this, I used the package ocrmypdf.

So now the setup looks like this and I'm seeing the duplicate key violations:

  - Canon R40 scanner with OCR PDF/A disabled scanning at 300dpi
  - The same instance as above of Paperless NG running in a Docker container on a Synology DS1813+
  - R40 scanner connected to a Windows PC
  - Windows PC has drive O: mapped to ocrmypdf folder on the Ubuntu box
  - R40 scanner's output folder is to O:\In
  - On the Ubuntu box, the same "Docker" shared folder on the Synology is mounted to /mnt/ocr
  - ocrmypdf's output folder is /mnt/ocr/paperless/data/consume
  - PAPERLESS_OCR_MODE=skip_noarchive

This OCRs considerably faster than the NAS appears to be able to do. Perhaps if/when I get my mountain of old docs digitized I can just let the NAS do it and simplify the setup again, but while scanning in bulk right now speed is important to me.

All boxes are on gigabit (the NAS is actually 4Gb w/ 4 grouped gig ports) on a low traffic home network.

So from Paperless's standpoint, nothing really changed... From a workflow standpoint, for whatever reason, Windows dropping files into the Consume folder of Paperless seems to play nicer than when Ubuntu does it. It occasionally adds it to the queue twice and, from there, attempts to process it twice.

I suspect, but don't have the know-how to confirm, that Paperless is sometimes trying to grab a file a split second later while it is still in the process of being copied over. It seems like something about Ubuntu doing the copying isn't "locking" the file in the same manner that Windows does, and Paperless just sees the file on two different polling cycles (or maybe Paperless can use both events and polling to detect new docs, and both are being triggered in the Ubuntu process)... I'm obviously speculating... Pretty far out of my comfort zone when anything Linux is involved.

[2022-01-31 08:32:36,611] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/November 2018 Statement.pdf to the task queue.

[2022-01-31 08:32:36,627] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/November 2018 Statement.pdf to the task queue.

[2022-01-31 08:32:37,110] [INFO] [paperless.consumer] Consuming November 2018 Statement.pdf

[2022-01-31 08:32:37,116] [INFO] [paperless.consumer] Consuming November 2018 Statement.pdf

[2022-01-31 08:32:37,122] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-31 08:32:37,127] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-31 08:32:37,185] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-31 08:32:37,185] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-31 08:32:37,208] [DEBUG] [paperless.consumer] Parsing November 2018 Statement.pdf...

[2022-01-31 08:32:37,209] [DEBUG] [paperless.consumer] Parsing November 2018 Statement.pdf...
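
One common way to guard against picking up a file that is still being copied, as speculated above, is to wait until its size stops changing. The following is a minimal sketch under that assumption; `wait_until_stable` and its parameters are hypothetical names and are not part of paperless.

```python
import os
import time


def wait_until_stable(path: str, interval: float = 0.5,
                      checks: int = 3, timeout: float = 30.0) -> bool:
    """Return True once the file size is unchanged for `checks` consecutive polls."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # File vanished or is not readable yet; start counting again.
            last_size, stable = -1, 0
        else:
            if size == last_size:
                stable += 1
                if stable >= checks:
                    return True
            else:
                last_size, stable = size, 0
        time.sleep(interval)
    return False
```

A size check is only a heuristic: a stalled network copy can look finished, so the interval and number of checks have to be tuned to the setup.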

ingorichter commented 2 years ago

I experimented with a solution today and in the end I was successful. The reason this happens is that the INotify interface emits a lot of events for the same file. This is by design and in general it's not a problem. In this scenario, however, more than one task is created for the same file, and that ultimately leads to the issue we see here. My solution is a queue that debounces the events for a file. Instead of calling _consume directly in the INotify event handler, I add the file to the queue and start a timer; after a cool-off period (in my case 5 seconds seemed to be enough), _consume is called with the queued file. I'll create a PR with my changes and hope that it will solve the issue for others too.
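
The debouncing idea described above can be sketched roughly as follows. This is not the code from the PR: the `Debouncer` class and its method names are hypothetical, and the callback stands in for `_consume`. Each new event for a path restarts that path's cool-off timer, so a burst of events results in a single call.

```python
import threading


class Debouncer:
    """Collapse bursts of events for the same path into a single callback."""

    def __init__(self, callback, delay: float = 5.0):
        self._callback = callback  # stand-in for _consume
        self._delay = delay        # cool-off period in seconds
        self._timers = {}
        self._lock = threading.Lock()

    def notify(self, path: str) -> None:
        """Record an event for `path`, restarting its cool-off timer."""
        with self._lock:
            previous = self._timers.pop(path, None)
            if previous is not None:
                previous.cancel()
            timer = threading.Timer(self._delay, self._fire, args=(path,))
            timer.daemon = True
            self._timers[path] = timer
            timer.start()

    def _fire(self, path: str) -> None:
        with self._lock:
            # Skip a timer that was superseded while waiting for the lock.
            if self._timers.get(path) is not threading.current_thread():
                return
            del self._timers[path]
        self._callback(path)
```

In an event handler, `debouncer.notify(path)` would replace the direct call, so the file is handed over only once no further events have arrived for the length of the cool-off period.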

jas0420 commented 2 years ago

Nice work!!