Open ingorichter opened 2 years ago
My hardware is the same: Docker running on a Synology DS918 / 12 GB / DSM 7.0.1-42218. I have mapped all my paths onto the Synology directory structure. I see the same error messages; sometimes the file is added, sometimes it is sent into oblivion...
my docker-compose file:
# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
# as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8010.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
#
# To install and update paperless with this file, do the following:
#
# - Open portainer Stacks list and click 'Add stack'
# - Paste the contents of this file and assign a name, e.g. 'Paperless'
# - Click 'Deploy the stack' and wait for it to be deployed
# - Open the list of containers, select paperless_webserver_1
# - Click 'Console' and then 'Connect' to open the command line inside the container
# - Run 'python3 manage.py createsuperuser' to create a user
# - Exit the console
#
# For more extensive installation and update instructions, refer to the
# documentation.
version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped
  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - /volume1/docker/paperless/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8010:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - /volume1/docker/paperless/data:/usr/src/paperless/data
      - /volume1/docker/paperless/media:/usr/src/paperless/media
      - /volume1/docker/paperless/export:/usr/src/paperless/export
      - /volume1/docker/paperless/consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_FILENAME_FORMAT: "{created_year}/{correspondent}/{title}"
      # The UID and GID of the user used to run paperless in the container. Set this
      # to your UID and GID on the host so that you have write access to the
      # consumption directory.
      USERMAP_UID: 1000
      USERMAP_GID: 100
      # Additional languages to install for text recognition, separated by a
      # whitespace. Note that this is
      # different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines the
      # language used for OCR.
      # The container installs English, German, Italian, Spanish and French by
      # default.
      # See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
      # for available languages.
      PAPERLESS_OCR_LANGUAGES: nld
      # Adjust this key if you plan to make paperless available publicly. It should
      # be a very long sequence of random characters. You don't need to remember it.
      #PAPERLESS_SECRET_KEY: change-me
      # Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
      PAPERLESS_TIME_ZONE: Europe/Amsterdam
      # The default language to use for OCR. Set this to the language most of your
      # documents are written in.
      PAPERLESS_OCR_LANGUAGE: nld+eng
volumes:
  data:
  media:
  pgdata:
my log:
[2022-01-08 23:43:28,177] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,191] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,197] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,230] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,238] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,246] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,252] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,287] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,390] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,399] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/label.pdf to the task queue.
[2022-01-08 23:43:28,454] [INFO] [paperless.consumer] Consuming label.pdf
[2022-01-08 23:43:28,454] [INFO] [paperless.consumer] Consuming label.pdf
[2022-01-08 23:43:28,705] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-08 23:43:28,706] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-08 23:43:31,470] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-08 23:43:31,470] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-08 23:43:31,480] [DEBUG] [paperless.consumer] Parsing label.pdf...
[2022-01-08 23:43:31,481] [DEBUG] [paperless.consumer] Parsing label.pdf...
[2022-01-08 23:43:34,847] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/label.pdf
[2022-01-08 23:43:34,867] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/label.pdf
[2022-01-08 23:43:38,644] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/label.pdf', 'output_file': '/tmp/paperless/paperless-wjkvsuzt/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'nld+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-wjkvsuzt/sidecar.txt'}
[2022-01-08 23:43:38,644] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/label.pdf', 'output_file': '/tmp/paperless/paperless-23r_htkv/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'nld+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-23r_htkv/sidecar.txt'}
[2022-01-08 23:43:52,565] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-01-08 23:43:52,566] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-01-08 23:43:53,456] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-wjkvsuzt/archive.pdf
[2022-01-08 23:43:53,457] [DEBUG] [paperless.consumer] Generating thumbnail for label.pdf...
[2022-01-08 23:43:53,470] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-wjkvsuzt/archive.pdf[0] /tmp/paperless/paperless-wjkvsuzt/convert.png
[2022-01-08 23:43:53,480] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-23r_htkv/archive.pdf
[2022-01-08 23:43:53,481] [DEBUG] [paperless.consumer] Generating thumbnail for label.pdf...
[2022-01-08 23:43:53,492] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-23r_htkv/archive.pdf[0] /tmp/paperless/paperless-23r_htkv/convert.png
[2022-01-08 23:43:57,129] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-23r_htkv/convert.png -out /tmp/paperless/paperless-23r_htkv/thumb_optipng.png
[2022-01-08 23:43:57,157] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-wjkvsuzt/convert.png -out /tmp/paperless/paperless-wjkvsuzt/thumb_optipng.png
[2022-01-08 23:44:04,074] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-01-08 23:44:04,087] [DEBUG] [paperless.consumer] Saving record to database
[2022-01-08 23:44:04,182] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2022-01-08 23:44:04,192] [DEBUG] [paperless.consumer] Saving record to database
[2022-01-08 23:44:06,336] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/src/../consume/label.pdf
[2022-01-08 23:44:07,367] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-23r_htkv
[2022-01-08 23:44:07,371] [INFO] [paperless.consumer] Document 2021-08-30 label consumption finished
[2022-01-08 23:44:07,747] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:08,156] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:07,378] [ERROR] [paperless.consumer] The following error occured while consuming label.pdf: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"
DETAIL: Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"
DETAIL: Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 287, in try_consume_file
document = self._store(
File "/usr/src/paperless/src/documents/consumer.py", line 382, in _store
document = Document.objects.create(
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 453, in create
obj.save(force_insert=True, using=self.db)
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 726, in save
self.save_base(using=using, force_insert=force_insert,
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 763, in save_base
updated = self._save_table(
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 868, in _save_table
results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 906, in _do_insert
return manager._insert(
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "documents_document_checksum_75209391_uniq"
DETAIL: Key (checksum)=(c9e2f5e02d99e5301c49d5602f552a27) already exists.
[2022-01-08 23:44:08,261] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-wjkvsuzt
[2022-01-08 23:44:08,662] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:09,163] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:09,686] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:10,191] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:10,691] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
[2022-01-08 23:44:11,257] [ERROR] [paperless.consumer] Cannot consume /usr/src/paperless/src/../consume/label.pdf: File not found.
I found this happened to me when I uploaded the same document twice. It uses a checksum for seeing if the files are the same or not. Filename does not matter. This does happen to me when using the web app.
I believe this is intentional and a feature.
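To illustrate why the filename does not matter, here is a minimal sketch of a content-based duplicate check. The checksum in the log above is 32 hex characters, which is consistent with MD5, so the sketch assumes MD5; the function names `file_checksum` and `is_duplicate` are made up for illustration and are not taken from the Paperless source.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the hex MD5 digest of the file's bytes (MD5 is an assumption)."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large scans do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_checksums: set) -> bool:
    """Hypothetical helper: True if a file with identical content was seen before."""
    return file_checksum(path) in known_checksums


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "label.pdf"
        renamed = Path(tmp) / "label-copy.pdf"
        first.write_bytes(b"identical content")
        renamed.write_bytes(b"identical content")
        known = {file_checksum(first)}
        # Same bytes under a different name still count as a duplicate.
        print(is_duplicate(renamed, known))
```

Two tasks working on the same file would compute the same checksum, which would match the unique constraint violation shown in the traceback.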
I can understand if this is the case, but it wasn’t. There were about 100 unique files put in the consume folder, and each gave multiple errors. It looked to me as if different processes were taking in the file and concluding it was already there or already gone… Doing this via the web interface gave the same behaviour.
I have something similar happening... I am about 10 days into my Paperless journey (amazing software!) and have, so far, had my workflow set up two different ways. One is producing this unique constraint violation (occasionally, not on every document); the other (as best I observed at the time) was not. These are all docs that are brand new to my Paperless environment, so definitely not rescanning something it has already seen.
Original setup:
- Canon R40 scanner with OCR PDF/A enabled scanning at 300dpi
- Fresh install of Paperless NG running in a Docker container on a Synology DS1813+
- R40 scanner connected to a Windows PC
- Windows PC has drive S: mapped to Synology shared folder "docker"
- R40 scanner's output folder is to S:\paperless\data\consume
- PAPERLESS_OCR_MODE=skip_noarchive
To the best of my knowledge, I was not seeing any unique constraint violations with the above setup. My OCR results were lackluster though and I was having to rekey a lot of dates in Paperless that I thought should have been readable. I tried letting Paperless do the OCR on the NAS and it was taking a couple of minutes for a document to process.
I had done about 450 or so bills / docs so far but still have a gazillion to go, so I tried injecting an external Tesseract process. I gave up trying to do it in a Docker container on the Windows PC where the scanner hangs off of and eventually got it going on an Ubuntu box I have. For this, I used the package ocrmypdf.
So now the setup looks like this and I'm seeing the duplicate key violations:
- Canon R40 scanner with OCR PDF/A disabled scanning at 300dpi
- The same instance as above of Paperless NG running in a Docker container on a Synology DS1813+
- R40 scanner connected to a Windows PC
- Windows PC has drive O: mapped to ocrmypdf folder on the Ubuntu box
- R40 scanner's output folder is to O:\In
- On the Ubuntu box, the same "Docker" shared folder on the Synology is mounted to /mnt/ocr
- ocrmypdf's output folder is /mnt/ocr/paperless/data/consume
- PAPERLESS_OCR_MODE=skip_noarchive
This OCRs considerably faster than the NAS appears to be able to do. Perhaps if/when I get my mountain of old docs digitized I can just let the NAS do it and simplify the setup again, but while scanning in bulk right now speed is important to me.
All boxes are on gigabit (the NAS is actually 4Gb w/ 4 grouped gig ports) on a low traffic home network.
So from Paperless's standpoint, nothing really changed... From a workflow standpoint, for whatever reason, Windows dropping files into the Consume folder of Paperless seems to play nicer than when Ubuntu does it. It occasionally adds it to the queue twice and, from there, attempts to process it twice.
I suspect, but don't have the know-how to confirm, that Paperless is sometimes trying to grab a file a split second later that is still in the process of being copied over. Seems like something about Ubuntu doing the copying isn't "locking" the file in the same manner that Windows does and Paperless just sees the file on two different polling cycles (or maybe Paperless can use events and polling to detect new docs and both are being triggered in the Ubuntu process)... I'm obviously speculating... Pretty far out of my comfort zone when anything linux is involved.
`[2022-01-31 08:32:36,611] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/November 2018 Statement.pdf to the task queue.
[2022-01-31 08:32:36,627] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/November 2018 Statement.pdf to the task queue.
[2022-01-31 08:32:37,110] [INFO] [paperless.consumer] Consuming November 2018 Statement.pdf
[2022-01-31 08:32:37,116] [INFO] [paperless.consumer] Consuming November 2018 Statement.pdf
[2022-01-31 08:32:37,122] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-31 08:32:37,127] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-31 08:32:37,185] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-31 08:32:37,185] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-31 08:32:37,208] [DEBUG] [paperless.consumer] Parsing November 2018 Statement.pdf...
[2022-01-31 08:32:37,209] [DEBUG] [paperless.consumer] Parsing November 2018 Statement.pdf...`
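If the speculation above is right and the file is sometimes picked up while it is still being written, one common workaround is to wait until the file size stops changing before handing it over. The following is only a sketch under that assumption; `wait_until_stable` and its parameters are hypothetical and not part of Paperless.

```python
import os
import time


def wait_until_stable(path: str, checks: int = 3, interval: float = 0.5,
                      timeout: float = 30.0) -> bool:
    """Return True once the file is non-empty and its size has stayed the same
    for `checks` consecutive re-polls; False on timeout or if the file vanishes."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            # The file is gone, e.g. another worker already consumed it.
            return False
        if size == last_size and size > 0:
            stable += 1
            if stable >= checks:
                return True
        else:
            # Size changed (or first poll): restart the stability count.
            stable = 0
            last_size = size
        time.sleep(interval)
    return False
```

A caller would run something like `if wait_until_stable(path): consume(path)`, where `consume` stands in for whatever actually processes the file. A size check cannot prove the writer has finished, so this only reduces the chance of reading a partial file.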
I experimented with a solution today and in the end I was successful. The reason why this happens is that the `INotify` interface emits a lot of events for the same file. This is by design and in general it's not a problem. In this scenario, however, more than one task is created for the same file, and that ultimately leads to the issue we see here. My solution is a queue that debounces the events for a file. Instead of calling `_consume` directly in the `INotify` event handler, I add the file to the queue and start a timer that calls `_consume` with the queued file after a cool-off period (in my case 5 seconds seemed to be enough). I'll create a PR with my changes and hope that it will solve the issue for others too.
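For anyone curious what such a debounce could look like, here is a rough sketch. It is not the code from the PR: the class name `DebouncedQueue`, the `notify` method and the callback wiring are all made up for illustration, and only the idea (restart a cool-off timer per file, act once when it expires) comes from the description above.

```python
import threading
from typing import Callable, Dict


class DebouncedQueue:
    """Collapse a burst of events for one path into a single delayed callback."""

    def __init__(self, callback: Callable[[str], None], delay: float = 5.0) -> None:
        self._callback = callback
        self._delay = delay
        self._latest: Dict[str, int] = {}  # path -> token of its newest event
        self._counter = 0
        self._lock = threading.Lock()

    def notify(self, path: str) -> None:
        """Record an event for `path` and (re)start its cool-off period."""
        with self._lock:
            self._counter += 1
            token = self._counter
            self._latest[path] = token
        timer = threading.Timer(self._delay, self._fire, args=(path, token))
        timer.daemon = True
        timer.start()

    def _fire(self, path: str, token: int) -> None:
        with self._lock:
            if self._latest.get(path) != token:
                # A newer event arrived during the cool-off; its timer will act.
                return
            del self._latest[path]
        self._callback(path)
```

In the event handler you would call `queue.notify(filepath)` instead of `_consume(filepath)`, with `_consume` passed in as the callback. Each event starts a short-lived timer thread and only the newest one per file does any work, which keeps the sketch simple at the cost of a few idle threads during a burst.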
Nice work!!
Describe the bug I've installed paperless in a docker container on my Synology NAS. I mounted the `consume` folder on my Mac. Once I drop a file into the `consume` folder, paperless starts processing the dropped file. Once the process finishes I get the error message mentioned in the title. This doesn't happen when I upload the file via the web app.
Expected behavior No error about the constraint violation