Closed totti4ever closed 3 years ago
Paperless checks for duplicates as one of the first steps during consumption. It does this by comparing the file's checksum against the database.
The database is updated last, after everything else has succeeded and the files have been moved into place. If the files failed to move for some reason, there will be no database entry. This error means that paperless has a document with the same checksum in its database.
You can check the admin interface, under the failed tasks section, for errors.
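The ordering described above can be sketched roughly as follows. This is a minimal illustration, not paperless's actual code: `consume`, `DuplicateError`, `known_checksums` (a plain set standing in for the database) and `move_into_place` are all hypothetical names. MD5 is used because it is the algorithm named later in this thread.

```python
import hashlib


class DuplicateError(Exception):
    """Raised when the checksum is already recorded (hypothetical name)."""


def consume(data, known_checksums, move_into_place):
    # Step 1: duplicate check, before any other work is done.
    checksum = hashlib.md5(data).hexdigest()
    if checksum in known_checksums:
        raise DuplicateError(f"a document with checksum {checksum} already exists")

    # Step 2: everything else (parsing, moving files). This may raise.
    move_into_place(data)

    # Step 3: the "database" is updated last, only if step 2 succeeded.
    known_checksums.add(checksum)
    return checksum
```

With this ordering, a consumption that fails in step 2 leaves no checksum behind, so a duplicate error on a later upload implies that some earlier consumption of the same bytes completed.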
Ah great, I hadn't had a look there. I found the first failed task:
No parsers abvailable for Drucker (Canon) - Starter-Guide.pdf : Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 69, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 94, in try_consume_file
raise ConsumerError(f"No parsers abvailable for {self.filename}")
documents.consumer.ConsumerError: No parsers abvailable for Drucker (Canon) - Starter-Guide.pdf
Which checksum algorithm is used? I tried sha256sum, but it didn't match. Then I could try to see which file is the duplicate.
Would you mind sending that file over? It doesn't look confidential.
md5 is used since that's what the old paperless used, and I didn't bother changing it because that would involve some migrations that redo all the checksums.
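To find the duplicate by hand, the MD5 checksum of a local file can be computed with Python's standard library and compared against the checksums paperless has stored. The helper name `file_md5` and the chunk size are assumptions made for this sketch; where the stored checksums can be looked up is not stated in this thread.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5("Drucker (Canon) - Starter-Guide.pdf")` should give the same value as running `md5sum` on that file, which is why `sha256sum` could not match.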
Oh, found it: Detected document date 1980-11-20T00:00:00+13:00 based on string 20 tot 80
It was already part of another import, which is why I didn't find it in the logs before, sorry. I am still wondering how tot translates to November and how 20 tot 80 makes a date.
Starter manual: https://www.canon.de/support/consumer_products/products/fax__multifunctionals/laser/laserbase_mf_series/i-sensys_mf8340cdn.html?type=manuals&manualid=tcm:83-867528
So, I would be fine with closing unless you want to further investigate the error message. Then I would help of course!
In that case I'll close this. The date parsing library used matches many different date formats and does a lot of guesswork, so it will eventually get some things wrong.
TOT is being interpreted as the Tonga time zone.
Wow, that explains the +13. And then the algorithm takes the 20 as the day and the 80 as the year (as neither can be a month) and derives the month from the current month.
THAT is magic :-D
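The guesswork described above can be reproduced with a small standard-library sketch. It is not the date parsing library's real algorithm, only an illustration of the interpretation discussed here. The function `guess_date`, the `TZ_OFFSETS` table and the rule that a two-digit year means 19xx are all assumptions; the current date is taken from the front-end log (11/30/20).

```python
import re
from datetime import date, datetime, timedelta, timezone

# Assumption: only the abbreviation from this thread is mapped.
TZ_OFFSETS = {"TOT": timedelta(hours=13)}  # Tonga time, UTC+13


def guess_date(text, today):
    """Read '<day> <tz abbreviation> <2-digit year>'; fill the month from today."""
    match = re.fullmatch(r"(\d{1,2})\s+([A-Za-z]+)\s+(\d{2})", text.strip())
    if match is None:
        return None
    day = int(match.group(1))
    zone = match.group(2).upper()
    year = int(match.group(3))
    if zone not in TZ_OFFSETS:
        return None
    # Assumption: two-digit years are read as 19xx.
    # The month is missing from the string, so the current month is used.
    return datetime(1900 + year, today.month, day, tzinfo=timezone(TZ_OFFSETS[zone]))


print(guess_date("20 tot 80", date(2020, 11, 30)).isoformat())
# prints 1980-11-20T00:00:00+13:00
```

The result matches the date in the log line above, which is consistent with the explanation that "tot" was taken as a time zone and not as a month.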
So, I threw a 40MB PDF into the consumption dir, which apparently failed. When I now try to upload it again (no matter whether via the web or the consumption dir), I get this error in the docker log:
The original log from the front end was:
11/30/20, 4:23 PM INFO Consuming Drucker (Canon) - Starter-Guide.pdf
Nothing happened after that.