jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Big PDF leads to undeleteable duplicate and unwanted error handling #75

Closed totti4ever closed 3 years ago

totti4ever commented 3 years ago

So, I threw a 40MB PDF into the consume dir, which apparently failed. When I now try to upload it again (no matter whether via web or consumption dir), I get this error in the docker log:

paperless-ng_webserver    | 19:31:11 [Q] ERROR Failed [Drucker (Canon) - Starter-Guide.pdf] - Not consuming Drucker (Canon) - Starter-Guide.pdf: It is a duplicate. : Traceback (most recent call last):
paperless-ng_webserver    |   File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
paperless-ng_webserver    |     res = f(*task["args"], **task["kwargs"])
paperless-ng_webserver    |   File "/usr/src/paperless/src/documents/tasks.py", line 69, in consume_file
paperless-ng_webserver    |     override_tag_ids=override_tag_ids)
paperless-ng_webserver    |   File "/usr/src/paperless/src/documents/consumer.py", line 84, in try_consume_file
paperless-ng_webserver    |     self.pre_check_duplicate()
paperless-ng_webserver    |   File "/usr/src/paperless/src/documents/consumer.py", line 49, in pre_check_duplicate
paperless-ng_webserver    |     "Not consuming {}: It is a duplicate.".format(self.filename)
paperless-ng_webserver    | documents.consumer.ConsumerError: Not consuming Drucker (Canon) - Starter-Guide.pdf: It is a duplicate.
paperless-ng_webserver    |
  1. I guess a duplicate is not supposed to lead to an error in the docker log and no reaction in the front-end?
  2. How am I supposed to get that doc into paperless now? :-) I couldn't find the appropriate table holding the checksum, and documents_document just doesn't hold this file.

The original log from the front-end was:

11/30/20, 4:23 PM INFO Consuming Drucker (Canon) - Starter-Guide.pdf

Nothing happened after that.

jonaswinkler commented 3 years ago

Paperless checks for duplicates as one of the first steps during consumption. It does this by comparing the file's checksum against the database.

The database is updated last, after everything else succeeded, and files have been moved into place. If files failed to move for some reason, there will be no database entry. This error means that paperless has a document with the same checksum in its database.
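The ordering described above can be illustrated with a small sketch. This is not the actual paperless code: `consume`, `known_checksums`, `store_files` and `DuplicateError` are hypothetical names, and an in-memory set stands in for the database.

```python
class DuplicateError(Exception):
    """Raised when a checksum is already recorded (hypothetical name)."""


def consume(checksum, known_checksums, store_files):
    """Illustrative consumption order; known_checksums stands in for the database."""
    # Step 1: the duplicate check runs before any other work.
    if checksum in known_checksums:
        raise DuplicateError("Not consuming: It is a duplicate.")

    # Step 2: parse the document and move files into place.
    # If this raises, the function exits before step 3.
    store_files()

    # Step 3: the "database" is updated last, only after everything succeeded.
    known_checksums.add(checksum)
```

With this ordering, a run that fails in step 2 leaves no entry behind, so a later duplicate error implies that some earlier run with the same checksum completed.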

You can check the admin, section failed tasks for errors.

totti4ever commented 3 years ago

> You can check the admin, section failed tasks for errors.

Ah great, I hadn't looked there. I found the first failed task:

No parsers abvailable for Drucker (Canon) - Starter-Guide.pdf : Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 69, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 94, in try_consume_file
raise ConsumerError(f"No parsers abvailable for {self.filename}")
documents.consumer.ConsumerError: No parsers abvailable for Drucker (Canon) - Starter-Guide.pdf

> Paperless checks for duplicates as one of the first steps during consumption. It does this by comparing the file's checksum against the database.

Which checksum algorithm is used? I tried sha256sum, but it didn't match. With the right algorithm I could check which file is the duplicate.

jonaswinkler commented 3 years ago

Would you mind sending that file over? It doesn't look confidential.

md5 is used since that's what the old paperless used and I didn't bother changing it, since that would involve some migrations that redo all the checksums.
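For anyone who wants to compare a file against the stored checksum by hand, a minimal sketch using only the Python standard library follows. The helper name `md5_of_file` and the chunk size are assumptions made for illustration, and how the value is stored in the database is not shown here.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large PDF is never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("Drucker (Canon) - Starter-Guide.pdf")` yields a hex string that can then be searched for among the stored checksums.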

totti4ever commented 3 years ago

Oh, found it: Detected document date 1980-11-20T00:00:00+13:00 based on string 20 tot 80

It was already part of another import, which is why I didn't find it in the logs before, sorry. I am still wondering how "tot" translates to November and how "20 tot 80" makes a date.

Starter manual: https://www.canon.de/support/consumer_products/products/fax__multifunctionals/laser/laserbase_mf_series/i-sensys_mf8340cdn.html?type=manuals&manualid=tcm:83-867528

So, I would be fine with closing this unless you want to investigate the error message further. In that case I would of course help!

jonaswinkler commented 3 years ago

In that case I'll close this. The date parsing library used matches many different dates and does a lot of guesswork, so it will eventually get some things wrong.

TOT is being interpreted as the Tonga time zone.

totti4ever commented 3 years ago

Wow, that explains the +13. And then the algorithm takes the 20 as the day and the 80 as the year (as neither can be a month) and derives the month from the current month.

THAT is magic :-D
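The guesswork described in this thread can be reproduced with a hand-written sketch. This is not the date parsing library's code, only an illustration of the interpretation discussed above: `guess_date` and `TZ_OFFSETS` are hypothetical names, and the table holds only the one abbreviation mentioned here.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: a one-entry table for illustration; TOT (Tonga) is UTC+13.
TZ_OFFSETS = {"tot": timedelta(hours=13)}


def guess_date(text, today):
    """Read '<day> <tz> <two-digit year>' and borrow the month from `today`."""
    tz = None
    numbers = []
    for token in text.lower().split():
        if token in TZ_OFFSETS:
            tz = timezone(TZ_OFFSETS[token])
        elif token.isdigit():
            numbers.append(int(token))
    if len(numbers) != 2:
        raise ValueError("expected exactly two numbers")

    day, two_digit_year = numbers
    # %y expands a two-digit year using the POSIX pivot (69-99 map to 19xx).
    year = datetime.strptime(f"{two_digit_year:02d}", "%y").year
    # No month was given, so the current month is used as a guess.
    return datetime(year, today.month, day, tzinfo=tz)


print(guess_date("20 tot 80", date(2020, 11, 30)).isoformat())
# prints 1980-11-20T00:00:00+13:00
```

The import happened on 11/30/20, so the borrowed month is November, which matches the detected date in the log.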