jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars 355 forks source link

[BUG] Saving a pdf directly from a Pdf Editor in the consume folder initiate a double import job ending in an error #1084

Open Cartache opened 3 years ago

Cartache commented 3 years ago

During the import process of all my documents, I encountered some errors. In order, to "help" the import process I ended up using PDF-XChange Editor to save a modified version of the document directly in the Import folder.

This operation initiate (or seems to initiate, see screenshots) a double consume job which ends in an error "database locked".

It is important to note that the document is finally consumed even if there is an error.

To reproduce:

[2021-06-02 10:36:35,717] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/IMG_1597.pdf to the task queue.

[2021-06-02 10:36:35,932] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/IMG_1597.pdf to the task queue.

[2021-06-02 10:36:39,630] [INFO] [paperless.consumer] Consuming IMG_1597.pdf

[2021-06-02 10:36:39,630] [INFO] [paperless.consumer] Consuming IMG_1597.pdf

[2021-06-02 10:36:39,638] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2021-06-02 10:36:39,638] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2021-06-02 10:36:39,658] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2021-06-02 10:36:39,658] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2021-06-02 10:36:39,660] [DEBUG] [paperless.consumer] Parsing IMG_1597.pdf...

[2021-06-02 10:36:39,660] [DEBUG] [paperless.consumer] Parsing IMG_1597.pdf...

[2021-06-02 10:36:39,927] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/IMG_1597.pdf

[2021-06-02 10:36:39,936] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/IMG_1597.pdf

[2021-06-02 10:36:40,111] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/IMG_1597.pdf', 'output_file': '/tmp/paperless/paperless-0tl_vx_r/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'fra+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-0tl_vx_r/sidecar.txt'}

[2021-06-02 10:36:40,112] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/IMG_1597.pdf', 'output_file': '/tmp/paperless/paperless-ugsmlyom/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'fra+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-ugsmlyom/sidecar.txt'}

[2021-06-02 10:36:49,429] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file

[2021-06-02 10:36:49,430] [DEBUG] [paperless.consumer] Generating thumbnail for IMG_1597.pdf...

[2021-06-02 10:36:49,435] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-0tl_vx_r/archive.pdf[0] /tmp/paperless/paperless-0tl_vx_r/convert.png

[2021-06-02 10:36:49,990] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file

[2021-06-02 10:36:49,992] [DEBUG] [paperless.consumer] Generating thumbnail for IMG_1597.pdf...

[2021-06-02 10:36:49,998] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-ugsmlyom/archive.pdf[0] /tmp/paperless/paperless-ugsmlyom/convert.png

[2021-06-02 10:36:55,143] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-ugsmlyom/convert.png -out /tmp/paperless/paperless-ugsmlyom/thumb_optipng.png

[2021-06-02 10:36:55,144] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-0tl_vx_r/convert.png -out /tmp/paperless/paperless-0tl_vx_r/thumb_optipng.png

[2021-06-02 10:37:02,201] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2021-06-02 10:37:02,215] [DEBUG] [paperless.consumer] Saving record to database

[2021-06-02 10:37:02,278] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.

[2021-06-02 10:37:02,284] [DEBUG] [paperless.consumer] Saving record to database

[2021-06-02 10:37:07,675] [DEBUG] [paperless.consumer] Deleting file /usr/src/paperless/src/../consume/IMG_1597.pdf

[2021-06-02 10:37:07,914] [ERROR] [paperless.consumer] The following error occured while consuming IMG_1597.pdf: database is locked

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 423, in execute

    return Database.Cursor.execute(self, query, params)

sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 290, in try_consume_file

    mime_type=mime_type

  File "/usr/src/paperless/src/documents/consumer.py", line 389, in _store

    storage_type=storage_type

  File "/usr/local/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 453, in create

    obj.save(force_insert=True, using=self.db)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 727, in save

    force_update=force_update, update_fields=update_fields)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 765, in save_base

    force_update, using, update_fields,

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 868, in _save_table

    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 908, in _do_insert

    using=using, raw=raw,

  File "/usr/local/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method

    return getattr(self.get_queryset(), name)(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 1270, in _insert

    return query.get_compiler(using=using).execute_sql(returning_fields)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql

    cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 66, in execute

    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers

    return executor(sql, params, many, context)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__

    raise dj_exc_value.with_traceback(traceback) from exc_value

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 423, in execute

    return Database.Cursor.execute(self, query, params)

django.db.utils.OperationalError: database is locked

[2021-06-02 10:37:07,934] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-ugsmlyom

[2021-06-02 10:37:10,111] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-0tl_vx_r

[2021-06-02 10:37:10,112] [INFO] [paperless.consumer] Document 2021-06-02 IMG_1597 consumption finished

Relevant information

Cartache commented 3 years ago

By importing some documents I got errors during the consume process. I tried to reimport the failed documents by opening each document in PDF-Xchange Editor and saving the new pdf with a different name in the same consume folder. This procedure triggers a double import/consume that ends with an error and a success.

image image

Here is my configuration

chk86 commented 3 years ago

Hi @Cartache - I'm having a similar issue, however, it occurs even when I just copy and paste any pdf into the consume folder. That also seems to create two import jobs, one of which fails and the other succeeds.

JeremyMorel commented 2 years ago

Hey there. Was just chatting about similar earlier today over in #910. I suspect that this is specific to docker. I have a pretty good guess at what's going on... basically, documents are copied to containers in "layers" and as such, the consumer is briefly seeing two copies of the document , and spawning two sets of tasks. (see here.) Perhaps I need to see if there's a way to alter behaviors for certain folders or mount points, or somehow get the consumer to wait for the copy to truly be done before it starts the intake process. Any ideas?

JeremyMorel commented 2 years ago

More info... might be going down the wrong path, but looks like we're supposed to use a docker-specific copy command (docker cp) to move things in and out of the container. I'll mess with this a bit and report back.

JeremyMorel commented 2 years ago

Ok. So this is exactly what's going on. One snag, however, is that the docker cp command doesn't appropriately handle permissions, so when I copy directly into the consume folder, it fails due to document permissions.

dcpcomman dcperror

The fix (for now) is to use docker cp to copy to the container into somewhere that is NOT the consume folder. After that, you can use a conventional cp or mv command inside the container to get it into the consume folder. And voila, it works! Now I suppose I just need to create a consumer for the consumer... I'll sleep on it.

justone
JeremyMorel commented 2 years ago

OK - I was able to resolve this issue by implementing PAPERLESS_CONSUMER_POLLING and setting the value to 5. It appears that this will disable inotify, which was detecting duplicates while the copy operation was happening. I don't know if 5 is a reasonable number or not, so I'd welcome anyone's input.

chk86 commented 2 years ago

@JeremyMorel I was just able to try your solution and can confirm that this resolved the issue I was experiencing.

RefineryX commented 2 years ago

This appears to be causing me issues also and flooded with errors. Is adding PAPERLESS_CONSUMER_POLLING=5 the only and recommended method?

davemachado commented 2 years ago

I'm also facing this issue.

Now I suppose I just need to create a consumer for the consumer... I'll sleep on it.

@JeremyMorel I'm happy to help work on a solution if you've slept on this and have some ideas on a more correct path forward here :)

davemachado commented 2 years ago

OK - I was able to resolve this issue by implementing PAPERLESS_CONSUMER_POLLING and setting the value to 5.

Any chance there's a management utility command (similar to document_importer) or REST API for triggering the consumer on-demand? If not, maybe I could help with this? I could see a medium term-solution (for my use-case, at least) where I cp the new documents into the consume directory then trigger the consumer.

That would be better than polling all day long when docs are only copied in the consume directory a few times a day.