jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Documents (PDF) remain in the consume directory and do not get processed. #1566

Open jase31 opened 2 years ago

jase31 commented 2 years ago

Describe the bug Most files added to the consume directory are imported, but certain PDFs remain sitting in the consume directory and never disappear or get processed.

Expected behavior Files added to the consume directory should be imported.


Synology,

KaiBoos commented 2 years ago

Hello,

Can you please post the logs? Depending on your language, they are under "Protokoll" or "Log" on your paperless-ng webpage.

Most of the time, there is an error description explaining why the document couldn't be imported.

When you go to "Administration" and then "failed Jobs", you will also see the reason why it hasn't been imported.

-Kai

jase31 commented 2 years ago

Logs:

[2022-01-23 01:41:52,256] [INFO] [paperless.consumer] Consuming brake.pdf

[2022-01-23 01:41:53,044] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-23 01:41:59,654] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-23 01:41:59,722] [DEBUG] [paperless.consumer] Parsing brake.pdf...

[2022-01-23 01:42:21,818] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/brake.pdf

[2022-01-23 01:42:33,183] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/brake.pdf', 'output_file': '/tmp/paperless/paperless-4x_8sck6/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-4x_8sck6/sidecar.txt'}

[2022-01-23 02:15:25,860] [DEBUG] [paperless.classifier] Gathering data from database...

[2022-01-23 02:16:26,163] [DEBUG] [paperless.tasks] Training data unchanged.
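In the excerpt above, the last consumer-related entry is the "Calling OCRmyPDF" line, and the next entry arrives roughly 33 minutes later from a different logger. A minimal Python sketch for spotting such gaps follows. It assumes every entry uses the `[timestamp] [LEVEL] [logger] message` layout seen here. The names `parse_log` and `find_gaps` and the five-minute threshold are hypothetical choices for illustration, not part of paperless-ng.

```python
import re
from datetime import datetime, timedelta

# Matches entries such as:
# [2022-01-23 01:41:52,256] [INFO] [paperless.consumer] Consuming brake.pdf
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"\[(?P<level>\w+)\] \[(?P<logger>[\w.]+)\] (?P<msg>.*)$"
)


def parse_log(text):
    """Return (timestamp, level, logger, message) tuples for matching lines."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            entries.append((ts, match["level"], match["logger"], match["msg"]))
    return entries


def find_gaps(entries, threshold=timedelta(minutes=5)):
    """Return (gap, previous, next) for consecutive entries further apart than threshold."""
    gaps = []
    for prev, nxt in zip(entries, entries[1:]):
        delta = nxt[0] - prev[0]
        if delta > threshold:
            gaps.append((delta, prev, nxt))
    return gaps


# Log lines copied from the excerpt above.
SAMPLE = """
[2022-01-23 01:41:52,256] [INFO] [paperless.consumer] Consuming brake.pdf
[2022-01-23 01:41:53,044] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-23 01:41:59,654] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-23 01:41:59,722] [DEBUG] [paperless.consumer] Parsing brake.pdf...
[2022-01-23 01:42:21,818] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/brake.pdf
[2022-01-23 01:42:33,183] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/brake.pdf', 'output_file': '/tmp/paperless/paperless-4x_8sck6/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-4x_8sck6/sidecar.txt'}
[2022-01-23 02:15:25,860] [DEBUG] [paperless.classifier] Gathering data from database...
[2022-01-23 02:16:26,163] [DEBUG] [paperless.tasks] Training data unchanged.
"""

if __name__ == "__main__":
    for gap, prev, nxt in find_gaps(parse_log(SAMPLE)):
        print(f"{gap} of silence after [{prev[2]}], next entry from [{nxt[2]}]")
```

A gap like this only shows where the log goes quiet. It does not say whether OCR crashed, timed out, or is still running, so it is a starting point for further digging rather than a diagnosis.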

Thanks for the link to "failed job". It looks like the files might be duplicates. Does paperless therefore automatically import only one copy of a document and reject duplicates?

KaiBoos commented 2 years ago

Hello,

That is correct. If a new document is recognized as a duplicate, it will not be consumed again.

Kai
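To illustrate what such a duplicate check can look like, here is a minimal Python sketch. It assumes duplicates are detected by comparing a checksum of the file's bytes against checksums of documents already imported. The function names, the in-memory `known_checksums` set, and the choice of SHA-256 are all hypothetical and are not taken from paperless-ng's code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Hash the file contents in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()  # assumption: any stable content hash would do here
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path, known_checksums):
    """True if a file with identical bytes has already been recorded."""
    return file_checksum(path) in known_checksums


def demo():
    """Compare a byte-identical copy and a re-rendered file against one known document."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original.pdf"
        copy = Path(tmp) / "copy.pdf"
        reprint = Path(tmp) / "reprint.pdf"
        original.write_bytes(b"%PDF-1.4 example bytes")
        copy.write_bytes(b"%PDF-1.4 example bytes")
        reprint.write_bytes(b"%PDF-1.7 example bytes, re-rendered")
        known_checksums = {file_checksum(original)}
        return is_duplicate(copy, known_checksums), is_duplicate(reprint, known_checksums)


if __name__ == "__main__":
    print(demo())
```

Under this assumption, only byte-identical files count as duplicates. A document that has been reprinted into a new PDF has different bytes and would not be rejected by a check of this kind.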

KaiBoos commented 2 years ago

If your problem has been solved, please close the issue. Thanks. Kai

jase31 commented 2 years ago

I've just tried to import another file. I copied it to the consume directory and the process ran. However, the document does not appear to have been added to the system, and there is no entry under failed tasks. It seems to affect certain files (i.e., certain documents never import, even if loaded and reprinted into a "new" PDF).

[2022-01-24 10:29:23,138] [INFO] [paperless.management.consumer] Adding /data/consume/HMRC (2021).pdf to the task queue.

[2022-01-24 10:29:27,471] [INFO] [paperless.consumer] Consuming HMRC (2021).pdf

[2022-01-24 10:29:27,635] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-01-24 10:29:28,771] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-01-24 10:29:28,791] [DEBUG] [paperless.consumer] Parsing HMRC (2021).pdf...

[2022-01-24 10:29:37,555] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/HMRC (2021).pdf

[2022-01-24 10:29:40,684] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/HMRC (2021).pdf', 'output_file': '/tmp/paperless/paperless-uc3mdxf0/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-uc3mdxf0/sidecar.txt'}

[2022-01-24 11:15:17,310] [DEBUG] [paperless.classifier] Gathering data from database...

[2022-01-24 11:15:27,144] [DEBUG] [paperless.tasks] Training data unchanged.

In admin, there is no entry under "failed" tasks.
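Since the log goes quiet right after the "Calling OCRmyPDF" line and no failed task is recorded, one way to narrow things down is to rerun the same OCR step by hand. The sketch below assumes the logged dictionary maps directly onto keyword arguments of `ocrmypdf.ocr` (the Python entry point of OCRmyPDF), which is how the log message reads. The helper names `extract_ocr_args` and `run_ocr` are hypothetical. The paths in the log refer to locations inside the paperless container, so they would need to be replaced with paths that exist wherever the script runs.

```python
import ast

MARKER = "Calling OCRmyPDF with args: "


def extract_ocr_args(log_line):
    """Return the argument dict logged after MARKER, or None if the marker is absent."""
    _, found, tail = log_line.partition(MARKER)
    if not found:
        return None
    # The logged value is a Python dict literal, so literal_eval parses it safely.
    return ast.literal_eval(tail.strip())


def run_ocr(args):
    """Run OCRmyPDF with the given arguments (needs the third-party ocrmypdf package)."""
    import ocrmypdf  # imported lazily so the parsing above works without it

    return ocrmypdf.ocr(**args)


# Log line copied from the excerpt above.
SAMPLE_LINE = (
    "[2022-01-24 10:29:40,684] [DEBUG] [paperless.parsing.tesseract] "
    "Calling OCRmyPDF with args: {'input_file': '/data/consume/HMRC (2021).pdf', "
    "'output_file': '/tmp/paperless/paperless-uc3mdxf0/archive.pdf', "
    "'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', "
    "'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, "
    "'rotate_pages': True, 'rotate_pages_threshold': 12.0, "
    "'sidecar': '/tmp/paperless/paperless-uc3mdxf0/sidecar.txt'}"
)

if __name__ == "__main__":
    args = extract_ocr_args(SAMPLE_LINE)
    for key, value in args.items():
        print(f"{key} = {value!r}")
    # To actually rerun OCR, point the three paths at real files first, e.g.:
    # args.update(input_file="in.pdf", output_file="out.pdf", sidecar="out.txt")
    # run_ocr(args)
```

If the manual run hangs or fails on the same file, that points at the OCR step itself rather than at the consumer or the task queue.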