Open jase31 opened 2 years ago
Hello,
could you please post the logs? Depending on your language, they are under "Protokoll" or "Log" on your paperless-ng web page.
Most of the time there is an error description explaining why a document couldn't be imported.
If you go to "Administration" and then "failed Jobs", you will also see the reason why it hasn't been imported.
-Kai
Logs:
[2022-01-23 01:41:52,256] [INFO] [paperless.consumer] Consuming brake.pdf
[2022-01-23 01:41:53,044] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-23 01:41:59,654] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-23 01:41:59,722] [DEBUG] [paperless.consumer] Parsing brake.pdf...
[2022-01-23 01:42:21,818] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/brake.pdf
[2022-01-23 01:42:33,183] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/brake.pdf', 'output_file': '/tmp/paperless/paperless-4x_8sck6/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-4x_8sck6/sidecar.txt'}
[2022-01-23 02:15:25,860] [DEBUG] [paperless.classifier] Gathering data from database...
[2022-01-23 02:16:26,163] [DEBUG] [paperless.tasks] Training data unchanged.
Thanks for the pointer to "failed job". It looks like the files might be duplicates. Does paperless therefore automatically import only one copy of a document and reject the duplicates?
Hello,
that is correct. If a new document is recognized as a duplicate, it will not be consumed again.
Kai
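To illustrate what "recognized as a duplicate" can mean in practice, here is a minimal sketch that assumes duplicates are detected by hashing the raw file content. The names `file_checksum`, `is_duplicate` and `known_checksums` are hypothetical and only for illustration; this is not paperless-ng's actual code.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the MD5 hex digest of the file's raw bytes (assumed scheme)."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large PDFs do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_checksums: set[str]) -> bool:
    """True if a file with identical bytes has already been recorded."""
    return file_checksum(path) in known_checksums
```

Under this assumption the filename does not matter: a byte-identical copy is rejected, while a PDF that was reprinted or re-exported has different bytes and would count as a new document.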
If your problem has been solved, please close the issue. Thanks. Kai
I've just tried to import another file. I copied it to the consume directory and the process ran. However, the file does not appear to have been added to the system, and there is no entry under failed tasks. It seems to affect certain files (i.e. certain documents never import, even if opened and reprinted into a "new" PDF).
[2022-01-24 10:29:23,138] [INFO] [paperless.management.consumer] Adding /data/consume/HMRC (2021).pdf to the task queue.
[2022-01-24 10:29:27,471] [INFO] [paperless.consumer] Consuming HMRC (2021).pdf
[2022-01-24 10:29:27,635] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-01-24 10:29:28,771] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-01-24 10:29:28,791] [DEBUG] [paperless.consumer] Parsing HMRC (2021).pdf...
[2022-01-24 10:29:37,555] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/HMRC (2021).pdf
[2022-01-24 10:29:40,684] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/HMRC (2021).pdf', 'output_file': '/tmp/paperless/paperless-uc3mdxf0/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-uc3mdxf0/sidecar.txt'}
[2022-01-24 11:15:17,310] [DEBUG] [paperless.classifier] Gathering data from database...
[2022-01-24 11:15:27,144] [DEBUG] [paperless.tasks] Training data unchanged.
In admin, there is no entry under "failed" tasks.
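In both log excerpts the entry after "Calling OCRmyPDF" is an unrelated classifier line more than half an hour later, with no message in between. A minimal sketch for spotting such gaps in a pasted log, assuming the timestamp format shown in the logs above; `find_gaps` and the 10-minute default threshold are illustrative choices, not part of paperless-ng.

```python
import re
from datetime import datetime, timedelta

# Matches lines like:
# [2022-01-24 10:29:27,471] [INFO] [paperless.consumer] Consuming file.pdf
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] \[(\w+)\] \[([\w.]+)\] (.*)$"
)


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (message_before_gap, gap_length) for every silent period
    between consecutive log lines that is longer than `threshold`."""
    gaps = []
    previous = None  # (timestamp, message) of the last parsed line
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and stamp - previous[0] > threshold:
            gaps.append((previous[1], stamp - previous[0]))
        previous = (stamp, match.group(4))
    return gaps
```

Feeding it the lines of a saved log file (for example `find_gaps(open("paperless.log"))`, where the filename is a placeholder) lists the last message before each long silence, which narrows down the step to look at first.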
Describe the bug: Most files copied directly into the consume directory are imported, but certain PDFs remain in the consume directory and are never processed.
Expected behavior: Files added to the consume directory should be imported.