[BUG] Documents getting skipped

Burschi500 commented 3 years ago

Describe the bug: When adding multiple documents (.pdf) via the web interface, sometimes random files are ignored or dropped. I found no hint in the logs, unfortunately.

To Reproduce: Upload more than 10 files

Host OS: Ubuntu 20.04, bare metal install
Guest OS: EndeavourOS (Arch based)
Browser: Firefox 88.0.1
Installation method: bare metal
nginx reverse proxy

I know this is very vague, but maybe I can get an idea of how to solve this. Could it be some sort of timeout?

jonaswinkler commented 3 years ago

I've spotted a couple issues with PDF text extraction that might be related to this. Could you confirm that your ignored files get ignored again on subsequent uploads?

Burschi500 commented 3 years ago

No, repeating the upload gets all files added. It seems to happen mainly with mobile uploads, but in trial runs I also saw it when uploading local files through the website. Uploading by putting the files directly into the consume directory works better, but is still not reliable. I will try again later. The worst thing is that you don't get any feedback about the missing files, and especially in the initial setup phase, when you need to upload piles of .pdf files, it is hard to check whether every document got uploaded. By the way, I have this running in a Proxmox LXC container with said Ubuntu, and with nginx on the same host as a proxy, if that helps.

Burschi500 commented 3 years ago

It happened again just now, so I am appending the logs. The document Anmeldung_Ersatzfreizeit_X_B.pdf was not added.

paperless-ng log:

[2021-05-14 12:22:53,157] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_A.pdf
[2021-05-14 12:22:53,161] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:22:53,174] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_B.pdf
[2021-05-14 12:22:53,178] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:22:53,192] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:22:53,192] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:22:53,196] [DEBUG] [paperless.consumer] Parsing Anmeldung_Ersatzfreizeit_X_A.pdf...
[2021-05-14 12:22:53,198] [DEBUG] [paperless.consumer] Parsing Anmeldung_Ersatzfreizeit_X_B.pdf...
[2021-05-14 12:23:09,390] [INFO] [paperless.consumer] Consuming Anmeldeformular Ersatzfreizeit X 2021.pdf
[2021-05-14 12:23:09,392] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:23:09,408] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:23:09,416] [DEBUG] [paperless.consumer] Parsing Anmeldeformular Ersatzfreizeit X 2021.pdf...
[2021-05-14 12:23:09,576] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-zyvnjzex
[2021-05-14 12:23:09,892] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-zyvnjzex', 'output_file': '/tmp/paperless/paperless-4q4rcm2f/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-4q4rcm2f/sidecar.txt'}
[2021-05-14 12:23:11,277] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2021-05-14 12:23:11,367] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-4q4rcm2f/archive.pdf
[2021-05-14 12:23:11,367] [DEBUG] [paperless.consumer] Generating thumbnail for Anmeldeformular Ersatzfreizeit X 2021.pdf...
[2021-05-14 12:23:11,385] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-4q4rcm2f/archive.pdf[0] /tmp/paperless/paperless-4q4rcm2f/convert.png
[2021-05-14 12:23:12,786] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-4q4rcm2f/convert.png -out /tmp/paperless/paperless-4q4rcm2f/thumb_optipng.png
[2021-05-14 12:23:13,913] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-6ce2429w
[2021-05-14 12:23:14,083] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-6ce2429w', 'output_file': '/tmp/paperless/paperless-dr8wpu_j/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-dr8wpu_j/sidecar.txt'}
[2021-05-14 12:23:15,951] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-14 12:23:16,050] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-zyvnjzex
[2021-05-14 12:23:16,056] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-4q4rcm2f
[2021-05-14 12:23:16,057] [INFO] [paperless.consumer] Document 2021-05-14 Anmeldeformular Ersatzfreizeit X 2021 consumption finished
[2021-05-14 12:23:33,979] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-14 12:23:33,980] [DEBUG] [paperless.consumer] Generating thumbnail for Anmeldung_Ersatzfreizeit_X_A.pdf...
[2021-05-14 12:23:33,986] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-dr8wpu_j/archive.pdf[0] /tmp/paperless/paperless-dr8wpu_j/convert.png
[2021-05-14 12:23:37,815] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-dr8wpu_j/convert.png -out /tmp/paperless/paperless-dr8wpu_j/thumb_optipng.png
[2021-05-14 12:23:44,302] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-14 12:23:44,332] [INFO] [paperless.handlers] Assigning document type Antrag to 2021-05-14 Anmeldung_Ersatzfreizeit_X_A
[2021-05-14 12:23:44,389] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-6ce2429w
[2021-05-14 12:23:44,395] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-dr8wpu_j
[2021-05-14 12:23:44,396] [INFO] [paperless.consumer] Document 2021-05-14 Anmeldung_Ersatzfreizeit_X_A consumption finished

nginx error.log

2021/05/14 05:23:55 [error] 148#148: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.154, server: , request: "GET /ws/status/ HTTP/1.1", upstream: "http://[::1]:8000/ws/status/", host: "192.168.20.236"
2021/05/14 10:22:41 [error] 147#147: *14 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/runtime.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/runtime.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:22:41 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/main.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/main.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:09 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "POST /api/documents/post_document/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/post_document/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:16 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/tags/?page=1&page_size=100000 HTTP/1.1", upstream: "http://[::1]:8000/api/tags/?page=1&page_size=100000", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:52 [error] 148#148: *45 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/document_types/?page=1&page_size=100000 HTTP/1.1", upstream: "http://[::1]:8000/api/document_types/?page=1&page_size=100000", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents"
2021/05/14 10:23:55 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/documents/?page=1&page_size=50&ordering=-archive_serial_number HTTP/1.1", upstream: "http://[::1]:8000/api/documents/?page=1&page_size=50&ordering=-archive_serial_number", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents"
2021/05/14 10:25:04 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "PUT /api/documents/300/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/300/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents/300"
2021/05/14 10:25:05 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/documents/301/suggestions/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/301/suggestions/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents/301"
2021/05/14 10:25:26 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/statistics/ HTTP/1.1", upstream: "http://[::1]:8000/api/statistics/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:25:44 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/runtime.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/runtime.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:25:44 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/polyfills.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/polyfills.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"

Burschi500 commented 3 years ago

Addition: in this case the document is not added, even after an additional upload. I raised client_max_body_size in nginx.conf, but that made no difference.

Burschi500 commented 3 years ago

Sorry for the spam, but I think there is a recurring message in the ./admin/django_q/failure/ list:

bytes must be in range(0, 256) : Traceback (most recent call last):
  File "/opt/paperless/.local/lib/python3.8/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/opt/paperless/src/documents/tasks.py", line 74, in consume_file
    document = Consumer().try_consume_file(
  File "/opt/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 212, in parse
    text_original = self.extract_text(None, document_path)
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 120, in extract_text
    stripped = post_process_text(pdfminer_extract_text(pdf_file))
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/high_level.py", line 121, in extract_text
    interpreter.process_page(page)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 919, in execute
    (_, obj) = parser.nextobject()
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 567, in nextobject
    (pos, token) = self.nexttoken()
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 494, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 452, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)

I think that's the actual problem...
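
For context, the last frame of the traceback can be reproduced in isolation. A three-digit octal escape can encode values up to 0o777 (511), which does not fit into a single byte, so `bytes((value,))` raises exactly this ValueError. The sketch below is plain Python and not pdfminer.six code; the helper name `octal_escape_to_byte` and the `mask_overflow` switch are hypothetical, and masking the high-order bits is only one possible way to handle such input (as far as I know, the PDF specification says high-order overflow in octal escapes is to be ignored).

```python
def octal_escape_to_byte(octal_digits: str, mask_overflow: bool = False) -> bytes:
    """Convert the digits of an octal escape (e.g. "101") into a single byte."""
    value = int(octal_digits, 8)
    if mask_overflow:
        # Hypothetical handling: drop high-order bits so the value fits into one byte.
        value &= 0xFF
    # Same construction as in the traceback: bytes((value,)) raises ValueError
    # when value is outside range(0, 256).
    return bytes((value,))


print(octal_escape_to_byte("101"))  # b'A'

try:
    octal_escape_to_byte("777")  # 0o777 == 511, too large for one byte
except ValueError as exc:
    print("ValueError:", exc)

print(octal_escape_to_byte("777", mask_overflow=True))  # b'\xff'
```

This only illustrates why a PDF containing such an escape could trigger the error; whether the affected documents actually contain one cannot be confirmed without the files.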

jonaswinkler commented 3 years ago

Well, this is an issue with https://github.com/pdfminer/pdfminer.six. It's what paperless uses to extract text from PDF documents.

I've got a workaround in mind. If you're able to share the file, you should open an issue over there.

Burschi500 commented 3 years ago

No, as these contain private data, I won't share them. The worst thing is that I don't get notified that this .pdf was not added, so I have to go through all my uploads (also in the future, I guess) to make sure that all documents get added. Not ideal. Furthermore, the .pdf got deleted from the /consume directory, which means it is lost. So:

1) Documents don't get added/archived
2) No notification / warning
3) .pdf lost / getting deleted

That means if you scan a document with the mobile app, for example, and then throw it away on the assumption that it is already archived, it's lost forever.

I think the situation could not be worse regarding this issue. :(

jonaswinkler commented 3 years ago

> The worst thing is that I don't get notified that this .pdf was not added

Well yes, that's the deal with uncaught exceptions. I've now seen a couple issues with pdfminer, maybe I have to switch to something else. On the other hand, PDF is a pretty wild format and I doubt there's a text extraction library that's able to deal with every edge case.

I've changed the code to catch all pdfminer exceptions, intended and unintended ones. Worst case, a document might get added with its "content" field unpopulated, but the consumption process won't fail anymore due to the above issue.
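
A minimal sketch of what such a catch-all could look like. This is not the actual paperless-ng change: the function name `safe_extract_text`, the logger name, and the injected `extractor` callable are assumptions made for illustration. The extractor is passed in so the example runs without pdfminer.six installed.

```python
import logging
from typing import Callable, Optional

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("example.parser")


def safe_extract_text(extractor: Callable[[str], str], pdf_path: str) -> Optional[str]:
    """Run a text extractor and return None instead of raising on any error."""
    try:
        return extractor(pdf_path).strip()
    except Exception as exc:
        # Deliberately broad: covers documented parser errors as well as
        # unintended ones such as the ValueError from the traceback above.
        logger.warning("Error while getting text from PDF %s: %s", pdf_path, exc)
        return None


def broken_extractor(path: str) -> str:
    # Stand-in for a library call that fails on a malformed file.
    raise ValueError("bytes must be in range(0, 256)")


print(safe_extract_text(broken_extractor, "example.pdf"))  # None
```

With pdfminer.six installed, `pdfminer.high_level.extract_text` could be passed as the extractor. The trade-off is the one described above: the caller has to cope with missing text, but one malformed file no longer aborts the whole consumption task.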

> Furthermore, the .pdf got deleted from the /consume directory, which means it is lost.

I was unable to observe this behavior. If you use the web frontend or the mobile apps, the uploaded files never get into the consume directory in the first place.

Next update soon, probably today/tomorrow.

Burschi500 commented 3 years ago

Thanks for taking care of this.

Maybe an independent process for verifying that documents were actually added could be put in place. In the log there was a message about the document getting parsed, but the message about it being added is missing. Maybe you can use that; even if the notification about the failed addition of the document only arrives much later (an hour?), it could still be useful...
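
The idea could be sketched roughly as follows. This is not an existing paperless-ng feature; the function name `find_unfinished` and both regular expressions are assumptions based only on the log lines posted earlier in this thread, where each "Consuming <file>" line is normally followed later by a "Document <date> <title> consumption finished" line whose title ends with the file name minus its extension.

```python
import re
from pathlib import PurePath
from typing import List

# Patterns derived from the log excerpt in this thread (assumed format).
CONSUMING = re.compile(r"\[paperless\.consumer\] Consuming (?P<name>.+)$")
FINISHED = re.compile(
    r"\[paperless\.consumer\] Document (?P<title>.+) consumption finished$"
)


def find_unfinished(log_lines: List[str]) -> List[str]:
    """Return file names that were started but have no matching 'finished' line."""
    started: List[str] = []
    finished_titles: List[str] = []
    for raw in log_lines:
        line = raw.rstrip()
        match = CONSUMING.search(line)
        if match:
            started.append(match.group("name"))
            continue
        match = FINISHED.search(line)
        if match:
            finished_titles.append(match.group("title"))
    # Heuristic: a file counts as finished if some title ends with its stem.
    return [
        name
        for name in started
        if not any(title.endswith(PurePath(name).stem) for title in finished_titles)
    ]


sample = [
    "[2021-05-14 12:22:53,157] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_A.pdf",
    "[2021-05-14 12:22:53,174] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_B.pdf",
    "[2021-05-14 12:23:44,396] [INFO] [paperless.consumer] Document 2021-05-14 Anmeldung_Ersatzfreizeit_X_A consumption finished",
]
print(find_unfinished(sample))  # ['Anmeldung_Ersatzfreizeit_X_B.pdf']
```

The matching is a heuristic: it would misreport documents that are still being processed or whose title no longer ends with the file name, so a real check would need a more reliable link between upload and document.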

Thanks again for addressing the problem so quickly (and also for the excellent manual, which is not seen that often in projects).

jonaswinkler commented 3 years ago

Alright, 1.4.3 is currently building. I hope this will resolve your issue.

Burschi500 commented 3 years ago

Thank you, 1.4.3 fixed it. I still get a warning in the log, but the document gets added nevertheless. The text is also recognized...

jonaswinkler commented 3 years ago

The warning is there because this is not a fix, but rather a workaround for an issue with pdfminer.six.

jonaswinkler / paperless-ng

[BUG] Documents getting skipped #1007