Burschi500 closed this issue 3 years ago
I've spotted a couple issues with PDF text extraction that might be related to this. Could you confirm that your ignored files get ignored again on subsequent uploads?
No, repeating the upload gets all files added. It seems to happen mainly with mobile uploads, but I also saw it (in trial runs) when uploading local files through the website. Putting the files directly into the consume directory works better, but still not reliably. I will try again later. The worst thing is that you don't get any feedback about the missing files, and especially in the initial setup phase, when you need to upload piles of .pdf files, it is hard to check whether every document was uploaded. By the way, I have this running in a Proxmox LXC container with said Ubuntu, and nginx as a proxy on the same host, if that helps.
It happened again just now; logs attached. The document Anmeldung_Ersatzfreizeit_X_B.pdf was not added. paperless-ng log:
[2021-05-14 12:22:53,157] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_A.pdf
[2021-05-14 12:22:53,161] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:22:53,174] [INFO] [paperless.consumer] Consuming Anmeldung_Ersatzfreizeit_X_B.pdf
[2021-05-14 12:22:53,178] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:22:53,192] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:22:53,192] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:22:53,196] [DEBUG] [paperless.consumer] Parsing Anmeldung_Ersatzfreizeit_X_A.pdf...
[2021-05-14 12:22:53,198] [DEBUG] [paperless.consumer] Parsing Anmeldung_Ersatzfreizeit_X_B.pdf...
[2021-05-14 12:23:09,390] [INFO] [paperless.consumer] Consuming Anmeldeformular Ersatzfreizeit X 2021.pdf
[2021-05-14 12:23:09,392] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-14 12:23:09,408] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-14 12:23:09,416] [DEBUG] [paperless.consumer] Parsing Anmeldeformular Ersatzfreizeit X 2021.pdf...
[2021-05-14 12:23:09,576] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-zyvnjzex
[2021-05-14 12:23:09,892] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-zyvnjzex', 'output_file': '/tmp/paperless/paperless-4q4rcm2f/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-4q4rcm2f/sidecar.txt'}
[2021-05-14 12:23:11,277] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2021-05-14 12:23:11,367] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-4q4rcm2f/archive.pdf
[2021-05-14 12:23:11,367] [DEBUG] [paperless.consumer] Generating thumbnail for Anmeldeformular Ersatzfreizeit X 2021.pdf...
[2021-05-14 12:23:11,385] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-4q4rcm2f/archive.pdf[0] /tmp/paperless/paperless-4q4rcm2f/convert.png
[2021-05-14 12:23:12,786] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-4q4rcm2f/convert.png -out /tmp/paperless/paperless-4q4rcm2f/thumb_optipng.png
[2021-05-14 12:23:13,913] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-6ce2429w
[2021-05-14 12:23:14,083] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-6ce2429w', 'output_file': '/tmp/paperless/paperless-dr8wpu_j/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-dr8wpu_j/sidecar.txt'}
[2021-05-14 12:23:15,951] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-14 12:23:16,050] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-zyvnjzex
[2021-05-14 12:23:16,056] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-4q4rcm2f
[2021-05-14 12:23:16,057] [INFO] [paperless.consumer] Document 2021-05-14 Anmeldeformular Ersatzfreizeit X 2021 consumption finished
[2021-05-14 12:23:33,979] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-14 12:23:33,980] [DEBUG] [paperless.consumer] Generating thumbnail for Anmeldung_Ersatzfreizeit_X_A.pdf...
[2021-05-14 12:23:33,986] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-dr8wpu_j/archive.pdf[0] /tmp/paperless/paperless-dr8wpu_j/convert.png
[2021-05-14 12:23:37,815] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-dr8wpu_j/convert.png -out /tmp/paperless/paperless-dr8wpu_j/thumb_optipng.png
[2021-05-14 12:23:44,302] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-14 12:23:44,332] [INFO] [paperless.handlers] Assigning document type Antrag to 2021-05-14 Anmeldung_Ersatzfreizeit_X_A
[2021-05-14 12:23:44,389] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-6ce2429w
[2021-05-14 12:23:44,395] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-dr8wpu_j
[2021-05-14 12:23:44,396] [INFO] [paperless.consumer] Document 2021-05-14 Anmeldung_Ersatzfreizeit_X_A consumption finished
nginx error.log
2021/05/14 05:23:55 [error] 148#148: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.154, server: , request: "GET /ws/status/ HTTP/1.1", upstream: "http://[::1]:8000/ws/status/", host: "192.168.20.236"
2021/05/14 10:22:41 [error] 147#147: *14 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/runtime.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/runtime.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:22:41 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/main.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/main.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:09 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "POST /api/documents/post_document/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/post_document/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:16 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/tags/?page=1&page_size=100000 HTTP/1.1", upstream: "http://[::1]:8000/api/tags/?page=1&page_size=100000", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:23:52 [error] 148#148: *45 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/document_types/?page=1&page_size=100000 HTTP/1.1", upstream: "http://[::1]:8000/api/document_types/?page=1&page_size=100000", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents"
2021/05/14 10:23:55 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/documents/?page=1&page_size=50&ordering=-archive_serial_number HTTP/1.1", upstream: "http://[::1]:8000/api/documents/?page=1&page_size=50&ordering=-archive_serial_number", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents"
2021/05/14 10:25:04 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "PUT /api/documents/300/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/300/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents/300"
2021/05/14 10:25:05 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/documents/301/suggestions/ HTTP/1.1", upstream: "http://[::1]:8000/api/documents/301/suggestions/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/documents/301"
2021/05/14 10:25:26 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /api/statistics/ HTTP/1.1", upstream: "http://[::1]:8000/api/statistics/", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:25:44 [error] 147#147: *15 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/runtime.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/runtime.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
2021/05/14 10:25:44 [error] 148#148: *18 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.30.150, server: , request: "GET /static/frontend/de-DE/polyfills.js HTTP/1.1", upstream: "http://[::1]:8000/static/frontend/de-DE/polyfills.js", host: "pm-doku.dmz", referrer: "http://pm-doku.dmz/dashboard"
Addition: in this case the document is not added, even after an additional upload. I raised client_max_body_size in nginx.conf, but it made no difference.
Sorry for the spam, but I think there is a recurring message in the ./admin/django_q/failure/ list:
bytes must be in range(0, 256) : Traceback (most recent call last):
  File "/opt/paperless/.local/lib/python3.8/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/opt/paperless/src/documents/tasks.py", line 74, in consume_file
    document = Consumer().try_consume_file(
  File "/opt/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 212, in parse
    text_original = self.extract_text(None, document_path)
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 120, in extract_text
    stripped = post_process_text(pdfminer_extract_text(pdf_file))
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/high_level.py", line 121, in extract_text
    interpreter.process_page(page)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 919, in execute
    (_, obj) = parser.nextobject()
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 567, in nextobject
    (pos, token) = self.nexttoken()
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 494, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/opt/paperless/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 452, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)
I think that's the actual problem...
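For context, here is a minimal sketch of how the last frame of the traceback can fail. It assumes (this is not confirmed in the thread, since the file was never shared) that the PDF contains a three-digit octal string escape larger than \377, such as \400. Three octal digits can encode values up to 511, but bytes() only accepts 0 to 255. The function names are hypothetical, and the lenient variant only illustrates one possible way to tolerate such input; it is not the fix used by pdfminer.six or paperless.

```python
def decode_octal_escape(octal_digits: str) -> bytes:
    """Mirror the failing expression from the traceback: bytes((int(oct, 8),))."""
    return bytes((int(octal_digits, 8),))


def decode_octal_escape_lenient(octal_digits: str) -> bytes:
    """Hypothetical lenient variant: keep only the low 8 bits of the value."""
    return bytes((int(octal_digits, 8) & 0xFF,))


if __name__ == "__main__":
    # "\101" is octal for 65, i.e. the letter A.
    print(decode_octal_escape("101"))

    # "\400" is octal for 256, which does not fit into a single byte.
    try:
        decode_octal_escape("400")
    except ValueError as exc:
        print("ValueError:", exc)

    # Masking discards the overflow instead of raising.
    print(decode_octal_escape_lenient("400"))
```

Whether dropping the overflow is the right behaviour for a given document is a separate question; the point is only that a single out-of-range escape is enough to abort text extraction for the whole file.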
Well, this is an issue with https://github.com/pdfminer/pdfminer.six. It's what paperless uses to extract text from PDF documents.
I've got a workaround in mind. If you're able to share the file, you should open an issue over there.
No, as these contain private data, I won't share them. The worst thing is that I don't get notified that this .pdf was not added, so I have to go through all my uploads (also in the future, I guess) to make sure that all documents get added. Not ideal. Furthermore, the .pdf got deleted from the /consume directory, which means it is lost. So:
1) Documents don't get added/archived
2) No notification / warning
3) .pdf lost / deleted
That means if you scan a document with the mobile app, for example, and then throw the original away on the assumption that it has already been archived, it's lost forever.
I think the situation could not be worse regarding this issue. :(
> The worst thing is that I don't get notified that this .pdf was not added
Well yes, that's the deal with uncaught exceptions. I've now seen a couple issues with pdfminer, maybe I have to switch to something else. On the other hand, PDF is a pretty wild format and I doubt there's a text extraction library that's able to deal with every edge case.
I've changed the code to catch all pdfminer exceptions, intended and unintended ones. Worst case, a document might get added with its "content" field unpopulated, but the consumption process won't fail anymore due to the above issue.
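A minimal sketch of the kind of guard described above, not the actual paperless code. The function name extract_text_safely and the extractor parameter are assumptions made for illustration; the extractor is passed in so that the sketch runs without pdfminer.six installed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("paperless.parsing.tesseract")


def extract_text_safely(
    pdf_path: str, extractor: Callable[[str], str]
) -> Optional[str]:
    """Return the extracted text, or None if extraction fails for any reason."""
    try:
        text = extractor(pdf_path)
    except Exception as exc:
        # Catch intended and unintended extractor errors alike, so that a
        # single malformed PDF cannot abort the whole consumption task.
        logger.warning("Could not extract text from %s: %s", pdf_path, exc)
        return None

    stripped = text.strip()
    # Treat whitespace-only output the same as "no text found".
    return stripped or None
```

With pdfminer.six installed, the extractor would presumably be pdfminer.high_level.extract_text, e.g. extract_text_safely(path, extract_text). A None result corresponds to the worst case mentioned above: the document is still added, just with an empty "content" field.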
> Furthermore, the .pdf got deleted from the /consume directory, which means it is lost.
I was unable to observe this behavior. If you use the web frontend or the mobile apps, the uploaded files never get into the consume directory in the first place.
Next update soon, probably today/tomorrow.
Thanks for taking care.
Maybe an independent process could verify that documents were actually added. In the log there was a message about the document being parsed, but the message about it being added is missing. Maybe you can use that; even if the notification about the failed addition only arrives much later (an hour?), it could still be useful...
Thanks again for addressing the problem so quickly (and also for the excellent manual, which is not often seen in projects).
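The idea could be sketched roughly as follows, assuming the log line formats shown earlier in this thread ("Consuming <file>" and "Document <title> consumption finished"). The function name find_unfinished and the matching heuristic (a finished title ends with the file name minus its extension) are assumptions for illustration; this is not an existing paperless feature, and the heuristic would miss documents whose title was changed during consumption.

```python
import re
from pathlib import PurePath
from typing import Iterable, List

# Patterns based on the log excerpts posted in this thread.
CONSUMING = re.compile(r"\[paperless\.consumer\] Consuming (?P<name>.+)$")
FINISHED = re.compile(
    r"\[paperless\.consumer\] Document (?P<title>.+) consumption finished$"
)


def find_unfinished(log_lines: Iterable[str]) -> List[str]:
    """Return file names that were picked up but never reported as finished."""
    started: List[str] = []
    finished: List[str] = []
    for raw in log_lines:
        line = raw.rstrip("\n")
        match = CONSUMING.search(line)
        if match:
            started.append(match.group("name"))
            continue
        match = FINISHED.search(line)
        if match:
            finished.append(match.group("title"))

    # Heuristic: a finished title looks like "<date> <file name without suffix>".
    return [
        name
        for name in started
        if not any(title.endswith(PurePath(name).stem) for title in finished)
    ]
```

Run against the log excerpt above, such a check would flag Anmeldung_Ersatzfreizeit_X_B.pdf, the one file with a "Consuming" line but no matching "consumption finished" line.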
Alright, 1.4.3 is currently building. I hope this will resolve your issue.
Thank you, 1.4.3 fixed it. I still get a warning in the log, but the document gets added nevertheless. The text is also recognized...
The warning is there because this is not a fix, but rather a workaround for an issue with pdfminer.six.
Describe the bug
When adding multiple documents (.pdf) via the web interface, some files are sometimes ignored or dropped, seemingly at random. Unfortunately, I found no hint in the logs.
To Reproduce
Upload more than 10 files
I know this is very vague, but maybe I can get an idea of how to solve this. Could it be some sort of timeout?