jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Umlaut in filename prevents processed PDF from being displayed #838

Open benjaminfrank opened 3 years ago

benjaminfrank commented 3 years ago

Describe the bug

I thought I had read about this somewhere already, but I could not find an issue via search, hence opening a new one.

Filenames with umlauts prevent the processed file from being displayed. In my case, setting PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title} caused the bug to appear. The file does exist, though: if I remove the umlaut from the title, the file displays as expected. The issue also only appears when the consumed document's filename contains an umlaut. If I later change the title to contain the umlaut, the PDF is still displayed.

In the second screenshot, the two files are from the target folder where paperless-ng stores the processed PDFs. Förmlich.pdf is the uploaded file that contained an umlaut. Förmlich 2.pdf also contained an umlaut, but I first removed it and then added it back; in this case the PDF was visible after the removal as well as after adding it back via the title. Thumbnails work in both cases. Looking at the debugging tools in FF, I can see that the document is properly requested and also returned to the UI (see screenshot 3).

To Reproduce

Steps to reproduce the behavior:

  1. Configure paperless-ng with PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
  2. Add a file with an umlaut to paperless-ng, e.g., Förmlich.pdf
  3. Wait until processed
  4. Click edit on new document
  5. In the details page the PDF window does not appear.

Expected behavior

The PDF should be displayed.

Screenshots

Screenshot 2021-03-28 at 12 22 56
Screenshot 2021-03-28 at 12 30 14
Screenshot 2021-03-28 at 12 36 10

Webserver logs

[2021-03-28 10:21:04,783] [INFO] [paperless.consumer] Consuming Förmlich.pdf
[2021-03-28 10:21:04,785] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-03-28 10:21:04,796] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-03-28 10:21:04,802] [DEBUG] [paperless.consumer] Parsing Förmlich.pdf...
[2021-03-28 10:21:04,954] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-tetyu5bs
[2021-03-28 10:21:05,220] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-tetyu5bs', 'output_file': '/tmp/paperless/paperless-cb8h8gvy/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-cb8h8gvy/sidecar.txt'}
[2021-03-28 10:22:19,335] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-03-28 10:22:19,336] [DEBUG] [paperless.consumer] Generating thumbnail for Förmlich.pdf...
[2021-03-28 10:22:19,351] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-cb8h8gvy/archive.pdf[0] /tmp/paperless/paperless-cb8h8gvy/convert.png
[2021-03-28 10:22:19,383] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-03-28 10:22:22,190] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-cb8h8gvy/gs_out.png /tmp/paperless/paperless-cb8h8gvy/convert.png
[2021-03-28 10:22:22,534] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-cb8h8gvy/convert.png -out /tmp/paperless/paperless-cb8h8gvy/thumb_optipng.png
[2021-03-28 10:22:44,542] [DEBUG] [paperless.consumer] Saving record to database
[2021-03-28 10:22:44,687] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-tetyu5bs
[2021-03-28 10:22:45,550] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-cb8h8gvy
[2021-03-28 10:22:45,553] [INFO] [paperless.consumer] Document 2021-03-28 Förmlich consumption finished

Relevant information

jonaswinkler commented 3 years ago

So, as far as I can tell, everything is working except for the Content-Disposition header, and that causes the file to not display properly. I'm getting normal behavior over here when adding files with umlauts and the filename format enabled. Let's see...

Also, there's an option in the web settings to switch the PDF viewer. Could you try that please?
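For anyone following along, here is a rough sketch of how a Content-Disposition value can carry a non-ASCII filename such as Förmlich.pdf. It follows the RFC 6266 / RFC 5987 pattern of a plain ASCII `filename` fallback plus a percent-encoded `filename*` parameter. This is not paperless-ng's actual code; the function name `content_disposition` and its parameters are made up for illustration.

```python
import unicodedata
from urllib.parse import quote


def content_disposition(filename: str, disposition: str = "inline") -> str:
    """Build a Content-Disposition value that stays pure ASCII.

    Hypothetical helper for illustration only, not part of paperless-ng.
    """
    # ASCII fallback for clients that ignore filename*: decompose accented
    # characters (NFKD) and drop everything that is not ASCII.
    fallback = (
        unicodedata.normalize("NFKD", filename)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Quotes and backslashes would break the quoted-string, so replace them.
    fallback = fallback.replace("\\", "_").replace('"', "_")

    # Extended parameter: the UTF-8 bytes of the name, percent-encoded.
    encoded = quote(filename, safe="")

    return f"{disposition}; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


if __name__ == "__main__":
    print(content_disposition("Förmlich.pdf"))
```

The fallback simply drops the diacritic (ö becomes o); clients that understand `filename*` should see the original name.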

benjaminfrank commented 3 years ago

Sorry, didn't see your edit. Updated to 1.4, but the problem remains: the thumbnail is there, the PDF is not. Switching the PDF viewer fixes the issue and the PDF is displayed properly.

tcurdt commented 3 years ago

I was about to open a feature request for this. IMO it would be nice if the filename on disk used only the ASCII charset, maybe even without special characters.
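To make the ASCII-only idea concrete, something along these lines could work. It is only a sketch under my own assumptions: `ascii_filename` is a hypothetical helper, not an existing paperless-ng option, and it strips diacritics instead of doing German-style transliteration (ö becomes o, not oe).

```python
import os
import re
import unicodedata


def to_ascii(text: str) -> str:
    # Decompose accented characters, then drop anything outside ASCII.
    return (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )


def ascii_filename(name: str, default: str = "document") -> str:
    """Reduce a filename to ASCII letters, digits, '.', '_' and '-'.

    Hypothetical helper for illustration; not part of paperless-ng.
    """
    stem, ext = os.path.splitext(name)
    stem = to_ascii(stem)
    # Collapse every run of other characters (spaces, punctuation) into "_".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_")
    # Fall back to a placeholder if nothing survives, e.g. for CJK names.
    return (stem or default) + to_ascii(ext)


if __name__ == "__main__":
    print(ascii_filename("Förmlich 2.pdf"))
```

One catch with this approach: two different titles can collapse to the same ASCII name, so the storage layer would still need some way to keep names unique.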