jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

plain text files are not being consumed (OSError: cannot open resource) #197

Closed: s-oliver closed this issue 3 years ago

s-oliver commented 3 years ago

I'm currently evaluating your project and so far, I'm liking it a lot, thank you for providing it!

I just stumbled across an issue with plain text files, though. From reading the docs and looking at closed GitHub issues on this topic, I assumed I could just add arbitrary text files and they would show up as documents. However, when I added my first .txt file, it appeared to get stuck somewhere in the consumption process. The logs only show this:

12/27/20, 11:44 PM DEBUG Parsing gitlab-recovery-codes.txt...
12/27/20, 11:44 PM DEBUG Parser: TextDocumentParser based on mime type text/plain
12/27/20, 11:44 PM INFO Consuming gitlab-recovery-codes.txt

No thumbnail was generated. The file's content is (as the file name states) a list of GitLab account recovery codes: 10 lines, each in the format ^[a-f0-9]{16}$. So nothing special, I assume.

Looking further at the Docker logs, they say a resource could not be opened (I added the file through the UI, just like a bunch of files before):

23:44:53 [Q] INFO Process-1:1 processing [gitlab-recovery-codes.txt]
INFO 2020-12-27 23:44:53,796 loggers Consuming gitlab-recovery-codes.txt
23:44:54 [Q] ERROR Failed [gitlab-recovery-codes.txt] - cannot open resource : Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
    override_tag_ids=override_tag_ids)
  File "/usr/src/paperless/src/documents/consumer.py", line 135, in try_consume_file
    self.path, mime_type)
  File "/usr/src/paperless/src/documents/parsers.py", line 235, in get_optimised_thumbnail
    thumbnail = self.get_thumbnail(document_path, mime_type)
  File "/usr/src/paperless/src/paperless_text/parsers.py", line 27, in get_thumbnail
    layout_engine=ImageFont.LAYOUT_BASIC)
  File "/usr/local/lib/python3.7/site-packages/PIL/ImageFont.py", line 836, in truetype
    return freetype(font)
  File "/usr/local/lib/python3.7/site-packages/PIL/ImageFont.py", line 833, in freetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/usr/local/lib/python3.7/site-packages/PIL/ImageFont.py", line 194, in __init__
    font, size, index, encoding, layout_engine=layout_engine
OSError: cannot open resource

Any idea what's happening here?

jonaswinkler commented 3 years ago

OH.

Thank you for reporting. The plain text parser creates a thumbnail by printing the content of the file onto a picture, using a generic serif font. However, I need to ship that font with the Docker image. Of course it runs fine when testing (the font is installed on all test machines). That's the missing-file error you're getting: the resource that cannot be opened is the font file, not your document.

I'll fix that with the next release.
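For future readers: the failure mode described above (a font file that exists on test machines but not in the container) can be guarded against with a lookup plus a fallback. The following is only a minimal sketch, not the fix that went into paperless-ng. The helper names `first_readable_font` and `load_thumbnail_font`, and the candidate font paths, are hypothetical; the only Pillow calls used are `ImageFont.truetype` and `ImageFont.load_default`.

```python
import os
from typing import Iterable, Optional


def first_readable_font(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that is an existing, readable file.

    Returns None when no candidate qualifies, so the caller can pick a
    fallback instead of hitting "OSError: cannot open resource".
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None


def load_thumbnail_font(candidates: Iterable[str], size: int = 20):
    """Load a TrueType font for thumbnail rendering, with a safe fallback.

    Pillow is imported lazily so the lookup helper above stays usable
    without it. If no candidate can be loaded, Pillow's built-in bitmap
    font is returned; it is less pretty, but it is always available.
    """
    from PIL import ImageFont  # third-party: pip install Pillow

    path = first_readable_font(candidates)
    if path is None:
        return ImageFont.load_default()
    try:
        return ImageFont.truetype(path, size)
    except OSError:
        # The file exists but FreeType could not open it as a font.
        return ImageFont.load_default()


# Hypothetical candidate list; adjust to the fonts present in your image.
FONT_CANDIDATES = [
    "/usr/share/fonts/truetype/liberation/LiberationSerif-Regular.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSerif.ttf",
]
```

Calling `load_thumbnail_font(FONT_CANDIDATES)` then returns a usable font object even inside an image with no fonts installed, so consumption of the document would not fail just because the thumbnail font is missing.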

s-oliver commented 3 years ago

Thanks for the quick reply and fix in the next release.

Small FYI for you and future googlers: I'm running this on a Synology DiskStation 1513+ (Intel Atom processor), which uses the proprietary Linux distribution "DSM" (DiskStation Manager). After installing its Docker package (https://www.synology.com/en-global/dsm/packages/Docker), paperless-ng runs absolutely fine from the provided release files, using Postgres as storage. Consumption is not super quick, but the speed is absolutely adequate. I'll see how it plays out with more documents (currently only about 100) should I decide to go all in and put ALL my paper folders into it (I'm still very scared of the work that would mean).

jonaswinkler commented 3 years ago

Can't do anything about OCR speed, sadly. I'm running with ~3000 documents, and it's working fine. Document lists with 100 items are a little slow, but there are lots of things to render.

Regarding making all documents digital: I never looked back.