jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars, 355 forks

[Other] Use smart_open to allow for "non filesystem" storage of documents #1600

Open a17t opened 2 years ago

a17t commented 2 years ago

Hello everyone,

In search of a "personal document storage" I stumbled upon the original paperless a few years ago and used it for a while. A few weeks ago I picked up the challenge of digitizing my personal documents once again and found paperless-ng. So far, great stuff!

But I would like to abstract storage away from the raw filesystem. I have different things in mind, but as a starting point I thought about S3-compatible storage. While researching I found smart_open (https://github.com/RaRe-Technologies/smart_open), which provides a drop-in replacement for the open function that accepts URLs for different storage systems, S3 being one of them.

So I quickly integrated it into a fork of paperless-ng and voilà, paperless documents are stored in a local S3 storage.
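To illustrate the idea, here is a minimal sketch of the drop-in approach, not the actual code from the fork. The helper names `write_document` and `read_document` are hypothetical, and the fallback to the built-in `open` is only there so the sketch also runs where smart_open is not installed (local paths only in that case).

```python
import os
import tempfile

try:
    # smart_open.open accepts local paths as well as URLs such as s3://bucket/key
    from smart_open import open as storage_open
except ImportError:
    # Assumption: fall back to the built-in open so the sketch runs without
    # the library; only local paths work in that case.
    storage_open = open


def write_document(target, data):
    """Write bytes to a local path or, with smart_open, to a storage URL."""
    with storage_open(target, "wb") as fout:
        fout.write(data)


def read_document(target):
    """Read bytes back from a local path or a storage URL."""
    with storage_open(target, "rb") as fin:
        return fin.read()


if __name__ == "__main__":
    # Local round trip; an S3 URL would additionally need credentials.
    path = os.path.join(tempfile.mkdtemp(), "doc.pdf")
    write_document(path, b"example bytes")
    print(len(read_document(path)), "bytes read back")
```

Because the calling code only ever sees a path-or-URL string, switching the storage backend would, under these assumptions, come down to changing the configured target.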

@jonaswinkler What do you think about that change? Feel free to look into my fork; it should contain only a few changes.

I would really like to bring this back to mainline paperless-ng, especially with a view to introducing further storage backends (e.g. MS OneDrive or other managed cloud storage).

Looking forward to some discussion, and stay healthy

Simon

dwehrmann commented 2 years ago

Hi @a17t, this sounds like what I've been looking for. Great to see someone has already thought about S3-compatible storage adapters. Can you point me to your fork? I'd like to give it a spin on a test server here.

a17t commented 2 years ago

Hey @dwehrmann, feel free to take a look at my fork of paperless-ng. The branch smartOpen contains my (trivial) changes.

Please be aware, it's definitely in a PoC state, so do not use it for any important stuff.

To use S3 (or any storage provider supported by smart_open), just set the corresponding URL in the config. For now, only the existence checks have been updated to allow for S3. I just tested it by setting PAPERLESS_MEDIA_ROOT to an s3u:// URL, which was enough to have paperless-ng store all document data in the referenced S3 bucket.
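As a rough sketch of what "existence checks that allow for S3" could look like (this is not the code from the fork): the function names, the bucket name and the scheme list are assumptions for illustration, and the remote branch simply skips the local directory check instead of querying the bucket.

```python
import os
import posixpath
from urllib.parse import urlparse

# Assumption: S3-style URL schemes that should be treated as remote storage.
REMOTE_SCHEMES = {"s3", "s3a", "s3n", "s3u"}


def is_remote(root):
    """Return True if the configured media root is a storage URL."""
    return urlparse(root).scheme in REMOTE_SCHEMES


def media_path(root, *parts):
    """Join sub-paths onto the media root, for URLs and local paths alike."""
    if is_remote(root):
        return posixpath.join(root.rstrip("/"), *parts)
    return os.path.join(root, *parts)


def media_root_ok(root):
    """Existence check that tolerates remote roots."""
    if is_remote(root):
        # Sketch only: skip the local check; a real check would ask the bucket.
        return True
    return os.path.isdir(root)


if __name__ == "__main__":
    # Hypothetical value for PAPERLESS_MEDIA_ROOT
    root = os.environ.get("PAPERLESS_MEDIA_ROOT", "s3u://my-bucket/paperless")
    print(media_path(root, "documents", "originals"), media_root_ok(root))
```

Local paths keep their usual `os.path` handling here, so an existing filesystem setup would behave as before.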

funkybunch commented 2 years ago

@a17t Have you been able to build your fork into a Docker image? I'd love to run this alongside the original container to test things out. My Docker machine has limited storage, so I try to put everything on my MinIO machine. This S3 compatibility feature is exactly what I'm looking for.

a17t commented 2 years ago

@funkybunch I will upload an image to Docker Hub so you can try it. Please be aware, this is in a PoC state; it might break or do strange stuff.

a17t commented 2 years ago

https://hub.docker.com/r/a17t/paperless-ngx-s3

here you go

a17t commented 2 years ago

@funkybunch did it work for you? Did you experience any issues?