the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0

Feature: FileUpload #266

Open dev-rke opened 7 years ago

dev-rke commented 7 years ago

Hi @danielquinn,

it would be great to have some in-built simple file upload. Then it would be easy to upload files from any device manually, e.g. mobile devices.

Further thoughts: once there is a working file upload, it would take less effort to add webcam capture via JavaScript, e.g. for mobile devices. I'd try to develop such a feature. Then you could take a photo of a document with your mobile browser and have it indexed in the application without needing a mobile app for that.

What do you think? Would it be possible to implement such a feature?

nebulade commented 7 years ago

Very similar to #196

danielquinn commented 7 years ago

Hello @dev-rke, those are some pretty elaborate ideas, but if you're willing to code them and they don't affect how Paperless works for other people, then I'd be happy to merge them.

The key to writing a different/better front-end is in my comment in #196:

For the UI, I don't think much is required since all you'd have to do is write a Django app and then inject that app into INSTALLED_APPS ahead of the documents app.

As Paperless is Django-based, you can modify/override its behaviour like any other Django project: add an app, and insert it into INSTALLED_APPS above the app(s) you want to override. After that, you can write whatever you want.
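
As a rough illustration of that ordering rule, the sketch below places a hypothetical `paperless_upload` app directly ahead of the documents app. The app name, the `insert_before` helper, and the shortened app list are assumptions made for this example; they are not Paperless's actual settings.

```python
# Illustrative only: a shortened app list, not the real Paperless settings.
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "documents",
]


def insert_before(apps, new_app, target):
    """Return a copy of `apps` with `new_app` placed directly ahead of `target`."""
    position = apps.index(target)  # raises ValueError if `target` is missing
    return apps[:position] + [new_app] + apps[position:]


# "paperless_upload" is a hypothetical app name used only for this sketch.
INSTALLED_APPS = insert_before(INSTALLED_APPS, "paperless_upload", "documents")
```

Django's app-directories template loader searches apps in `INSTALLED_APPS` order, so a template with the same path in the earlier app takes precedence over the one shipped by the later app.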

As for the webcam stuff, that sounds crazy/awesome but I'd caution you that OCR is a fickle mistress. Getting it to work on properly scanned documents is a crapshoot in itself, but taken with a webcam? You're asking for trouble. You have to worry about making sure the document is well (and evenly) lit, that there's no shake (causes blurring), and that the resolution is high enough to have something worth scanning. I mean, you're welcome to try, but you may find that you're building an elaborate system for a method that doesn't yield very good results.

The truth is, I never intended for Paperless to be so labour-intensive. I wrote it to passively consume documents as they appear in the consumption directory so I wouldn't have to sit there clicking on a form to upload each document. With that in mind, I'm not likely to spend time on adding such a feature, but I won't prevent someone else from doing it and issuing a PR so long as it doesn't break existing functionality.

To that end, I'm going to tag this as I won't do it, but contributions are welcome, and if you manage to hack something together, send me that PR so we can work out how to get your code into the project :-)

lenucksi commented 7 years ago

I just recently came across this: www.astrojack.com/scanning-and-ocr-ing-a-paper-receipt/ I think it came from a ticket in openpaperwork, a desktop-GUI-oriented DMS. Maybe that would ease some of the hassles in scan processing. Aside from that: definitely a nice idea.

maur commented 6 years ago

Android has some apps for that (Office Lens, for example) that crop the document out of the general camera view quite nicely. Many of them have an option to sync with Dropbox or OneDrive. So for myself, I think I'll add a script to synchronize that Dropbox/OneDrive folder with the Paperless watch directory. :) If you install such an app, you'll see it's way more complicated than just taking a photo, so why reinvent the wheel, at least if it's mostly for personal usage?

retog commented 6 years ago

It seems that the functionality is there: http://paperless.readthedocs.io/en/latest/consumption.html#consumption-http. The form seems to be missing though, but it should be possible to use the back end, e.g. with curl.

When posting a document I get a 202 response, here's the curl command I use:

curl -u root:mypassword -F "correspondent=me" -F "title=A test" -F "document=@test.pdf;type=application/pdf" http://host:8000/push

danielquinn commented 6 years ago

@retog is right, the basic functionality is there: a POST endpoint that will push a form submission to the consumer. The only thing to do is to add a form to the UI and include some validation to prevent people from trying to submit invalid titles & correspondent names.
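
For anyone who would rather script the upload than use curl, the request quoted above could be reproduced from Python roughly as follows. This is a minimal sketch using only the standard library; the field names (`correspondent`, `title`, `document`) and basic authentication are taken from the curl command, while the helper names `build_multipart` and `push_document` are made up for the example.

```python
import base64
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, content, content_type):
    """Encode text fields and one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8"))
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         f"Content-Type: {content_type}\r\n\r\n").encode("utf-8"))
    parts.append(content)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return boundary, b"".join(parts)


def push_document(url, user, password, correspondent, title, path):
    """POST a PDF to the push endpoint and return the HTTP status code."""
    with open(path, "rb") as handle:
        content = handle.read()
    boundary, body = build_multipart(
        {"correspondent": correspondent, "title": title},
        "document", os.path.basename(path), content, "application/pdf")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST", headers={
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Authorization": f"Basic {token}",
    })
    with urllib.request.urlopen(request) as response:
        return response.status
```

With the values from the curl example, the call would be `push_document("http://host:8000/push", "root", "mypassword", "me", "A test", "test.pdf")`.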

Unfortunately, I don't have a lot of time lately, so I'm going to tag this one help-wanted to see if there's interest. I'm happy to help with pointers/advice for anyone so inclined.

retog commented 6 years ago

@danielquinn I was excited after getting the 202 response, but unfortunately the document didn't actually show up in the Paperless UI. I'm wondering what went wrong. I also didn't have the time to investigate much, but maybe you have an idea?

danielquinn commented 6 years ago

The endpoint doesn't OCR the file on-demand. Rather it pushes the file into the consumption directory and names it according to the other parameters you supplied. The 202 response only means the file should now be in that directory, but you have to wait for the consumer to finish doing its thing.
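
To make the naming step concrete, here is a small sketch of what "names it according to the other parameters" might look like. The `<correspondent> - <title>.<extension>` pattern is inferred from the file names that show up later in this thread (e.g. `me - A test.pdf`); the function names are hypothetical and this is not the actual Paperless code.

```python
import os


def consumption_filename(correspondent, title, extension):
    """Compose a file name in the "<correspondent> - <title>.<ext>" pattern."""
    return f"{correspondent} - {title}.{extension}"


def save_for_consumer(consume_dir, correspondent, title, extension, content):
    """Write uploaded bytes into the consumption directory and return the path."""
    path = os.path.join(
        consume_dir, consumption_filename(correspondent, title, extension))
    with open(path, "wb") as handle:
        handle.write(content)
    return path
```

Nothing in this step touches OCR or the database; the file simply waits in the directory until the consumer picks it up.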

So, if you got the 202 and your consumer is running, then there's probably a bug somewhere, but if your consumer isn't running, or you just didn't give it enough time to do its job, then your file may be waiting for you now :-)

retog commented 6 years ago

Ok, then I'll have to check the configuration and the consumption directory in the docker container. The file is still not there now...

danielquinn commented 6 years ago

It sounds like you've found a bug, but to be sure, check on these things:

retog commented 6 years ago

@danielquinn, I found that my /consume directory in the docker container (I didn't mount a host volume here) does indeed contain the uploaded files:

-rw-r--r--    1 paperles paperles    472092 Mar 29 14:20 Max Muster - A test.pdf
-rw-r--r--    1 paperles paperles      1313 Mar 28 15:02 me - A test.jpg
-rw-r--r--    1 paperles paperles    472092 Mar 28 15:06 me - A test.pdf

I tried adding a file directly to the directory by downloading it there with wget (as user paperless), but I can't see any processing of the file. I'm not sure what a correct name is.

danielquinn commented 6 years ago

If the files are in there and getting placed properly, it sounds like the problem is the consumer service. I have new questions:

  1. Is the consumer service running?
  2. Is it consuming from the right directory?
  3. Does the user it's running as have read/write permissions to that directory?
  4. What's the output of the consumer? You should be able to get this with docker logs <container name>.

retog commented 6 years ago

Thanks @danielquinn, I definitely don't understand enough yet to get things running in rancher. I tried with docker-compose as per the manual and things work just fine. But how do the consumer and webserver interact? Is it just through the shared host directory, or do they attempt network communication and thus rely on the names given in the docker-compose file?

The log of the consumer in my rancher attempt ends with:

5/2/2018 1:57:01 PM  Applying sessions.0001_initial... OK
5/2/2018 2:25:54 PMsettings.PASSPHRASE is unset.  Input passphrase: 2018-05-02T12:25:56.797253004Z Operations to perform:
5/2/2018 2:25:56 PM  Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
5/2/2018 2:25:56 PMRunning migrations:
5/2/2018 2:25:56 PM  No migrations to apply.

The environment variable PAPERLESS_PASSPHRASE is however set for both consumer and webserver.

danielquinn commented 6 years ago

Communication between the consumer and webserver is very limited. Basically they're two separate processes that read/write to the database and that's it. So the consumer watches a directory, consumes what's there and writes to the db. The webserver just reads from the db and puts what it finds there on the screen. The two services don't really talk to each other at all.
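
The arrangement can be pictured with a toy model: one function stands in for the consumer and writes to a shared database, another stands in for the webserver and only reads from it. Everything here (the SQLite table layout, the function names, removing the file after recording it) is invented for illustration and is not how Paperless itself is implemented.

```python
import os
import sqlite3


def open_database(path):
    """Open the shared database and make sure the toy table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id INTEGER PRIMARY KEY, name TEXT, size INTEGER)")
    return db


def consume(db, consume_dir):
    """Consumer side: record every file in the directory, then remove it."""
    for name in sorted(os.listdir(consume_dir)):
        path = os.path.join(consume_dir, name)
        with open(path, "rb") as handle:
            size = len(handle.read())
        db.execute(
            "INSERT INTO documents (name, size) VALUES (?, ?)", (name, size))
        os.remove(path)  # part of the toy model, see the note above
    db.commit()


def list_documents(db):
    """Webserver side: only reads what the consumer wrote."""
    rows = db.execute("SELECT name FROM documents ORDER BY id")
    return [row[0] for row in rows]
```

The two functions never call each other; the database is the only thing they share, which is why a consumer that is stopped or misconfigured leaves the web UI empty without any error on the web side.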

As for your passphrase problem, that's a hard one. Can you jump into your docker instances and run env to see what you get? Maybe somewhere the environment variables aren't being passed in somehow:

$ docker-compose exec webserver env
$ docker-compose exec collector env
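
If the `env` output is long, a few lines of Python can do the checking. This is only a sketch: it assumes the output of `env` has been captured as text, and `missing_variables` is a hypothetical helper name. `PAPERLESS_PASSPHRASE` is the variable discussed in this thread.

```python
def missing_variables(env_output, required):
    """Return the required variable names that are absent or empty in `env` output."""
    present = {}
    for line in env_output.splitlines():
        name, separator, value = line.partition("=")
        if separator:  # skip lines that are not NAME=value pairs
            present[name] = value
    return [name for name in required if not present.get(name)]


# Example with made-up output from one container:
sample = "PATH=/usr/bin\nHOME=/root\n"
print(missing_variables(sample, ["PAPERLESS_PASSPHRASE"]))
```

Running the same check against the output of both containers shows quickly whether one of them was started without the variable.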

retog commented 6 years ago

Hi Daniel, you were right about the environment variables. Things are working very nicely now.

As for file upload, I found the easiest way is to expose the consume folder as a WebDAV share. After trying a dozen WebDAV images that didn't work (with Windows clients), I ended up with the following (using traefik as reverse proxy):


    webdav:
        image: telfix/pywebdav
        volumes:
           - ./consume:/share
        labels:
           - 'traefik.backend=webdav'
           - 'traefik.port=80'
           - 'traefik.frontend.rule=Host:dav.docs.example.com'
           - 'traefik.frontend.entryPoints=https'
           - 'traefik.frontend.auth.basic=[....]'

danielquinn commented 6 years ago

Interesting... I don't understand how WebDAV works, and don't know what that labels: section defines, but if you would like to contribute some documentation (just edit the files in the docs/ directory) to help people set up something similar, it would be appreciated.

retog commented 6 years ago

The labels configure traefik, a reverse proxy that takes care of Let's Encrypt certificates.