Open JonasGroeger opened 6 years ago
At which step would you like to implement this? In the bulk uploader JavaScript script?
I'm not sure. I thought you might be able to point me in the correct direction?
Actually, there is already a text extraction step in the file upload flow (OCR for images, PDFTextStripper for PDF, plain text for .txt files, ...).
However, this step happens only when the user uploads a file, which is obviously after the document creation. I don't know if automatically adding a tag to a document after a file is uploaded to it is a good idea (it might be confusing).
The text content of each file is saved in the database (for search purposes), so I think we could add a new REST endpoint suggesting tags for a document, based on the files currently uploaded to this document.
Something like: GET /document/{id}/tagsuggest returning an array of tags. Then it's the user's choice to accept the suggestion or not. Of course all of this can be automated by a script for bulk uploads.
What do you think of this?
Hello,
I'm also interested in such a feature.
However, even if it's technically harder to do, I think that having the tag suggestions before saving the document presents more benefits for the user. As a user, I want to tag the document while I'm creating it, not after I'm done.
Maybe the OCR process could be started as soon as something is dropped in the quick upload dropzone so that we have more time to process the file and look for the tags.
I like the way automatic tagging is done in paperless. They allow you to set up some kind of "matchers" that will create tags depending on text that is found in the document (explained here).
If we use matchers, I think the suggested tags will be more predictable, and we can add them automatically to documents, without user input. This way it solves the ordering issue (and the fact that OCR can take some time; I don't want users to just sit there for 20 seconds waiting for tag suggestions).
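To make the matcher idea concrete, here is a minimal sketch in Python, written only for illustration. The `Matcher` class, the `tags_for_text` function and the two sample rules are hypothetical names and data, not anything that exists in paperless or in this project; the sketch assumes the extracted text of a file is already available as a string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    """Hypothetical rule: apply `tag` when `pattern` is found in the text."""
    tag: str
    pattern: str  # regular expression searched in the extracted text


def tags_for_text(text, matchers):
    """Return the tags whose pattern occurs in `text`.

    Tags are returned in matcher order, without duplicates.
    Matching is case-insensitive because OCR output varies in casing.
    """
    tags = []
    for matcher in matchers:
        found = re.search(matcher.pattern, text, flags=re.IGNORECASE)
        if found and matcher.tag not in tags:
            tags.append(matcher.tag)
    return tags


# Sample rules, purely illustrative.
MATCHERS = [
    Matcher("invoice", r"\binvoice\b"),
    Matcher("insurance", r"\b(policy|insurance)\b"),
]

if __name__ == "__main__":
    print(tags_for_text("Invoice no. 2024-17 for your insurance policy", MATCHERS))
```

Because the rules are deterministic, they could run in the background once text extraction has finished, so nobody has to wait for a suggestion.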
Plus one for the matching solution for me too.
I am struggling with the workflow for mass uploading documents, and tagging based on the text found in them would be a great help in sorting through a mass of uploads.
I am new to the app, so I am not sure if I am missing the recommended workflow for uploading files?
I like the GET /document/{id}/tagsuggest approach. Our workflow is mass creation of documents with a custom script, or automatically creating the document from the scanner, and then editing them one by one manually after OCR processing. If there were a tag suggestion button on the edit form, that would be fine for our case.
A simple full-text search on the new document's full content could be a good start. I tried that on the web interface with different document types I have. I simply copied the whole text from /api/file/.../data?size=content of the new file to the search box in the UI. The results were just the related documents, which had the same tags I had to set manually for the new document. That can be repeated for all the files of the document. Suggesting the tags that most of the documents have in common would be great. Or the top X tags by count. Something like that.
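The "top X tags by count" step could look roughly like the Python sketch below. It assumes the full-text search has already returned the related documents and that their tags are available as lists of strings; `suggest_tags` and its parameters are hypothetical names used only for this example.

```python
from collections import Counter


def suggest_tags(similar_docs_tags, top_n=3):
    """Return the `top_n` most frequent tags among similar documents.

    `similar_docs_tags` holds one list of tag names per search result.
    Each tag counts once per document; ties are broken alphabetically
    so the result is stable between runs.
    """
    counts = Counter()
    for tags in similar_docs_tags:
        counts.update(set(tags))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


if __name__ == "__main__":
    search_results = [
        ["bank", "2023"],
        ["bank", "tax"],
        ["bank", "tax", "2023"],
    ]
    print(suggest_tags(search_results, top_n=2))
```

A variant could keep only tags present in more than half of the results, which is closer to "the tags that most of the documents have in common".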
This would be amazing to have, maybe even on top of nltk, i.e. using machine learning. Paperless-ngx does this, and it's a really good way to have the system learn over time, and lessen the workload on the users.
This is what is preventing me from switching from paperless to teedy. I already have tags set up in paperless based on regular expressions and having to reproduce those manually for 400+ documents if I migrated to teedy is a showstopper for me.
Hello!
I'd like to be able to predict tags when bulk uploading files.
A simple version could extract all text using PDFBox PDFTextStripper and match only words without numbers ([^\d\W]+). Then we could look in the existing documents for matching documents and voilà. No fancy machine learning for the first part. Of course this is just a simplification. I'm willing to implement this but I'd like to have some information on whether it will be accepted and where you'd start.
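As a rough illustration of this idea, the Python sketch below applies the `[^\d\W]+` pattern to text that is assumed to be already extracted, then picks the existing document with the largest word overlap. The Jaccard similarity, the `min_score` threshold and all function names are assumptions made for this example, not part of the proposal. Note that the pattern also matches underscores, since `_` counts as a word character.

```python
import re

# Runs of word characters that are not digits, as proposed above.
WORD_RE = re.compile(r"[^\d\W]+")


def words(text):
    """Return the set of lower-cased digit-free words in `text`."""
    return {w.lower() for w in WORD_RE.findall(text)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_tags(new_text, existing, min_score=0.3):
    """Return the tags of the most similar existing document.

    `existing` is a list of (text, tags) pairs. An empty list is
    returned when no document reaches `min_score`.
    """
    new_words = words(new_text)
    best_tags, best_score = [], 0.0
    for text, tags in existing:
        score = jaccard(new_words, words(text))
        if score > best_score:
            best_tags, best_score = list(tags), score
    return best_tags if best_score >= min_score else []


if __name__ == "__main__":
    existing = [
        ("Electricity bill March 2023 total 42 EUR", ["utilities"]),
        ("Rental contract apartment Berlin", ["housing"]),
    ]
    print(predict_tags("Electricity bill April 2024 total 57 EUR", existing))
```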