sismics / docs

Lightweight document management system packed with all the features you can expect from big expensive solutions
https://teedy.io
GNU General Public License v2.0

Predict Tags #234

Open JonasGroeger opened 6 years ago

JonasGroeger commented 6 years ago

Hello!

I'd like to be able to predict tags when bulk uploading files.

A simple version could extract all text using PDFBox PDFTextStripper and match for only words without numbers ([^\d\W]+). Then we could look in the existing documents for matching documents and voilà. No fancy machine learning for the first part.
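The word-matching idea can be sketched in plain Java as follows. This is only an illustration, not project code: the class name `WordOverlapSketch`, its method names, and the choice of Jaccard similarity as the "matching" measure are assumptions. It also takes the text as already extracted (for example by PDFTextStripper), so it needs no PDFBox dependency.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project.
final class WordOverlapSketch {
    // The pattern from the comment: runs of word characters that are not digits.
    // UNICODE_CHARACTER_CLASS makes \w and \d Unicode-aware, so "voilà" stays one word.
    private static final Pattern WORD =
            Pattern.compile("[^\\d\\W]+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns the distinct lower-cased words of the text, in first-seen order. */
    static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    /** Jaccard similarity of the two word sets: shared words / all distinct words. */
    static double similarity(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new LinkedHashSet<>(wa);
        common.retainAll(wb);
        int union = wa.size() + wb.size() - common.size();
        return (double) common.size() / union;
    }

    public static void main(String[] args) {
        // prints [invoice, for, acme, gmbh, total, eur]
        System.out.println(words("Invoice 2018 for ACME GmbH, total 42 EUR"));
        System.out.println(similarity("Invoice for ACME GmbH", "ACME invoice, 2nd reminder"));
    }
}
```

Note that the pattern also matches underscores and the letter part of mixed tokens (it yields `nd` for `2nd`), so a real implementation would probably want a minimum word length or a stop-word list.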

Of course this is just a simplification. I'm willing to implement this, but I'd like to know whether it would be accepted and where you'd start.

jendib commented 6 years ago

At which step would you like to implement this? In the bulk uploader Javascript script?

JonasGroeger commented 6 years ago

I'm not sure. I thought you might be able to point me in the correct direction?

jendib commented 6 years ago

Actually, there is already a text extraction step in the file upload flow (OCR for images, PDFTextStripper for PDF, plain text for .txt files, ...).

However, this step happens only when the user uploads a file, which is obviously after the document creation. I don't know if automatically adding a tag to a document after uploading a file to it is a good idea (it might be confusing).

The text content of each file is saved in the database (for search purposes), so I think we could add a new REST endpoint that suggests tags for a document, based on the files currently uploaded to it.

Something like: GET /document/{id}/tagsuggest returning an array of tags. Then it's the user's choice to accept the suggestion or not. Of course all of this can be automated by a script for bulk uploads.

What do you think of this?

kevynb commented 6 years ago

Hello,

I'm also interested in such a feature.

However, even if it's technically harder to do, I think that having the tag suggestions before saving the document presents more benefits for the user. As a user, I want to tag the document while I'm creating it, not after I'm done.

Maybe the OCR process could be started as soon as something is dropped in the quick upload dropzone, so that we have more time to process the file and look for the tags.

I like the way automatic tagging is done in paperless. They allow you to set up some kind of "matchers" that will create tags depending on text that is found in the document (explained here).

jendib commented 6 years ago

If we use matchers, I think the suggested tags will be more predictable, and we can add them to documents automatically, without user input. This solves the ordering issue (and the fact that OCR can take some time; I don't want users to just sit there for 20 seconds waiting for tag suggestions).
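A minimal sketch of what such matchers could look like, assuming each tag is paired with one regular expression that is tested against the extracted text. The class `TagMatcherSketch`, the method `matchTags`, and the example patterns are hypothetical; they describe neither paperless's implementation nor an existing API in this project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project.
final class TagMatcherSketch {
    /** Returns the tags whose pattern is found anywhere in the text, in map order. */
    static List<String> matchTags(String text, Map<String, Pattern> matchers) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : matchers.entrySet()) {
            if (entry.getValue().matcher(text).find()) {
                tags.add(entry.getKey());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Example rules only; real matchers would be configured per tag by the user.
        Map<String, Pattern> matchers = new LinkedHashMap<>();
        matchers.put("invoice", Pattern.compile("\\binvoice\\b", Pattern.CASE_INSENSITIVE));
        matchers.put("insurance",
                Pattern.compile("policy\\s+no\\.?\\s*\\d+", Pattern.CASE_INSENSITIVE));

        // prints [invoice]
        System.out.println(matchTags("INVOICE #17 from ACME", matchers));
    }
}
```

Because the rules are deterministic, the same text always yields the same tags, which is what makes it reasonable to apply them without user confirmation once text extraction has finished.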

mannp commented 5 years ago

Plus one for the matching solution for me too.

I am struggling with the workflow for mass uploading documents and tagging based on things it finds would be a great help in sorting through a mass of uploads.

I am new to the app, so I am not sure if I am missing the recommended workflow for uploading files?

terba commented 4 years ago

I like the GET /document/{id}/tagsuggest approach. Our workflow is mass-creating documents with a custom script, or automatically creating the document from the scanner, and then editing them manually one by one after OCR processing. If there were a tag suggestion button on the edit form, that would be fine for our case.

A simple full-text search on the new document's full content could be a good start. I tried that on the web interface with the different document types I have: I simply copied the whole text from /api/file/.../data?size=content of the new file into the search box on the UI. The results were exactly the related documents, which had the same tags I had to set manually for the new document. That can be repeated for all the files of the document. Suggesting the tags that most of the documents have in common would be great. Or the top X tags by count. Something like that.
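The "top X tags by count" step could look roughly like the sketch below. It assumes the full-text search has already returned the related documents and that the tags of each one are available as a list of names. `TopTagsSketch` and `topTags` are hypothetical names, not an existing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project.
final class TopTagsSketch {
    /**
     * Counts how often each tag occurs across the related documents and returns
     * at most `limit` tags, most frequent first. Ties keep first-seen order.
     * Assumes each inner list holds a tag at most once.
     */
    static List<String> topTags(List<List<String>> tagsOfRelatedDocs, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> tags : tagsOfRelatedDocs) {
            for (String tag : tags) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equally frequent tags stay in insertion order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (result.size() >= limit) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> related = Arrays.asList(
                Arrays.asList("invoice", "acme"),
                Arrays.asList("invoice", "2018"),
                Arrays.asList("invoice", "acme"));
        // prints [invoice, acme]
        System.out.println(topTags(related, 2));
    }
}
```

A variant that only keeps tags present in more than half of the related documents would cover the "tags that most of the documents have in common" option.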

madduck commented 1 year ago

This would be amazing to have, maybe even on top of nltk, i.e. using machine learning. Paperless-ngx does this, and it's a really good way to have the system learn over time, and lessen the workload on the users.

FoxxMD commented 6 months ago

This is what is preventing me from switching from paperless to teedy. I already have tags set up in paperless based on regular expressions and having to reproduce those manually for 400+ documents if I migrated to teedy is a showstopper for me.