jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars, 356 forks

Classifier error: No training data available #163

Closed metril closed 3 years ago

metril commented 3 years ago

I am running v0.9.8 and have tried both the Postgres and SQLite versions of the docker-compose setup. I have added 17 documents to Paperless and tagged them with Correspondents, Tags, and Document Types. I pre-created all of these identifiers, set their matching algorithm to "Auto", and entered nothing in their fields other than a name. I left the Match field blank because I assume "Auto" means that the training will find suitable words to match on its own. In all cases, I am unable to train the model. What is going on?

jonaswinkler commented 3 years ago

Are these documents still tagged with a tag that is marked as an inbox tag?

metril commented 3 years ago

Yes, these documents have an Inbox tag. I have created a tag called Inbox and have ticked its "inbox" checkbox.

The Inbox tag (with the checkbox selected) and additional tags like Bank, Bill, etc. (without the checkbox selected) were created and applied to the documents manually.

jonaswinkler commented 3 years ago

Documents marked with inbox tags are excluded from the training data for a very specific reason. Remove the inbox tag, and the classification model will pick up the documents. This is also mentioned in the documentation.

The reason for that is related to how the underlying classification model works. With so few documents, the algorithm might get a lot of tags and correspondents wrong. If the "Auto" matching algorithm assigns wrong tags or correspondents to newly scanned documents, and those documents were part of the training data, the next model update would learn from the wrong assignments and reinforce them. We don't want that. Therefore, we need some mechanism to tell the model which documents are tagged correctly, and removing the Inbox tag implements this mechanism in Paperless.
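A minimal sketch of the selection rule described above, in plain Python. The `Tag` and `Document` classes and the `training_documents` function are hypothetical names invented for illustration; they are not Paperless's actual models or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    # Hypothetical stand-in for a tag; `is_inbox_tag` mirrors the checkbox.
    name: str
    is_inbox_tag: bool = False


@dataclass
class Document:
    # Hypothetical stand-in for a document with manually applied tags.
    title: str
    tags: list = field(default_factory=list)


def training_documents(documents):
    """Return only the documents that carry no inbox tag."""
    return [
        doc for doc in documents
        if not any(tag.is_inbox_tag for tag in doc.tags)
    ]


inbox = Tag("Inbox", is_inbox_tag=True)
bank = Tag("Bank")

docs = [
    Document("statement.pdf", [inbox, bank]),  # still in the inbox: excluded
    Document("invoice.pdf", [bank]),           # reviewed: used for training
]

selected = training_documents(docs)
if not selected:
    # With every document still in the inbox, there is nothing to train on.
    print("No training data available")
else:
    print([doc.title for doc in selected])
```

If all 17 documents still carry the Inbox tag, the filtered list is empty, which is consistent with the error in the issue title.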

metril commented 3 years ago

Gotcha! Could you please link me to the page in the documentation where this is mentioned?

I may have glossed over it. I also tried searching for "training" as a keyword.

jonaswinkler commented 3 years ago

https://paperless-ng.readthedocs.io/en/latest/advanced_usage.html#automatic-matching

I still need to work on the documentation, maybe some sections should be placed differently.

metril commented 3 years ago

Thank you! The classifier training is working now!

shamoon commented 3 years ago

Oh wow, I never knew this! Yeah, I agree, perhaps we should figure out how to make this more obvious to new users. A few clarifying questions:

jonaswinkler commented 3 years ago

  • Would it make sense to have an “Inbox” tag by default? If you set up a new install without one, doesn’t that mean you might mess up the initial training data?

I don't want an Inbox tag by default, since that is an optional feature, and some users might not want to use it.

You really can't mess up training data. The training data is exactly the current set of documents without inbox tags. Correct any errors, and the underlying model will work correctly. It does not remember anything from before.

  • Maybe I’m confused, but if “Inbox” items never get auto-matched, at what point will users start to see the effects of auto-matching? If it doesn’t happen until after you remove something from your inbox, isn’t that pointless, since you’re probably taking it out of your inbox because you’re done tagging it? Or do you mean only during the training period?

Users will start to see the effects of auto-matching as soon as their archive contains documents with auto-matching metadata that are no longer in the inbox. Removing documents from the inbox is telling paperless "Hey, these documents look good / are correctly tagged, please use that for all new documents".

  • I assume the only way to reset training data is to completely wipe the installation?

Training data = your document collection. The model does not remember anything. If you change something, the model will reflect those changes (after re-training).
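To illustrate the "model does not remember anything" point, here is a toy sketch in plain Python. The word-voting classifier below is a deliberately simple stand-in and an assumption for illustration only, not the model Paperless uses; `train` and `predict` are hypothetical names. What matters is that every call to `train` builds the model from scratch from the current collection.

```python
from collections import Counter, defaultdict


def train(documents):
    """Build a fresh word -> tag-count table from (text, tag) pairs.

    Nothing is carried over from earlier calls, so the result depends
    only on the documents passed in.
    """
    counts = defaultdict(Counter)
    for text, tag in documents:
        for word in set(text.lower().split()):
            counts[word][tag] += 1
    return counts


def predict(model, text):
    """Let each known word vote for the tags it was seen with."""
    votes = Counter()
    for word in set(text.lower().split()):
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


# The first document is tagged incorrectly.
archive = [
    ("electricity bill march", "Bank"),
    ("bank statement march", "Bank"),
]
model = train(archive)
before = predict(model, "electricity bill april")

# Correct the tag and retrain: the earlier mistake leaves no trace.
archive[0] = ("electricity bill march", "Bill")
model = train(archive)
after = predict(model, "electricity bill april")

print(before, after)
```

Under these assumptions, the prediction flips from the wrong tag to the corrected one after retraining, with no reset of the installation needed.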

metril commented 3 years ago

If you keep the entire dataset as the "training" dataset, then you run into issues of overfitting. I can also see a scenario where your training takes longer and longer the more documents are added. I assume that over time, folks will remove the "Inbox" tag from their documents. That said, would it be more beneficial to have a tag specifically for training? This also kind of goes along the line of not having a scheduled training task and instead just exposing a button or something to invoke the training (instead of command line or scheduler methods).

shamoon commented 3 years ago

Hmm, personally I don't think most people want an entire tag, which they otherwise use for organizing their documents, dedicated to training. IMHO one of the awesome things about this project is how it all just kind of works behind the scenes. I just wanted to make sure I understand; it makes more sense to me now. Thanks for explaining, @jonaswinkler.

jonaswinkler commented 3 years ago

If you keep the entire dataset as the "training" dataset, then you run into issues of overfitting.

I'm dropping most irrelevant and infrequently used features. I don't see how using the entire document collection has anything to do with overfitting. I've tested multiple configurations of the model, and on my dataset (2500 documents, 5 types, 15 tags, ~40 correspondents), the chosen configuration works reasonably well both for tags with many associated documents and for correspondents with only a few associated documents.
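The thread does not say how infrequent features are dropped. One common approach, shown here purely as an assumption, is a document-frequency cutoff: a word is kept only if it appears in at least a given fraction of documents. `build_vocabulary` and `min_doc_fraction` are hypothetical names for this sketch.

```python
from collections import Counter


def build_vocabulary(texts, min_doc_fraction=0.5):
    """Keep words that occur in at least `min_doc_fraction` of the texts.

    Rare words (amounts, one-off names) are discarded, which shrinks the
    feature set the classifier has to fit.
    """
    doc_freq = Counter()
    for text in texts:
        # Count each word once per document, not once per occurrence.
        doc_freq.update(set(text.lower().split()))
    threshold = min_doc_fraction * len(texts)
    return sorted(word for word, n in doc_freq.items() if n >= threshold)


texts = [
    "invoice acme total 120",
    "invoice acme total 75",
    "invoice globex total 300",
    "letter from grandma",
]

vocabulary = build_vocabulary(texts)
print(vocabulary)
```

With the illustrative 50% cutoff, one-off tokens such as individual amounts fall out of the vocabulary while recurring words remain. A real collection would likely use a much lower cutoff.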

I can also see a scenario where your training takes longer and longer the more documents are added.

This is true. That's why the training is scheduled to run at some point in the background. I'd even consider running it just once a day, at night; that would be enough. However, unless you're adding millions of documents (which paperless isn't designed for), this is not an issue, since the amount of data is relatively small.

That said, would it be more beneficial to have a tag specifically for training? This also kind of goes along the line of not having a scheduled training task and instead just exposing a button or something to invoke the training (instead of command line or scheduler methods).

No, because I want to hide away most of the complexity, and tags should really be used for describing actual properties of the documents, and not for controlling some internal process.