Create a multi-label classification model for the most used labels.

The long-term plan is to train a model that classifies scientific papers by biomimicry function, using our taxonomy of 100 leaf labels. Let's start by training a model that identifies the N most used labels in our ground truth dataset, with plans to train more comprehensive models in the future. During this early stage, most documents sent for classification will be "out of domain" for the initial label set; that is, they are papers whose labels fall outside the initial N labels. If you train a model on only the initial N labels and then run it on all of your documents, it will force each "out of domain" document into one of the existing labels, which lowers its accuracy.
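As a rough illustration of picking the N most used labels, the sketch below counts how many papers carry each label and keeps the top N. The function name `top_n_labels`, the label names, and the one-list-of-labels-per-paper layout are assumptions made for this example, not the actual ground truth format.

```python
from collections import Counter


def top_n_labels(label_lists, n):
    """Return the n labels that appear on the most papers."""
    # Count each label at most once per paper, even if it is listed twice.
    counts = Counter(label for labels in label_lists for label in set(labels))
    return [label for label, _ in counts.most_common(n)]


# Hypothetical ground truth: one list of labels per paper.
ground_truth = [
    ["attach", "move"],
    ["attach"],
    ["protect", "attach"],
    ["move"],
    ["sense"],
]

top_labels = top_n_labels(ground_truth, 2)
```

Labels tied at the cutoff are kept in the order they were first seen, so the choice of N may deserve a manual check when counts are close.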

In scenarios where you expect your set of labels to expand over time, we recommend training two models using the initial smaller label set:

- Classification model (this issue): A model that classifies documents into the current set of labels.
- Filtering model (Issue #69): A model that predicts whether a document fits within the current set of labels or is "out of domain."

Submit each document to the filtering model first, and only send documents to the classification model that are "in domain."

In the example described above, the classification model identifies the biomimicry function of a document, and the filtering model makes a binary prediction about whether a document belongs to any of the functions for which the classification model has labels.
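The routing between the two models can be sketched as follows. This is a minimal sketch, not the project's implementation: `classify_with_filter` is a hypothetical helper, and the two stand-in functions only mimic the interfaces that a trained filtering model and classification model might expose.

```python
from typing import Callable, List


def classify_with_filter(
    text: str,
    is_in_domain: Callable[[str], bool],
    classify: Callable[[str], List[str]],
) -> List[str]:
    """Run the filtering model first; classify only in-domain documents."""
    if not is_in_domain(text):
        # Out of domain: none of the current labels apply.
        return []
    return classify(text)


# Stand-ins for trained models (assumptions for illustration only).
def fake_filter(text: str) -> bool:
    return "adhesion" in text


def fake_classifier(text: str) -> List[str]:
    return ["attach"]


in_domain_result = classify_with_filter(
    "Gecko-inspired adhesion pads", fake_filter, fake_classifier
)
out_of_domain_result = classify_with_filter(
    "Thermal regulation in termite mounds", fake_filter, fake_classifier
)
```

Returning an empty list for out-of-domain documents keeps the result type the same in both branches, so callers do not need a special case.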

To train the classification model, include only papers and labels from the set of N most used labels. Then add an equal number of documents for which the current label set is not appropriate, and label them as "none_of_the_above".
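One way to assemble such a training set is sketched below. It rests on assumptions this issue does not settle: a paper counts as in domain if it has at least one of the N labels, its labels outside that set are dropped, and the "none_of_the_above" examples are sampled at random from the remaining papers. The function name `build_training_set` and the sample data are hypothetical.

```python
import random

NONE_LABEL = "none_of_the_above"


def build_training_set(papers, top_labels, seed=0):
    """Keep papers with top-N labels and add an equal number of others.

    `papers` is a list of (text, labels) pairs.
    """
    top = set(top_labels)
    in_domain, out_of_domain = [], []
    for text, labels in papers:
        kept = [label for label in labels if label in top]
        if kept:
            # Assumption: drop labels outside the top-N set.
            in_domain.append((text, kept))
        else:
            out_of_domain.append((text, [NONE_LABEL]))

    # Sample as many out-of-domain papers as there are in-domain ones,
    # or all of them if fewer are available.
    rng = random.Random(seed)
    k = min(len(in_domain), len(out_of_domain))
    return in_domain + rng.sample(out_of_domain, k)


# Hypothetical papers and labels.
papers = [
    ("paper A", ["attach", "move"]),
    ("paper B", ["attach", "protect"]),
    ("paper C", ["sense"]),
    ("paper D", ["protect"]),
    ("paper E", ["cool"]),
]

training_set = build_training_set(papers, ["attach", "move"])
```

The fixed seed makes the sample reproducible between runs; shuffling the combined list before training is left to the training code.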
