nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense

Create a binary classification filtering model for the top 25% of labels #69

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago

The long-term plan is to train a model that classifies scientific papers by biomimicry function, using our taxonomy of 100 leaf categories. As a first step, let's train a model that accurately assigns only the top 25% most-used labels in our ground truth dataset, and train more comprehensive models later. During this early stage, most documents sent for classification will be "out of domain" for the initial label set; that is, their correct labels fall outside the initial top 25%. A model trained only on the initial top 25% of labels and then run on all documents will still try to assign one of those labels to every "out of domain" document, which makes it less accurate.
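As a rough sketch of how the initial label set could be picked, the snippet below ranks labels by how often they appear in the ground truth and keeps the top quarter. It assumes "top 25%" means 25% of the distinct labels ranked by usage, and that each paper carries a list of labels; the function name `top_labels` and the toy labels are hypothetical, not taken from the PeTaL codebase.

```python
from collections import Counter
import math


def top_labels(doc_labels, fraction=0.25):
    """Return the most-used labels, keeping `fraction` of the distinct labels.

    doc_labels: one list of labels per document (assumed multi-label).
    """
    counts = Counter(label for labels in doc_labels for label in labels)
    keep = math.ceil(len(counts) * fraction)
    return {label for label, _ in counts.most_common(keep)}


# Hypothetical toy ground truth: 4 distinct labels, so 25% keeps 1 label.
doc_labels = [
    ["attach", "move"],
    ["attach"],
    ["move"],
    ["attach", "sense"],
    ["protect"],
]
initial_labels = top_labels(doc_labels)

# A document is "in domain" if at least one of its labels is in the initial set.
in_domain_flags = [any(l in initial_labels for l in labels) for labels in doc_labels]
```

With 100 leaf categories that all occur in the data, this selection would keep 25 labels. Whether a multi-label paper counts as "in domain" when only some of its labels are in the initial set is a design choice the issue does not settle; the sketch treats any overlap as "in domain".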

In scenarios where you expect your set of labels to expand over time, we recommend training two models using the initial smaller label set:

Classification model (Issue #70): A model that classifies documents into the current set of labels.

Filtering model (this issue): A model that predicts whether a document fits within the current set of labels or is "out of domain."

Submit each document to the filtering model first, and only send documents to the classification model that are "in domain."
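The routing described above can be sketched as a small function that calls the filtering model first and only then the classification model. This is a minimal illustration, not the project's implementation: `label_document`, `toy_filter`, and `toy_classifier` are hypothetical names, and the keyword lookups are stand-ins for the two trained models.

```python
from typing import Callable, Optional


def label_document(
    text: str,
    is_in_domain: Callable[[str], bool],
    classify: Callable[[str], str],
) -> Optional[str]:
    """Run the filtering model first; classify only "in domain" documents."""
    if not is_in_domain(text):
        return None  # "out of domain": never reaches the classification model
    return classify(text)


# Hypothetical stand-ins for the two trained models (keyword lookups only).
KEYWORD_TO_LABEL = {"adhesion": "attach", "locomotion": "move"}


def toy_filter(text: str) -> bool:
    return any(keyword in text.lower() for keyword in KEYWORD_TO_LABEL)


def toy_classifier(text: str) -> str:
    for keyword, label in KEYWORD_TO_LABEL.items():
        if keyword in text.lower():
            return label
    raise ValueError("classifier called on an out-of-domain document")


in_result = label_document("Gecko adhesion on rough surfaces", toy_filter, toy_classifier)
out_result = label_document("Thermal regulation in termite mounds", toy_filter, toy_classifier)
```

Returning `None` for filtered-out documents keeps them available for later, once a more comprehensive label set has been trained.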

In the example described above, the classification model identifies the biomimicry function of a document, and the filtering model makes a binary prediction about whether the document belongs to any of the functions for which the classification model has labels.

To train the filtering model, use the same set of documents you used for the classification model, but label each document "in domain" instead of using its specific label from your set. Then add an equal number of documents for which the current label set is not appropriate, and label them "out of domain."
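One way to assemble that training set is sketched below. It assumes documents are plain strings and that a pool of papers labeled only outside the initial set is available; `build_filter_dataset` and the example titles are hypothetical. If the pool holds fewer documents than the in-domain set, the sketch simply uses all of them, so the classes would then be unbalanced.

```python
import random


def build_filter_dataset(in_domain_docs, other_docs, seed=0):
    """Pair every classification document with "in domain" and an equal-sized
    random sample of other documents with "out of domain"."""
    rng = random.Random(seed)
    n_out = min(len(in_domain_docs), len(other_docs))
    sampled_out = rng.sample(other_docs, n_out)
    dataset = [(doc, "in domain") for doc in in_domain_docs]
    dataset += [(doc, "out of domain") for doc in sampled_out]
    rng.shuffle(dataset)  # avoid feeding the trainer one class at a time
    return dataset


# Hypothetical example documents.
in_domain_docs = ["gecko adhesion", "burr attachment", "fish locomotion"]
other_docs = ["termite mound cooling", "moth eye optics", "lotus leaf", "shark skin"]

filter_data = build_filter_dataset(in_domain_docs, other_docs)
n_in = sum(1 for _, tag in filter_data if tag == "in domain")
n_out = sum(1 for _, tag in filter_data if tag == "out of domain")
```

The resulting `(document, tag)` pairs can then be fed to whichever binary classifier is chosen for the filtering model.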