Since we're adapting pathology datasets for use in machine learning, our datasets often end up imbalanced, e.g., the treatment non-responder set ends up with 2x as many graphs as the treatment responder set. This can lead to model overfitting, since the model sees twice as many examples (or more) from one category as from the other.
I'd like to add an option to the train/validation/test set split such that, if one or more classes have more examples than another class, the excess examples are shunted into what is currently the "unlabeled" class, which would be renamed to the "not used for training, validation, or testing" class. This function could also be propagated to spt-plugin so that forked plugins have access to it.
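To make the proposal concrete, here is a minimal sketch of the balancing step, assuming labels are available as a mapping from graph ID to class label, with `None` standing in for the current "unlabeled" class. The function name `downsample_to_smallest_class`, the `seed` parameter, and the data layout are all hypothetical, not existing spt or spt-plugin API. It downsamples every class to the size of the smallest one, which is just one possible balancing policy.

```python
import random
from collections import defaultdict


def downsample_to_smallest_class(labels, seed=0):
    """Balance classes by setting aside excess examples.

    labels: dict mapping graph ID -> class label, where None means the
        graph is already unlabeled (hypothetical layout, for illustration).
    Returns (kept, unused): `kept` maps graph ID -> label with every class
    reduced to the size of the smallest class; `unused` is a sorted list
    of graph IDs not used for training, validation, or testing.
    """
    by_class = defaultdict(list)
    for graph_id, label in labels.items():
        if label is not None:
            by_class[label].append(graph_id)

    if not by_class:
        # Nothing is labeled, so nothing can be kept.
        return {}, sorted(labels)

    target = min(len(ids) for ids in by_class.values())
    rng = random.Random(seed)  # seeded so the split is reproducible

    kept = {}
    for label, ids in by_class.items():
        # Sort before sampling so the result does not depend on dict order.
        for graph_id in rng.sample(sorted(ids), target):
            kept[graph_id] = label

    unused = sorted(graph_id for graph_id in labels if graph_id not in kept)
    return kept, unused


# Example: 4 non-responder graphs, 2 responder graphs, 1 unlabeled graph.
labels = {f"g{i}": "non_responder" for i in range(4)}
labels.update({f"g{i}": "responder" for i in range(4, 6)})
labels["g6"] = None
kept, unused = downsample_to_smallest_class(labels, seed=1)
```

In this example `kept` holds two graphs per class, and `unused` holds the two excess non-responder graphs plus the graph that was already unlabeled. The usual train/validation/test split would then run on `kept` only.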