nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images
https://oncopathtk.org

Balance graph datasets #290

Open CarlinLiao opened 10 months ago

CarlinLiao commented 10 months ago

Since we're adapting pathology datasets for use in machine learning, our datasets often end up imbalanced, e.g., the set of treatment non-responders ends up with 2x as many graphs as the set of responders. This can lead to the model overfitting to the larger class, since it sees 2x or more examples from one category than from the other.

I'd like to add an option to the train/validation/test split so that, if one or more classes have more examples than another class, the excess examples are shunted off into what is currently the "unlabeled" class, which would be renamed to the "not used for training, validation, or testing" class. This function could also be propagated to spt-plugin so that forked plugins have access to it.
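
A rough sketch of the idea, to make the proposal concrete. This is not existing SPT or spt-plugin code: the function name `balance_by_downsampling`, the ID-to-label dict input, and the `seed` parameter are all assumptions for illustration. It downsamples every class to the size of the smallest one and returns the excess as a separate "unused" list, which would then be excluded from the train/validation/test split.

```python
import random
from collections import defaultdict


def balance_by_downsampling(labels, seed=0):
    """Return (kept, unused) lists of sample IDs.

    labels: dict mapping a sample (graph) ID to its class label.
    Every class keeps as many samples as the smallest class has;
    the randomly chosen excess goes to `unused`.
    Hypothetical helper, not part of SPT.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)
    if not by_class:
        return [], []

    # Size of the smallest class is the per-class quota.
    target = min(len(ids) for ids in by_class.values())

    kept, unused = [], []
    for label in sorted(by_class, key=str):
        ids = sorted(by_class[label], key=str)  # fixed order before shuffling
        rng.shuffle(ids)
        kept.extend(ids[:target])
        unused.extend(ids[target:])
    return kept, unused


# Example: 4 non-responders vs. 2 responders (the 2x imbalance described above).
example_labels = {f"nr{i}": "non-responder" for i in range(4)}
example_labels.update({f"r{i}": "responder" for i in range(2)})
kept, unused = balance_by_downsampling(example_labels)
print(len(kept), len(unused))  # 4 2
```

Only the `kept` IDs would be passed on to the existing split; the `unused` IDs would get the renamed "not used for training, validation, or testing" label. Downsampling to the smallest class is just one possible policy, and a cap on the allowed class ratio would be a natural variant.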