Separate Training, Validation, and Testing data into different datasets

thomashopkins32 commented 1 year ago

To make the transforms easier to work with, we should pre-split the data into training, validation, and testing sets.

The testing data is a single image and is already split off. The training data needs to be randomly split, and this split needs to be saved somewhere so that it can be reused across runs.
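A minimal sketch of a saved random split, assuming each training sample has a unique ID. The helper names (`make_split`, `save_split`, `load_split`), the JSON file format, and the 20% validation share are illustrative assumptions, not existing code in this repository.

```python
import json
import random
from pathlib import Path


def make_split(image_ids, valid_fraction=0.2, seed=0):
    """Shuffle the IDs with a fixed seed and cut off a validation share.

    All names and defaults here are hypothetical.
    """
    # Sort first so the result does not depend on the input order.
    ids = sorted(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_valid = int(len(ids) * valid_fraction)
    return {"train": ids[n_valid:], "valid": ids[:n_valid]}


def save_split(split, path):
    """Write the split to a JSON file so later runs reuse the same IDs."""
    Path(path).write_text(json.dumps(split, indent=2))


def load_split(path):
    """Read a split previously written by save_split."""
    return json.loads(Path(path).read_text())
```

Saving the ID lists rather than only the seed keeps the split stable even if the shuffling code or the set of available samples changes later.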

This is required so that we can skip image transformations during validation and compute class frequencies from the training data alone. If we use a single dataset and split it afterwards, we run into the following issues:

- Data leakage from computing the class frequencies using validation data
- Validation data does not reflect real-world data (it has been randomly augmented)
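To illustrate the first issue, here is a rough sketch of computing class frequencies from the training masks only. The function name `class_frequencies` and the representation of a mask as rows of integer class labels are assumptions made for this example.

```python
from collections import Counter


def class_frequencies(masks, num_classes):
    """Return the fraction of pixels per class over the given masks.

    Each mask is assumed to be an iterable of rows of integer labels.
    Pass only the training masks so validation pixels do not leak in.
    """
    counts = Counter()
    for mask in masks:
        for row in mask:
            counts.update(row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pixels to count")
    return [counts.get(c, 0) / total for c in range(num_classes)]
```

Calling this on the full dataset before splitting would fold validation pixels into the statistics, which is the leakage described above.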

We should implement TrainHuBMAP, ValidHuBMAP, and TestHuBMAP datasets instead of the single HuBMAP dataset.
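One possible shape for these classes is sketched below. Only the three class names come from this issue; the shared base class `_BaseHuBMAP`, the constructor arguments, and the `transform(image, mask)` calling convention are assumptions. The classes follow the map-style dataset protocol (`__len__` and `__getitem__`) and would presumably subclass `torch.utils.data.Dataset` in the actual project.

```python
class _BaseHuBMAP:
    """Hypothetical shared base: index into paired images and masks."""

    def __init__(self, images, masks, transform=None):
        if len(images) != len(masks):
            raise ValueError("images and masks must have the same length")
        self.images = images
        self.masks = masks
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image, mask = self.images[idx], self.masks[idx]
        # Apply the (possibly random) transform only if one was given.
        if self.transform is not None:
            image, mask = self.transform(image, mask)
        return image, mask


class TrainHuBMAP(_BaseHuBMAP):
    """Training subset: random augmentation is allowed here."""


class ValidHuBMAP(_BaseHuBMAP):
    """Validation subset: never transformed."""

    def __init__(self, images, masks):
        super().__init__(images, masks, transform=None)


class TestHuBMAP(ValidHuBMAP):
    """The single held-out test image, also left untransformed."""
```

Because `ValidHuBMAP` and `TestHuBMAP` do not accept a transform at all, augmentation cannot be applied to them by accident.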

thomashopkins32 commented 1 year ago

We can also look into using cross-validation instead, but this seems too expensive for such a large model.

thomashopkins32 commented 1 year ago

This turns out not to be necessary. I forgot that I return both the transformed and original images in the batch!

thomashopkins32 / HuBMAP

Separate Training, Validation, and Testing data into different datasets #36