thomashopkins32 / HuBMAP

Hacking the Human Vasculature (Kaggle Competition)
Apache License 2.0

Verify training script functionality #9

Closed thomashopkins32 closed 1 year ago

thomashopkins32 commented 1 year ago

Step through the training script in the debugger and make sure the data looks appropriate at every step.

Make sure each step of the training script is reproducible, since this is an important factor for submitting a notebook to Kaggle.
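A minimal seeding helper along these lines covers the main sources of nondeterminism (the `seed_everything` name is hypothetical; it assumes PyTorch and NumPy are in use):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 32) -> None:
    """Seed all RNGs so training runs are reproducible across restarts."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic cuDNN kernels (slower, but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(32)
```

Calling this once at the top of the script (and again before any DataLoader worker spawns) is usually enough for a Kaggle notebook to produce identical runs.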

Initialize the model weights well. Look into the U-Net paper for guidance on this: it suggests drawing initial weights from a Gaussian with standard deviation sqrt(2/N), where N is the number of incoming nodes of a unit.
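That Gaussian-with-std-sqrt(2/N) scheme is equivalent to Kaiming/He normal initialization, so a sketch could look like this (the `init_weights` name is hypothetical):

```python
import math

import torch
import torch.nn as nn


def init_weights(module: nn.Module) -> None:
    """Gaussian init with std = sqrt(2 / fan_in), as suggested in the
    U-Net paper; this is exactly Kaiming/He normal initialization."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Applied recursively to every submodule of the network:
# model.apply(init_weights)
```

`Module.apply` walks the whole module tree, so one call covers every conv layer in the U-Net.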

Verify that the loss decreases to 0 (or close to it) when we train on a single image (with multiple annotations). If it doesn't, we need to investigate why.
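A self-contained sketch of this sanity check, using a tiny stand-in model and a synthetic mask derived from the image so the task is actually learnable (all shapes and names here are illustrative, not the real dataset or U-Net):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Tiny stand-in for the real model: a 1x1 conv producing 2-class logits.
model = nn.Conv2d(3, 2, kernel_size=1)
image = torch.rand(1, 3, 32, 32)            # one "training image"
mask = (image[:, 0] > 0.5).long()           # learnable mask, shape (1, 32, 32)

loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.05)

model.train()
for _ in range(500):
    optimizer.zero_grad()
    loss = loss_func(model(image), mask)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")     # should decrease toward 0
```

If the loss plateaus well above zero on a single example, the usual suspects are a wiring bug (wrong target shape, logits passed through a softmax before `CrossEntropyLoss`) rather than the model itself.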

Decrease and increase model capacity: how does this affect the training outcome? Increased capacity should result in lower loss but potentially more overfitting.

Inspect the gradients of each layer's weights. Make sure they look fairly regular (no vanishing or exploding values).
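One way to eyeball this is to print each parameter's gradient norm after a backward pass: near-zero norms suggest vanishing gradients or dead layers, spikes suggest exploding gradients. A sketch with a hypothetical stand-in model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model; substitute the real U-Net here
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, 3, padding=1),
)

out = model(torch.rand(1, 3, 16, 16))
out.sum().backward()

# Per-layer gradient norms: look for zeros or huge spikes
for name, param in model.named_parameters():
    print(f"{name:12s} grad norm = {param.grad.norm().item():.4e}")
```

The same loop can be dropped into the training script right after `loss.backward()`; TensorBoard histograms are a heavier-weight alternative if this needs to run every step.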

thomashopkins32 commented 1 year ago

I rewrote the training script to use `train_one_epoch` and `validate_one_epoch` for ease of use.

There are still a couple of bugs to work out.

Then we can follow Karpathy's guide to training NNs.

thomashopkins32 commented 1 year ago

Overfitting to a single image works! Here are the parameters I used:

```python
# PARAMETERS
BATCH_SIZE = 1
LR = 1e-4
WD = 0.0        # unused by the Adam call below
MOMENTUM = 0.0  # unused by the Adam call below
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
VALID_STEP = 5
RNG = 32
EPOCHS = 50
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
```

The prediction is on the left and the ground truth is on the right.

[Figure_1: predicted mask (left) vs. ground-truth mask (right)]