ivadomed / model-seg-dcm

Segmentation of lesions on MRI scans in patients with Degenerative Cervical Myelopathy (DCM)

SSL pre-training and fine-tuning on `(64, 256, 256)` samples #7

Open valosekj opened 5 months ago

valosekj commented 5 months ago

This issue summarizes initial experiments with self-supervised learning (SSL) pre-training and fine-tuning using Vision Transformers (ViT). The experiments are based on this MONAI tutorial; the code is available in the branch jv/vit_unetr_ssl.

The idea is to do self-supervised pre-training on unlabeled images and then do supervised fine-tuning for a specific task, e.g., DCM lesion segmentation.

For simplicity, all experiments so far have been single-channel, using only the T2w contrast.

Pre-training

The pre-training is done on spine-generic multi-subject T2w images using the ViTAutoEnc model via the script vit_unetr_ssl/train.py.
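
For reference, a minimal sketch of how a ViTAutoEnc backbone can be instantiated for these dimensions (the exact hyperparameters live in `vit_unetr_ssl/train.py`; everything not stated above uses MONAI defaults here):

```python
import torch
from monai.networks.nets import ViTAutoEnc

# Minimal sketch: single-channel T2w input and the (64, 256, 256) spatial size
# used below. Each spatial dim must be divisible by the 16x16x16 patch size.
model = ViTAutoEnc(
    in_channels=1,
    img_size=(64, 256, 256),
    patch_size=(16, 16, 16),
    out_channels=1,  # reconstruct the single input channel
)

x = torch.randn(2, 1, 64, 256, 256)       # batch size of 2, as in the run below
reconstruction, hidden_states = model(x)  # ViTAutoEnc returns both
```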

First, two augmented views are created for each original training image (see lines here). Then, a contrastive loss pulls the two augmented views closer together if they are generated from the same patch; if not, it maximizes their disagreement.
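
Roughly, the two-view setup and the loss combination from the MONAI tutorial look like this (a sketch; the exact dropout/shuffle parameters in `train.py` may differ, and `model` reuses the name from the sketch above while `batch` stands for one batch from the train loader):

```python
from torch.nn import L1Loss
from monai.losses import ContrastiveLoss
from monai.transforms import Compose, CopyItemsd, OneOf, RandCoarseDropoutd, RandCoarseShuffled

# Two views: keep a clean copy as the reconstruction target, then corrupt
# "image" and "image_2" independently (hole dropout + local patch shuffling).
two_view_transforms = Compose([
    CopyItemsd(keys=["image"], times=2, names=["gt_image", "image_2"]),
    OneOf([
        RandCoarseDropoutd(keys=["image"], prob=1.0, holes=6, spatial_size=5, dropout_holes=True, max_spatial_size=32),
        RandCoarseDropoutd(keys=["image"], prob=1.0, holes=6, spatial_size=20, dropout_holes=False, max_spatial_size=64),
    ]),
    RandCoarseShuffled(keys=["image"], prob=0.8, holes=10, spatial_size=8),
    OneOf([
        RandCoarseDropoutd(keys=["image_2"], prob=1.0, holes=6, spatial_size=5, dropout_holes=True, max_spatial_size=32),
        RandCoarseDropoutd(keys=["image_2"], prob=1.0, holes=6, spatial_size=20, dropout_holes=False, max_spatial_size=64),
    ]),
    RandCoarseShuffled(keys=["image_2"], prob=0.8, holes=10, spatial_size=8),
])

# L1 reconstruction of the clean image + contrastive agreement between views.
recon_loss = L1Loss()
contrastive = ContrastiveLoss(temperature=0.05)

outputs_v1, _ = model(batch["image"])
outputs_v2, _ = model(batch["image_2"])
r_loss = recon_loss(outputs_v1, batch["gt_image"])
c_loss = contrastive(outputs_v1.flatten(start_dim=1), outputs_v2.flatten(start_dim=1))
total_loss = r_loss + c_loss * r_loss  # the tutorial's weighting of the two terms
```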

So far, I have used a spatial size of `(64, 256, 256)`.

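For context, getting each image to that shape could look roughly like this (a sketch, not the verbatim `train.py` chain):

```python
from monai.transforms import (Compose, EnsureChannelFirstd, LoadImaged,
                              RandSpatialCropSamplesd, Spacingd, SpatialPadd)

# Resample to 1 mm isotropic, pad smaller volumes up to the ROI, then crop to it.
roi = (64, 256, 256)
load_transforms = Compose([
    LoadImaged(keys=["image"]),
    EnsureChannelFirstd(keys=["image"]),
    Spacingd(keys=["image"], pixdim=(1.0, 1.0, 1.0), mode="bilinear"),
    SpatialPadd(keys=["image"], spatial_size=roi),
    RandSpatialCropSamplesd(keys=["image"], roi_size=roi, random_size=False, num_samples=1),
])
```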

The pre-training (500 epochs, batch size of 2) on 236/29 train/val images (T2w resampled to 1 mm isotropic) took ~50 hours on a single GPU on romane. I had to set the number of workers to 0 due to `RuntimeError: Pin memory thread exited unexpectedly`; with a higher number of workers, the training would probably be faster.
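
In code, the workaround is just the loader configuration; a sketch (`train_files` and `train_transforms` are placeholder names):

```python
from monai.data import DataLoader, Dataset

# num_workers=0 loads data in the main process, so the DataLoader never
# spawns the pin-memory thread that crashed with num_workers > 0.
train_ds = Dataset(data=train_files, transform=train_transforms)
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=0, pin_memory=True)
```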

Training & validation curves for the SSL pre-training:

![image](https://github.com/ivadomed/model-seg-dcm/assets/39456460/ca0082ac-99fd-4dda-ad46-ec870f9e5cbf)

Fine-tuning

The fine-tuning is done on dcm-zurich-lesion patients as a supervised task (i.e., providing T2w images and lesion labels) using the script vit_unetr_ssl/finetune.py. The pre-trained ViT weights are loaded into the UNETR model.
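
The weight transfer follows the tutorial's pattern of copying only the keys that exist in UNETR's ViT sub-module; a sketch (the `out_channels` value, checkpoint path, and the `"state_dict"` key are assumptions):

```python
import torch
from monai.networks.nets import UNETR

model = UNETR(
    in_channels=1,
    out_channels=2,            # background + lesion (assumption)
    img_size=(64, 256, 256),
    feature_size=16,
)

# Keep only the pre-trained weights that match UNETR's ViT backbone.
pretrained = torch.load("pretrained_vit.pt")["state_dict"]  # hypothetical path
vit_dict = model.vit.state_dict()
vit_dict.update({k: v for k, v in pretrained.items() if k in vit_dict})
model.vit.load_state_dict(vit_dict)
```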