sct-pipeline / bavaria-quebec

Bavaria (TUM) and Quebec (Polytechnique) collaboration on spinal cord MS lesion segmentation.

Running inference and computing test metrics on `deepseg_lesion` data #22

naga-karthik opened this issue 1 year ago

naga-karthik commented 1 year ago

This issue documents the prerequisite steps for testing the bavaria-quebec model on the dataset used for the `sct_deepseg_lesion` model. Following the information in https://github.com/spinalcordtoolbox/deepseg_lesion_models/issues/2#issuecomment-1624396201, gather the dataset with images, SC-seg labels, and lesion-seg labels.

The remaining steps are as follows:

  1. Since the bavaria-quebec model was trained on RPI-oriented images, all images in deepseg_lesion also have to be reoriented to RPI. From the root directory, run: `for file in *.nii.gz; do sct_image -i ${file} -setorient RPI -o ${file}; done`
  2. Run the inference script to get the predictions.
  3. Because we used region-based training, separate the (region-based) predictions into SC and lesion masks (see the sketch after this list)
    • For SC --> take both values 1 and 2 (i.e. their union) as the SC label
    • For lesion --> take only value 2 as the label
  4. Run the anima metrics evaluation
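
A minimal sketch of the split in step 3, assuming the region-based prediction encodes 0 = background, 1 = spinal cord, 2 = lesion (as implied above); the helper name and file names are hypothetical:

```python
# Hypothetical helper (not part of this repo): split a region-based prediction
# into a binary SC mask (union of labels 1 and 2) and a binary lesion mask
# (label 2 only), matching step 3 above.
import nibabel as nib
import numpy as np

def split_region_based_prediction(pred_path, sc_out_path, lesion_out_path):
    pred_nii = nib.load(pred_path)
    pred = pred_nii.get_fdata()

    sc_mask = (pred >= 1).astype(np.uint8)      # SC = union of cord (1) and lesion (2)
    lesion_mask = (pred == 2).astype(np.uint8)  # lesion = label 2 only

    nib.save(nib.Nifti1Image(sc_mask, pred_nii.affine), sc_out_path)
    nib.save(nib.Nifti1Image(lesion_mask, pred_nii.affine), lesion_out_path)

# Placeholder file names, for illustration only
split_region_based_prediction("sub-001_pred.nii.gz",
                              "sub-001_pred_sc.nii.gz",
                              "sub-001_pred_lesion.nii.gz")
```

The two binary masks can then be evaluated separately against the SC-seg and lesion-seg labels in step 4.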
naga-karthik commented 1 year ago

Following these steps, here are the resulting anima metrics:

Inference results on only the axial images in `deepseg_lesion` data (`n=11`): **SC Seg** and **Lesion Seg** metrics (screenshots attached).
Inference results on ALL test images in `deepseg_lesion` data (`n=91`): **SC Seg** and **Lesion Seg** metrics (screenshots attached).

Thoughts

Despite not being trained on partial-view images (deepseg_lesion contains either cervical or thoracic subjects) or on multiple contrasts, the bavaria-quebec model produces good SC segmentations overall. The lesion segmentations are not there yet, which is understandable as the contrasts and orientations are very different (note that the bavaria-quebec model was only trained on T2w axial images).

@jqmcginnis what are your observations and thoughts on the next steps?

jqmcginnis commented 1 year ago

Thank you very much for computing the results on the deepseg_sc/deepseg_lesion test set! I am relieved that the results are going in the right direction. Also, I am quite intrigued by the fact that the model performs better on the t2s images than on the axial images for lesion segmentation. I would not have expected this.

While we observed good generalization on the bavaria-quebec test set for the joint SC/lesion segmentation and single-task segmentation models, I am curious whether this is the case for the deepseg data as well. Perhaps we can run some tests here as well to assess whether our assumptions are correct:

  1. Segmentation model - cord only
  2. Segmentation model - lesions only
  3. Joint segmentation on straightened cord

I will provide you with these models.

Moreover, I talked to a colleague of mine (Hendrik), who has been successfully applying nn-unet in different scenarios, and he has made some (code) adjustments, particularly to the data augmentation and the patch-size parameters, when training his models. According to him, choosing the wrong patch size (which may frequently be the case) will definitely impact performance, and since we are predicting on chunks rather than holospinal images, this may have a big impact. Thus, once we have some intuition about whether the single-class segmentation models perform worse or better, I would consult him and do some testing with different patch sizes, and perhaps with other modifications he has incorporated.

Do we have access to the old training data as well? This might also be an angle worth considering.

naga-karthik commented 1 year ago

Thank you very much for computing the results on the deepseg_sc/deepseg_lesion test set!

Just to clarify -- the results are only from the test set of `sct_deepseg_lesion` (not `sct_deepseg_sc`).

I am quite intrigued by the fact that the model performs better on the t2s images than axial

I don't think we can compare it this way -- the numbers of subjects are not the same; if we had more axial images in the test set, we might have seen a better Dice score.

observed good generalization on the bavaria-quebec test set

This is true, but have you tested it on other (possibly whole-spine or axial) images from your in-house datasets?

According to him, choosing the wrong patch size (which may frequently be the case) will definitely impact performance, and since we are predicting on chunks rather than holospinal images, this may have a big impact.

This is absolutely true! I am observing meaningful differences in generalization performance just by changing the patch size, which confirms that it is indeed a crucial parameter. BUT the question is, how do you decide on another patch size? It seems that nnUNet uses a good heuristic -- the median size of the images -- which makes sense if you think about it. Any other patch size seems randomly chosen (I certainly did that), which might be hard to justify.
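
If we do sweep the patch size, here is a minimal sketch of how one might override it, assuming nnU-Net v2's layout where the preprocessed dataset carries an `nnUNetPlans.json` with a per-configuration `patch_size` entry; the dataset path and the new size are placeholders:

```python
# Hypothetical sketch: override the patch size nnU-Net derived from the median
# image shape, assuming nnU-Net v2 stores it in nnUNetPlans.json under
# configurations -> "3d_fullres" -> "patch_size".
import json
from pathlib import Path

plans_path = Path("nnUNet_preprocessed/Dataset501_bavaria/nnUNetPlans.json")  # placeholder path
plans = json.loads(plans_path.read_text())

cfg = plans["configurations"]["3d_fullres"]
print("default patch size:", cfg["patch_size"])

# Placeholder values; the new size should stay compatible with the pooling
# strides recorded in the same configuration (i.e. divisible by their product
# along each axis), otherwise training will fail.
cfg["patch_size"] = [64, 160, 160]

plans_path.write_text(json.dumps(plans, indent=4))
```

In practice it may be cleaner to save the edited plans under a new plans identifier so the default configuration stays untouched.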

In terms of data augmentation, would you know if he is (1) adding more augmentations or removing them, or (2) changing the probabilities of the existing transformations? It seems that the augmentations batchgenerators covers are already quite comprehensive!
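
To make option (2) concrete, a small self-contained batchgenerators sketch; the transforms, probabilities, and shapes are placeholders for illustration, not what Hendrik actually changed:

```python
# Hypothetical illustration of option (2): keep standard batchgenerators
# transforms but adjust their per-sample probabilities.
import numpy as np
from batchgenerators.transforms.abstract_transforms import Compose
from batchgenerators.transforms.noise_transforms import GaussianNoiseTransform
from batchgenerators.transforms.spatial_transforms import SpatialTransform

patch_size = (32, 96, 96)  # placeholder patch size

transforms = Compose([
    SpatialTransform(
        patch_size,
        do_elastic_deform=False,                 # e.g. drop elastic deformation entirely
        do_rotation=True, p_rot_per_sample=0.3,  # lower rotation probability
        do_scale=True, p_scale_per_sample=0.3,   # lower scaling probability
        random_crop=False,
    ),
    GaussianNoiseTransform(noise_variance=(0, 0.05), p_per_sample=0.15),
])

# batchgenerators expects (batch, channel, x, y, z) arrays
batch = {
    "data": np.random.rand(2, 1, 48, 128, 128).astype(np.float32),
    "seg": np.random.randint(0, 3, (2, 1, 48, 128, 128)).astype(np.float32),
}
augmented = transforms(**batch)
print(augmented["data"].shape, augmented["seg"].shape)  # (2, 1, 32, 96, 96) each
```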

EDIT:

Do we have access to the old training data as well? This might also be an angle worth considering.

Yes, we have access to the training data for deepseg_lesion, so I was also thinking of adding it to the existing bavaria-quebec whole-spine dataset -- not only because it has partial views (either thoracic or cervical) but also because it has various contrasts!

jqmcginnis commented 1 year ago

I don't think we can compare it this way -- the number of subjects are not the same, if we had more axial images in the test set then we could have seen a better Dice score?

Good observation, that was a bit too speculative :sweat_smile:

This is true but, have you tested it on other (possibly) whole spine or axial images from your in-house datasets?

I have run the nn-unet on a multi-timepoint cohort of 416 people (or even more; that's just the number of subjects that made it past other selection criteria). However, to be safe, I need to check again whether there is any overlap between the training set and this cohort; if there is, I would be surprised, and it should be very small. Note that we do not have a "hard" GT here, only masks produced by earlier nn-unet models and then manually corrected.

Any other patch size seems randomly chosen (I did that certainly) which might be hard to justify.

I would treat it as a hyperparameter tuned on a hold-out validation set. I think the patch size is often chosen too large.

Yes we have access to the training data for deepseg_lesion, so I was also thinking of adding to the existing bavaria-quebec whole-spine dataset. Not only because it has partial views (either thoracic or cervical) but it has various contrasts too!

I think we should try both, i.e. adding other contrasts, but also iterating on the current model without the deepseg_lesion data; at the moment, we can check how well the model generalizes to other datasets if we keep the current split scenario.

naga-karthik commented 1 year ago

i.e. adding other contrasts, but also iterating on the current model without the deepseg_lesion data

Wait, if we are ultimately going to add the deepseg_lesion data, then why not iterate on the model after adding it? If we spend time doing a hyper-parameter search on just the Bavaria data, then add the Quebec data, we would have to do the hyper-parameter search again. That seems like double the work, unless I'm missing your point somehow.

jqmcginnis commented 1 year ago

Wait, if we ultimately are going to add deepseg_lesion data then why not iterate on the model after adding it?

I think we can do a combination of both; it's just that I would like to ensure we know exactly what we can attribute the model's success/performance to.