valosekj opened this issue 9 months ago
Hey @valosekj, thank you for running the predictions! The results on STIR contrast look good indeed!
The contrast-agnostic MONAI model performs significantly worse on PSIR contrast;
About this -- did you run the predictions on the raw PSIR images? Or did you do any intensity rescaling? I think 1-2 weeks back @plbenveniste noticed that multiplying the PSIR images by -1 improved the segmentations (link to the discussion sent to you on Slack). Maybe you could do this and see if the results are better?
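If helpful, the inversion is just a sign flip of the voxel values; a minimal numpy sketch (file I/O with nibabel, or the equivalent SCT call, is left out and should be checked against your SCT version):

```python
import numpy as np

def invert_psir(data: np.ndarray) -> np.ndarray:
    """Multiply intensities by -1: light cord / dark CSF becomes
    dark cord / light CSF (background sign changes too)."""
    return -1.0 * data
```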
Note that I used the default 64x160x320 cropping. I will try running the prediction for the second time with R-L cropping.
You could try 64 x 192 x -1. The -1 is important because if the spinal cord exceeds 320 slices in S-I, the crop cuts off the top/bottom part of the cord. ~~I am considering making -1 the default for S-I cropping in my script~~
With these changes, I am confident that the results on PSIR will improve!
EDIT: I changed the default crop-size for the inference script in commit https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord/pull/60/commits/c4cbd61da6021237128d9fe6d4b41a1c712e90ea
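A toy sketch of how a -1 crop dimension can be interpreted (this is the assumed semantics, not the actual inference script):

```python
def resolve_crop(crop_size, img_shape):
    """Replace -1 entries with the full image extent along that axis,
    so the S-I dimension is never cropped (assumed behavior of -1)."""
    return tuple(s if c == -1 else c for c, s in zip(crop_size, img_shape))

resolve_crop((64, 192, -1), (320, 320, 400))  # -> (64, 192, 400)
```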
Thank you @naga-karthik! I am now using both suggestions:
- multiplication by -1 to swap contrast from light cord and dark CSF to dark cord and light CSF (commit)
- 64 x 192 x -1 crop size (commit)

The contrast-agnostic MONAI model now works significantly better on the PSIR images; see QC video here!
Processed data and QC are saved in ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-09_PSIR_inv_fixed_patch_size.
Thanks @valosekj for these changes! Have to say that the predictions look much better now!
We were thinking of adding some intensity-based scaling during training or inference because of the PSIR contrast (which was not used in the training of the contrast-agnostic model). Relevant issue: https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord/issues/69
tagging @plbenveniste
Next steps for SC seg:
- add the keep_largest_object function to the run_inference_single_image.py script - Jan - done in https://github.com/ivadomed/canproco/pull/44/commits/8a15b162d397338d168bd6d980d241283f697217

Next steps for vertebral labeling:

Additional next steps:
- add sct_analyze_lesion to the segment_sc_contrast-agnostic.sh script - Jan - done in https://github.com/ivadomed/canproco/pull/44/commits/df718cf671df0b9e699a80f374f144c9223d7de1 and https://github.com/ivadomed/canproco/pull/44/commits/46072a0b4bb55461ed437737f1bdb122535713c5

RL flipping experiment
RL flipping has proved to have an impact on the quality of the segmentation.
The following GIF shows the difference on subject sub-cal105
(left: original, right: flipped-back)
The process to obtain the segmentations was:
1. flip the image: sct_image -i file -o output -flip z
2. run the contrast-agnostic model on the flipped image
3. flip the segmentation back (using the same command)

To better understand, here is a GIF showing what flip z does to the image:
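A numpy sketch of the flip / predict / flip-back round trip (which axis index corresponds to R-L depends on the image orientation; axis 0 is an assumption here, and the model call is hypothetical):

```python
import numpy as np

def flip_lr(vol: np.ndarray) -> np.ndarray:
    """Flip along one axis -- analogous to `sct_image -flip z`
    (the axis that is R-L depends on the image orientation)."""
    return np.flip(vol, axis=0)

# Intended use (model call is hypothetical):
#   seg_flipped = model(flip_lr(img))
#   seg = flip_lr(seg_flipped)   # flipping twice restores the original orientation

vol = np.arange(8).reshape(2, 2, 2)
assert np.array_equal(flip_lr(flip_lr(vol)), vol)
```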
Suggestion:
Nice demonstration @plbenveniste. So there might be something fishy in the prediction code?
I observed similar results for flipping on x (anterior-posterior) and y (inferior-superior). The predictions are different (sometimes better, sometimes worse).
→ Overall, by running the model several times on the same image with modifications, we can get complementary information. Our strategy is to run the model on the original image, the image flipped on x (anterior-posterior), the image flipped on y (inferior-superior) and the image flipped on z (left-right). The final mask is the sum of the 4 masks, which is then binarized (using sct_maths -i {mask_path} -o {mask_path} -bin 0.5
).
⚠ This will require some post-processing, as the spinal cord segmentation extends higher (too high?) into the brain because of the flip on y (inferior-superior): this will be handled using the vertebral levels (which we will either label manually or obtain with Nathan's model)
-> Currently running this
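A numpy sketch of this 4-flip strategy (`predict` is a hypothetical callable wrapping the contrast-agnostic model; each prediction is flipped back before summing so all masks are in the original orientation):

```python
import numpy as np

def flip_ensemble_mask(img, predict, thr=0.5):
    """Run `predict` on the original image and on the three single-axis
    flips, flip each prediction back, sum the four soft masks, and
    binarize at `thr` (same effect as `sct_maths -bin 0.5` on the sum)."""
    total = np.zeros_like(img, dtype=float)
    total += predict(img)
    for ax in (0, 1, 2):  # flips on x (A-P), y (I-S), z (R-L)
        flipped_pred = predict(np.flip(img, axis=ax))
        total += np.flip(flipped_pred, axis=ax)  # back to original orientation
    return (total > thr).astype(np.uint8)
```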
Our strategy is to run the model on the original image, the image flipped on x (anterior-posterior), the image flipped on y (inferior-superior) and the image flipped on z (left-right). The final mask is the sum of the 4 masks, which is then binarized (using sct_maths -i {mask_path} -o {mask_path} -bin 0.5 ).
Won't this lead to an over-segmentation? I would do an average instead of a sum. But if we do an average, we might still "miss" the segmentations that only show up with, e.g., the R-L flip.
Also, what is the rationale for binarizing the output segmentation? In the past, we (ie: Naga) noticed that training a softseg model with a mix of soft and hard inputs biases the model towards less soft predictions (@naga-karthik can confirm)
From our visual observation, we didn't see any over-segmentation. But yes, it is true that it can happen. Taking an average is not going to solve that problem; binarization, however, can. For now, I am using a threshold of 0.5 (anything above is changed to 1). However, because we now have 4 segmentations, the summed mask ranges from 0 to 4, so we could raise the threshold to something higher, like 0.7 or above. Therefore, binarization can help prevent over-segmentation. Alternatively, we could modify the binarization so that anything below 0.7 becomes 0 and anything above 1 is clipped to 1 (we would therefore still have a soft prediction?) (not so sure about this idea though) -> To be investigated as well
We can then change the threshold to something higher like 0.7 or above. Therefore, binarization can help prevent over-segmentation. However, what we could do is modify binarization so that anything below 0.7 is 0 and anything above 1 is 1.
The only issue I see with this is that the contrast agnostic model is designed to be calibrated across contrasts (ie: a value of 0.8 is supposed to represent 80% of partial volume). If we play around with the output regression values, it defeats the purpose of this calibration. Which is why I was suggesting averaging instead of summing, but if averaging does not solve the issue of 'missing' spinal cord, then that's a problem...
maybe the issue can be further investigated by digging a bit more in the inference pipeline?
Also, what is the rationale for binarizing the output segmentation?
We binarize the output segmentation to make it compatible with sct_label_vertebrae
; see lines here. We will also use the output segmentation for the registration to the template.
We binarize the output segmentation to make it compatible with sct_label_vertebrae; see lines here. We will also use the output segmentation for the registration to the template.
Right, but I would still keep the soft segmentation because we need it for training. And the soft segmentation is the one that needs to be manually corrected (followed by binarization). With your current pipeline, you will end up manually correcting the binary segmentation, so we will end up with twice as much manual correction needed.
so to sum up, we need: pred_soft -> pred_soft_manual -> pred_soft_manual_bin
Sorry for the delay in response, had been following the updates in-person.
RL flipping has proved to have an impact on the quality of the segmentation
This is a great idea actually. Glad that we're looking into this. This is essentially test-time-augmentation done in nnUNet (it is called mirroring
there)
I would do an average instead of a sum
This is what we should be doing, I believe. Even nnUNet takes the mean of the predictions (see this function)
But if we do an average, we might still "miss" the segmentations that only show up with, e.g., the R-L flip.
This is also true. If there is no prediction on one of the axes, averaging would make the result "softer" than needed.
And the soft segmentation is the one that needs to be manually corrected
@jcohenadad I don't understand how this can be done. With fsleyes all manual corrected labels have a label value of 1, right? how can we have 0.2 or 0.3 at the SC boundaries for example?
@jcohenadad I don't understand how this can be done. With fsleyes all manual corrected labels have a label value of 1, right? how can we have 0.2 or 0.3 at the SC boundaries for example?
Ah! I've been waiting for this question 😊 The only humanly reasonable way I see this, is by altering the soft mask with binary values. Eg: if the rater notices an undersegmentation, they would add 'ones' where the cord is supposed to be. So, 99.99% of the segmentation would still be soft, except for the part that was manually corrected. I believe this is still better than having 0% of the segmentation that is soft.
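A toy 1-D illustration of this correction scheme (all values are made up):

```python
import numpy as np

# Hypothetical 1-D cord profile: soft values at the boundary, 1s at the center.
pred_soft = np.array([0.0, 0.2, 0.8, 1.0, 1.0, 0.8, 0.0, 0.0])

# The rater notices an under-segmentation at index 6 and paints a 'one' there.
pred_soft_manual = pred_soft.copy()
pred_soft_manual[6] = 1.0

# Most of the mask stays soft; only the corrected voxel is hard.
```

Note that the 0.8 at index 5 now sits between two 1s, which is exactly the "sandwiching" concern raised below.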
Ah okay, but in the case of undersegmentation won't the soft values be "sandwiched" between 1s of the actual prediction (say the center of the SC) and the 1s of the manual correction (at the SC boundary)?
Say the green arrow in this case is undersegmentation, then we would have to erase the orange-yellow parts just to the right of the arrow and add our manual corrections
BUT, now that I see it the corrected label will still be soft as you mentioned earlier (and better than only 0s and 1s). So, in the end, it could be done then!
EDIT: added figure
Ah okay, but in the case of undersegmentation won't the soft values be "sandwiched" between 1s of the actual prediction (say the center of the SC) and the 1s of the manual correction (at the SC boundary)? Say the green arrow in this case is undersegmentation, then we would have to erase the orange-yellow parts just to the right of the arrow and add our manual corrections
Indeed, for consistency we would need to do that (ie: make sure there is no value below 1 inside the spinal cord if 1s are added at the border).
BUT, now that I see it the corrected label will still be soft as you mentioned earlier (and better than only 0s and 1s). So, in the end, it could be done then!
Indeed, I would expect the prediction to be "good enough" so that we wouldn't have to deal with too many of these cases. If there is a systematic under-segmentation, then I think we should refrain from manually correcting the scans and instead enrich/improve the generalizability of the model first, and then re-run the prediction on those scans.
(moving conversation from issue 49 to this issue to centralize everything)
New QC results
QC of SC segmentation done using the .sh script, which includes:
We must also note that the inference script treats every voxel below 0.5 as background. Therefore, binarizing the sum with a threshold of 0.5 means we keep any voxel that was labeled at least once. The non-zero values of the summed mask now range from 0.5 to 4.
Problematic images:
The problem with the missing bottom of the segmentation is caused by the keep_largest function, which only keeps the largest continuous chunk. Therefore, while previously only a small chunk in the middle of the SC was missing, now the pipeline keeps only the larger part of the SC (often the top part).
Example with sub-van189_ses-M0_PSIR:
without keep_largest function
with keep_largest function
The problem with the missing bottom of the segmentation is caused by the keep_largest function which only keeps the longest continuous chunk. Therefore, while previously only a small chunk in the middle of the sc was missing, now the pipeline only keeps the bigger part of the sc (often being the top part).
Ouch! That's a problem indeed. There was also another function, something like "remove small objects" (link here). Maybe that would be more appropriate?
The function "remove small objects" successfully removes a "small" block which is segmented in the eye. To reproduce the error of segmentation in the eye, I used:
Then to get the image without the small error in segmentation, I just added the "remove_small_objects" function in the script with min_size = 500 voxels.
The result is the following for subject sub-mon137:
The min_size can be discussed. In this case, the spinal cord segmentation without the error in the eye is 6812 voxels, and the segmentation in the eye is 120 voxels. Should min_size be set at 500 voxels? Should it be set as a certain percentage of the total volume (10%)? Should it be in mm3?
In this case, the size segmentation of the spinal cord without the error in the eye is 6812 voxels
Is it in the similar range for a few more subjects? We can decide on a certain percentage of the total volume by getting an average estimate from a few more subjects. What do you think?
Also, having a percentage (instead of a raw number) of voxels is better imo
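A sketch of a percentage-based variant, using scipy's connected-component labeling as a stand-in for skimage's remove_small_objects (the function name, the use of scipy, and the 10% default are assumptions):

```python
import numpy as np
from scipy import ndimage  # assumption: scipy available

def remove_small_objects_pct(mask: np.ndarray, pct: float = 0.1) -> np.ndarray:
    """Drop connected components smaller than `pct` of the total segmented
    volume, instead of a fixed voxel count (hypothetical variant of
    skimage's remove_small_objects)."""
    labels, n = ndimage.label(mask > 0)
    if n == 0:
        return mask
    sizes = np.bincount(labels.ravel())[1:]       # voxels per component
    min_size = pct * sizes.sum()                  # e.g. 10% of total volume
    keep = np.flatnonzero(sizes >= min_size) + 1  # component ids to keep
    return np.where(np.isin(labels, keep), mask, 0)
```

With the numbers above (cord: 6812 voxels, eye: 120 voxels), a 10% threshold (~693 voxels) would remove the eye blob while keeping the cord.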
Script: segment_sc_contrast-agnostic.sh
Steps:
- 0.5 threshold
- cropping: 64 x 192 x -1 (default)

QC can be found at: ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-17_with_vert_labeling
YML files for PSIR and STIR:
FILES_SEG:
- sub-edm029_ses-M0_PSIR.nii.gz
- sub-mon041_ses-M0_PSIR.nii.gz
- sub-mon036_ses-M0_PSIR.nii.gz
- sub-van116_ses-M0_PSIR.nii.gz
- sub-van154_ses-M0_PSIR.nii.gz
- sub-edm019_ses-M0_PSIR.nii.gz
- sub-edm038_ses-M0_PSIR.nii.gz
- sub-edm075_ses-M0_PSIR.nii.gz
- sub-mon014_ses-M0_PSIR.nii.gz
- sub-mon104_ses-M0_PSIR.nii.gz
- sub-mon152_ses-M0_PSIR.nii.gz
- sub-mon180_ses-M0_PSIR.nii.gz
- sub-van159_ses-M0_PSIR.nii.gz
- sub-van189_ses-M0_PSIR.nii.gz
- sub-tor092_ses-M0_PSIR.nii.gz
FILES_SEG:
- sub-cal095_ses-M0_STIR.nii.gz
- sub-cal078_ses-M0_STIR.nii.gz
- sub-cal115_ses-M0_STIR.nii.gz
- sub-cal160_ses-M0_STIR.nii.gz
- sub-cal198_ses-M0_STIR.nii.gz
Manual correction commands (note that -path-img and -path-label are the same because both images and SC segs are located under the same folders. Also note that we have to run manual_correction.py twice because PSIR and STIR have different -suffix-files-seg):
# PSIR
python manual_correction.py -config data_processed/seg_to_correct_PSIR.yml -path-img data_processed/data_to_correct_PSIR -path-label data_processed/data_to_correct_PSIR -suffix-files-seg _mul_pred_sum_bin_with_lesion_bin
# STIR
python manual_correction.py -config data_processed/seg_to_correct_STIR.yml -path-img data_processed/data_to_correct_STIR -path-label data_processed/data_to_correct_STIR -suffix-files-seg _pred_sum_bin_with_lesion_bin
Next steps:
sub-mon152
@valosekj Can you update the above TODO list if I have missed something?
In the above list, sub-mon152 was not manually corrected as it was part of exclude.yml but was forgotten in the config file. Should the SC seg be deleted for this subject?
In the above list, sub-mon152 was not manually corrected as it was part of exclude.yml but was forgotten in the config file. Should the SC seg be deleted for this subject?
If the SC seg (and the image quality) is bad, then yes, we will not include the SC seg for this subject in git-annex.
I ran the first run of the segment_sc_contrast-agnostic.sh script (PR https://github.com/ivadomed/canproco/pull/44) across canproco first-session (ses-M0) PSIR and STIR images. The script does the following:
- sct_maths -bin 0.5 (to make the prediction compatible with sct_label_vertebrae)
- sct_label_vertebrae
Initial observations
Spinal cord segmentation:
The contrast-agnostic MONAI model works well on STIR contrast (available for a single site (Calgary)); see the first part of this QC video. The model missed some SC parts, but they are located mainly in the outer slices where SC-CSF contrast is lower. Generally, the segmentations are very good.
The contrast-agnostic MONAI model performs significantly worse on PSIR contrast; see the second part of the QC video (from 00:30). For the PSIR contrast, the model either missed SC parts or segmented structures outside of the SC.
Note that I used the default 64x160x320 cropping. I will try running the prediction for the second time with R-L cropping.
Vertebral labeling:
sct_label_vertebrae fails during automatic C2-C3 disc detection in ~160 subjects --> those subjects will need manual labeling corrections. I am in favour of labeling all discs (instead of labeling only the init C2/C3 disc). This will be slightly more time-consuming, but we will be sure that the labeling will be okay.

Processed data and QC are saved in ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-07.

Tagging @sandrinebedard, @naga-karthik, @plbenveniste.