valosekj opened this issue 9 months ago
Hey @valosekj, thank you for running the predictions! The results on STIR contrast look good indeed!
The contrast-agnostic MONAI model performs significantly worse on PSIR contrast;
About this -- did you run the predictions on the raw PSIR images? Or did you do any intensity rescaling? I think 1-2 weeks back @plbenveniste noticed that multiplying the PSIR images by -1 improved the segmentations (link to the discussion sent to you on Slack). Maybe you could do this and see if the results are better?
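If helpful, the inversion is just a sign flip of the voxel values; a minimal numpy sketch (file I/O with nibabel, or the equivalent SCT call, is left out and should be checked against your SCT version):

```python
import numpy as np

def invert_psir(data: np.ndarray) -> np.ndarray:
    """Multiply intensities by -1: light cord / dark CSF becomes
    dark cord / light CSF (background sign changes too)."""
    return -1.0 * data
```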
Note that I used the default 64x160x320 cropping. I will try running the prediction for the second time with R-L cropping.
You could try 64 x 192 x -1. The -1 is important because if the spinal cord exceeds 320 slices in S-I, the crop cuts off the top/bottom part of the cord. ~~I am considering making -1 the default for S-I cropping in my script~~
With these changes, I am confident that the results on PSIR will improve!
EDIT: I changed the default crop-size for the inference script in commit https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord/pull/60/commits/c4cbd61da6021237128d9fe6d4b41a1c712e90ea
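A toy sketch of how a -1 crop dimension can be interpreted (this is the assumed semantics, not the actual inference script):

```python
def resolve_crop(crop_size, img_shape):
    """Replace -1 entries with the full image extent along that axis,
    so the S-I dimension is never cropped (assumed behavior of -1)."""
    return tuple(s if c == -1 else c for c, s in zip(crop_size, img_shape))

resolve_crop((64, 192, -1), (320, 320, 400))  # -> (64, 192, 400)
```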
Thank you @naga-karthik! I am now using both suggestions:
- multiplication by -1 to swap contrast from light cord and dark CSF to dark cord and light CSF (commit)
- 64 x 192 x -1 crop size (commit)

The contrast-agnostic MONAI model now works significantly better on the PSIR images; see QC video here!
Processed data and QC are saved in ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-09_PSIR_inv_fixed_patch_size.
Thanks @valosekj for these changes! Have to say that the predictions look much better now!
We were thinking of adding some intensity-based scaling during training or inference because of the PSIR contrast (which was not used in the training of the contrast-agnostic model). Relevant issue: https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord/issues/69
tagging @plbenveniste
Next steps for SC seg:
- add the keep_largest_object function to the run_inference_single_image.py script - Jan - done in https://github.com/ivadomed/canproco/pull/44/commits/8a15b162d397338d168bd6d980d241283f697217

Next steps for vertebral labeling:

Additional next steps:
- add sct_analyze_lesion to the segment_sc_contrast-agnostic.sh script - Jan - done in https://github.com/ivadomed/canproco/pull/44/commits/df718cf671df0b9e699a80f374f144c9223d7de1 and https://github.com/ivadomed/canproco/pull/44/commits/46072a0b4bb55461ed437737f1bdb122535713c5

RL flipping experiment
RL flipping has proved to have an impact on the quality of the segmentation.
The following GIF shows the difference on subject sub-cal105
(left: original, right: flipped-back)
The process to obtain the segmentations was:
1. flip the image: sct_image -i file -o output -flip z
2. run the contrast-agnostic model on the flipped image
3. flip the segmentation back (using the same command)

To better understand, here is a GIF showing what flip z does to the image:
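A numpy sketch of the flip / predict / flip-back round trip (which axis index corresponds to R-L depends on the image orientation; axis 0 is an assumption here, and the model call is hypothetical):

```python
import numpy as np

def flip_lr(vol: np.ndarray) -> np.ndarray:
    """Flip along one axis -- analogous to `sct_image -flip z`
    (the axis that is R-L depends on the image orientation)."""
    return np.flip(vol, axis=0)

# Intended use (model call is hypothetical):
#   seg_flipped = model(flip_lr(img))
#   seg = flip_lr(seg_flipped)   # flipping twice restores the original orientation

vol = np.arange(8).reshape(2, 2, 2)
assert np.array_equal(flip_lr(flip_lr(vol)), vol)
```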
Suggestion:
Nice demonstration @plbenveniste. So there might be something fishy in the prediction code?
I observed similar results for flipping on x (anterior-posterior) and y (inferior-superior). The predictions are different (sometimes better, sometimes worse).
→ Overall, by running the model several times on the same image with modifications, we can get complementary information. Our strategy is to run the model on the original image, the image flipped on x (anterior-posterior), the image flipped on y (inferior-superior) and the image flipped on z (left-right). The final mask is the sum of the 4 masks, which is then binarized (using sct_maths -i {mask_path} -o {mask_path} -bin 0.5
).
⚠ This will require some post-processing, as the spinal cord segmentation extends higher (too high?) into the brain because of the flip on y (inferior-superior): this will be handled using the vertebral levels (which we will either label manually or obtain with Nathan's model)
-> Currently running this
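A numpy sketch of this 4-flip strategy (`predict` is a hypothetical callable wrapping the contrast-agnostic model; each prediction is flipped back before summing so all masks are in the original orientation):

```python
import numpy as np

def flip_ensemble_mask(img, predict, thr=0.5):
    """Run `predict` on the original image and on the three single-axis
    flips, flip each prediction back, sum the four soft masks, and
    binarize at `thr` (same effect as `sct_maths -bin 0.5` on the sum)."""
    total = np.zeros_like(img, dtype=float)
    total += predict(img)
    for ax in (0, 1, 2):  # flips on x (A-P), y (I-S), z (R-L)
        flipped_pred = predict(np.flip(img, axis=ax))
        total += np.flip(flipped_pred, axis=ax)  # back to original orientation
    return (total > thr).astype(np.uint8)
```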
Our strategy is to run the model on the original image, the image flipped on x (anterior-posterior), the image flipped on y (inferior-superior) and the image flipped on z (left-right). The final mask is the sum of the 4 masks, which is then binarized (using sct_maths -i {mask_path} -o {mask_path} -bin 0.5 ).
Won't this lead to an over-segmentation? I would do an average instead of a sum. But if we do an average, we might still "miss" the segmentations that only show up with, e.g., the R-L flip.
Also, what is the rationale for binarizing the output segmentation? In the past, we (ie: Naga) noticed that training a softseg model with a mix of soft and hard inputs biases the model towards less soft predictions (@naga-karthik can confirm)
From our visual observation, we didn't see any over-segmentation. But yes, it is true that it can happen. Taking an average is not going to solve that problem; binarization, however, can. For now, I am using a threshold of 0.5 (anything above is changed to 1). However, because we now have 4 segmentations, the summed mask ranges from 0 to 4, so we could raise the threshold to something higher, like 0.7 or above. Therefore, binarization can help prevent over-segmentation. Alternatively, we could modify the binarization so that anything below 0.7 becomes 0 and anything above 1 is clipped to 1 (we would therefore still have a soft prediction?) (not so sure about this idea though) -> To be investigated as well
We can then change the threshold to something higher like 0.7 or above. Therefore, binarization can help prevent over-segmentation. However, what we could do is modify binarization so that anything below 0.7 is 0 and anything above 1 is 1.
The only issue I see with this is that the contrast agnostic model is designed to be calibrated across contrasts (ie: a value of 0.8 is supposed to represent 80% of partial volume). If we play around with the output regression values, it defeats the purpose of this calibration. Which is why I was suggesting averaging instead of summing, but if averaging does not solve the issue of 'missing' spinal cord, then that's a problem...
maybe the issue can be further investigated by digging a bit more in the inference pipeline?
Also, what is the rationale for binarizing the output segmentation?
We binarize the output segmentation to make it compatible with sct_label_vertebrae
; see lines here. We will also use the output segmentation for the registration to the template.
We binarize the output segmentation to make it compatible with sct_label_vertebrae; see lines here. We will also use the output segmentation for the registration to the template.
Right, but I would still keep the soft segmentation because we need it for training. And the soft segmentation is the one that needs to be manually corrected (followed by binarization). With your current pipeline, you will end up manually correcting the binary segmentation, so we will end up with twice as much manual correction needed.
so to sum up, we need: pred_soft -> pred_soft_manual -> pred_soft_manual_bin
Sorry for the delay in response, had been following the updates in-person.
RL flipping has proved to have an impact on the quality of the segmentation
This is a great idea actually. Glad that we're looking into this. This is essentially test-time-augmentation done in nnUNet (it is called mirroring
there)
I would do an average instead of a sum
This is what we should be doing, I believe. Even nnUNet takes the mean of the predictions (see this function)
But if we do an average, we might still "miss" the segmentations that only show up with, e.g., the R-L flip.
This is also true. If there is no prediction on one of the axes, averaging would make the result "softer" than needed.
And the soft segmentation is the one that needs to be manually corrected
@jcohenadad I don't understand how this can be done. With fsleyes all manual corrected labels have a label value of 1, right? how can we have 0.2 or 0.3 at the SC boundaries for example?
@jcohenadad I don't understand how this can be done. With fsleyes all manual corrected labels have a label value of 1, right? how can we have 0.2 or 0.3 at the SC boundaries for example?
Ah! I've been waiting for this question 😊 The only humanly reasonable way I see this, is by altering the soft mask with binary values. Eg: if the rater notices an undersegmentation, they would add 'ones' where the cord is supposed to be. So, 99.99% of the segmentation would still be soft, except for the part that was manually corrected. I believe this is still better than having 0% of the segmentation that is soft.
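A toy 1-D illustration of this correction scheme (all values are made up):

```python
import numpy as np

# Hypothetical 1-D cord profile: soft values at the boundary, 1s at the center.
pred_soft = np.array([0.0, 0.2, 0.8, 1.0, 1.0, 0.8, 0.0, 0.0])

# The rater notices an under-segmentation at index 6 and paints a 'one' there.
pred_soft_manual = pred_soft.copy()
pred_soft_manual[6] = 1.0

# Most of the mask stays soft; only the corrected voxel is hard.
```

Note that the 0.8 at index 5 now sits between two 1s, which is exactly the "sandwiching" concern raised below.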
Ah okay, but in the case of undersegmentation won't the soft values be "sandwiched" between 1s of the actual prediction (say the center of the SC) and the 1s of the manual correction (at the SC boundary)?
Say the green arrow in this case is undersegmentation, then we would have to erase the orange-yellow parts just to the right of the arrow and add our manual corrections
BUT, now that I see it the corrected label will still be soft as you mentioned earlier (and better than only 0s and 1s). So, in the end, it could be done then!
EDIT: added figure
Ah okay, but in the case of undersegmentation won't the soft values be "sandwiched" between 1s of the actual prediction (say the center of the SC) and the 1s of the manual correction (at the SC boundary)? Say the green arrow in this case is undersegmentation, then we would have to erase the orange-yellow parts just to the right of the arrow and add our manual corrections
Indeed, for consistency we would need to do that (ie: make sure there is no value below 1 inside the spinal cord if 1s are added at the border).
BUT, now that I see it the corrected label will still be soft as you mentioned earlier (and better than only 0s and 1s). So, in the end, it could be done then!
Indeed, I would expect the prediction to be "good enough" so that we wouldn't have to deal with too many of these cases. If there is a systematic under-segmentation, then I think we should refrain from manually correcting the scans and instead enrich/improve the generalizability of the model first, and then re-run the prediction on those scans.
(moving conversation from issue 49 to this issue to centralize everything)
New QC results
QC of SC segmentation done using the .sh script, which includes:
We must also note that the inference script treats every voxel below 0.5 as background. Therefore, binarizing the sum with a threshold of 0.5 means we keep any voxel that was labeled at least once. The non-zero values of the summed mask now range from 0.5 to 4.
Problematic images:
The problem with the missing bottom of the segmentation is caused by the keep_largest function, which only keeps the largest continuous chunk. Therefore, while previously only a small chunk in the middle of the SC was missing, now the pipeline keeps only the larger part of the SC (often the top part).
Example with sub-van189_ses-M0_PSIR:
without keep_largest function
with keep_largest function
The problem with the missing bottom of the segmentation is caused by the keep_largest function which only keeps the longest continuous chunk. Therefore, while previously only a small chunk in the middle of the sc was missing, now the pipeline only keeps the bigger part of the sc (often being the top part).
Ouch! That's a problem indeed. There was also another function, something like "remove small objects" (link here). Maybe that would be more appropriate?
The function "remove small objects" successfully removes a "small" block which is segmented in the eye. To reproduce the error of segmentation in the eye, I used:
Then to get the image without the small error in segmentation, I just added the "remove_small_objects" function in the script with min_size = 500 voxels.
The result is the following for subject sub-mon137:
The min_size can be discussed. In this case, the spinal cord segmentation without the error in the eye is 6812 voxels, and the segmentation in the eye is 120 voxels. Should min_size be set at 500 voxels? Should it be set as a certain percentage of the total volume (10%)? Should it be in mm3?
In this case, the size segmentation of the spinal cord without the error in the eye is 6812 voxels
Is it in the similar range for a few more subjects? We can decide on a certain percentage of the total volume by getting an average estimate from a few more subjects. What do you think?
Also, having a percentage (instead of a raw number) of voxels is better imo
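A sketch of a percentage-based variant, using scipy's connected-component labeling as a stand-in for skimage's remove_small_objects (the function name, the use of scipy, and the 10% default are assumptions):

```python
import numpy as np
from scipy import ndimage  # assumption: scipy available

def remove_small_objects_pct(mask: np.ndarray, pct: float = 0.1) -> np.ndarray:
    """Drop connected components smaller than `pct` of the total segmented
    volume, instead of a fixed voxel count (hypothetical variant of
    skimage's remove_small_objects)."""
    labels, n = ndimage.label(mask > 0)
    if n == 0:
        return mask
    sizes = np.bincount(labels.ravel())[1:]       # voxels per component
    min_size = pct * sizes.sum()                  # e.g. 10% of total volume
    keep = np.flatnonzero(sizes >= min_size) + 1  # component ids to keep
    return np.where(np.isin(labels, keep), mask, 0)
```

With the numbers above (cord: 6812 voxels, eye: 120 voxels), a 10% threshold (~693 voxels) would remove the eye blob while keeping the cord.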
Script: segment_sc_contrast-agnostic.sh
Steps:
- 0.5 threshold
- cropping: 64 x 192 x -1 (default)

QC can be found at: ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-17_with_vert_labeling
YML files for PSIR and STIR:
FILES_SEG:
- sub-edm029_ses-M0_PSIR.nii.gz
- sub-mon041_ses-M0_PSIR.nii.gz
- sub-mon036_ses-M0_PSIR.nii.gz
- sub-van116_ses-M0_PSIR.nii.gz
- sub-van154_ses-M0_PSIR.nii.gz
- sub-edm019_ses-M0_PSIR.nii.gz
- sub-edm038_ses-M0_PSIR.nii.gz
- sub-edm075_ses-M0_PSIR.nii.gz
- sub-mon014_ses-M0_PSIR.nii.gz
- sub-mon104_ses-M0_PSIR.nii.gz
- sub-mon152_ses-M0_PSIR.nii.gz
- sub-mon180_ses-M0_PSIR.nii.gz
- sub-van159_ses-M0_PSIR.nii.gz
- sub-van189_ses-M0_PSIR.nii.gz
- sub-tor092_ses-M0_PSIR.nii.gz
FILES_SEG:
- sub-cal095_ses-M0_STIR.nii.gz
- sub-cal078_ses-M0_STIR.nii.gz
- sub-cal115_ses-M0_STIR.nii.gz
- sub-cal160_ses-M0_STIR.nii.gz
- sub-cal198_ses-M0_STIR.nii.gz
Manual correction commands (note that -path-img and -path-label are the same because both images and SC segs are located under the same folders. Also note that we have to run manual_correction.py twice because PSIR and STIR have different -suffix-files-seg):
# PSIR
python manual_correction.py -config data_processed/seg_to_correct_PSIR.yml -path-img data_processed/data_to_correct_PSIR -path-label data_processed/data_to_correct_PSIR -suffix-files-seg _mul_pred_sum_bin_with_lesion_bin
# STIR
python manual_correction.py -config data_processed/seg_to_correct_STIR.yml -path-img data_processed/data_to_correct_STIR -path-label data_processed/data_to_correct_STIR -suffix-files-seg _pred_sum_bin_with_lesion_bin
Next steps:
sub-mon152
@valosekj Can you update the above TODO list if I have missed something?
In the above list, sub-mon152 was not manually corrected as it was part of exclude.yml but was forgotten in the config file. Should the SC seg be deleted for this subject?
In the above list, sub-mon152 was not manually corrected as it was part of exclude.yml but was forgotten in the config file. Should the SC seg be deleted for this subject?
If the SC seg (and the image quality) is bad, then yes, we will not include the SC seg for this subject in git-annex.
I ran the first run of the segment_sc_contrast-agnostic.sh script (PR https://github.com/ivadomed/canproco/pull/44) across canproco first-session (ses-M0) PSIR and STIR images. The script does the following:
- sct_maths -bin 0.5 (to make the prediction compatible with sct_label_vertebrae)
- sct_label_vertebrae
Initial observations
Spinal cord segmentation:
The contrast-agnostic MONAI model works well on STIR contrast (available for a single site (Calgary)); see the first part of this QC video. The model missed some SC parts, but they are located mainly in the outer slices where SC-CSF contrast is lower. Generally, the segmentations are very good.
The contrast-agnostic MONAI model performs significantly worse on PSIR contrast; see the second part of the QC video (from 00:30). For the PSIR contrast, the model either missed SC parts or segmented structures outside of the SC.
Note that I used the default 64x160x320 cropping. I will try running the prediction for the second time with R-L cropping.
Vertebral labeling:
sct_label_vertebrae fails during automatic C2-C3 disc detection in ~160 subjects --> those subjects will need manual labeling corrections. I am in favour of labeling all discs (instead of labeling only the init C2/C3 disc). This will be slightly more time-consuming, but we will be sure that the labeling will be okay.

Processed data and QC are saved in ~/duke/projects/canproco/canproco_contrast-agnostic_2023-10-07.

Tagging @sandrinebedard, @naga-karthik, @plbenveniste.