Can you provide the metrics evaluation scripts for the WORDS and FLARE datasets?

OeslleLucena commented 1 year ago

I am trying to reproduce the STU-Net results and would like to use the same evaluation scripts you used to compute the Dice score for the WORD and FLARE datasets. I would appreciate it if the authors could make these scripts available. Best

Ziyan-Huang commented 1 year ago

Dear @OeslleLucena,

Thank you for reaching out. We primarily calculated the Dice Similarity Coefficient (DSC) for each class. You should be able to find relevant code for this quite easily. For instance, the official FLARE repository contains scripts for computing the DSC. We recommend checking there as a starting point.
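
For reference, the per-class DSC mentioned here is conventionally defined as below. The symbols $P_c$ and $G_c$ are notation introduced only for this illustration: they denote the sets of voxels assigned to class $c$ in the prediction and in the ground truth, respectively. The value is undefined when both sets are empty, and this thread does not state how such cases were handled.

```latex
\mathrm{DSC}_c = \frac{2\,\lvert P_c \cap G_c \rvert}{\lvert P_c \rvert + \lvert G_c \rvert}
```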

Best regards,

Ziyan Huang

OeslleLucena commented 1 year ago

Dear @Ziyan-Huang

Thank you for your response. Apologies on my side, as I think I did not make myself clear enough. What I meant is that STU-Net outputs segmentations for all labels from the TotalSegmentator dataset, and I would like to know how these labels were selected and merged for comparison with the ground truth of the WORD and FLARE datasets. For example, the WORD dataset has 16 labels; some of them, such as liver, are easy to match, but the rest differ somewhat from the TotalSegmentator ones. I hope that is clear enough. Many thanks in advance,

blueyo0 commented 1 year ago

Hi, @OeslleLucena

For WORD, we selected the 13 of its 16 classes that overlap with TotalSegmentator for inference and metric calculation; for FLARE22, all 13 categories were evaluated. You can refer to the appendix of our arXiv paper for details. To make it easier to run experiments and reproduce the results, we will release the code for direct inference soon.

Here is a simple Python dict showing which categories are selected; more details will follow soon 😉.

```python
# WORD: label ID -> class name. The three commented-out classes (11-13)
# are the ones excluded from inference and metric calculation.
Task560_WORD_sys = {
    "1": "liver",
    "10": "colon",
    # "11": "intestine",
    # "12": "adrenal",
    # "13": "rectum",
    "14": "urinary_bladder",
    "15": "femur_left",
    "16": "femur_right",
    "2": "spleen",
    "3": "kidney_left",
    "4": "kidney_right",
    "5": "stomach",
    "6": "gallbladder",
    "7": "esophagus",
    "8": "pancreas",
    "9": "duodenum"
}
# FLARE22: label ID -> class name. All 13 classes are evaluated.
FLARE22_sys = {
    "1":  "liver",
    "10": "esophagus",
    "11": "stomach",
    "12": "duodenum",
    "13": "left kidney",
    "2":  "right kidney",
    "3":  "spleen",
    "4":  "pancreas",
    "5":  "aorta",
    "6":  "IVC",
    "7":  "RAG",
    "8":  "LAG",
    "9":  "gallbladder"
}
```
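
Until the official inference code is released, a minimal sketch of how the WORD selection above might be applied when computing per-class Dice could look like this. It is not the authors' script: `per_class_dice` and `WORD_SELECTED` are hypothetical names, the prediction is assumed to have already been remapped from TotalSegmentator label IDs to WORD label IDs (that mapping is not given in this thread), and returning NaN for classes absent from both volumes is an assumed convention.

```python
import numpy as np

# Selected WORD classes, copied from the dict above with integer keys.
# IDs 11 (intestine), 12 (adrenal) and 13 (rectum) are left out.
WORD_SELECTED = {
    1: "liver", 2: "spleen", 3: "kidney_left", 4: "kidney_right",
    5: "stomach", 6: "gallbladder", 7: "esophagus", 8: "pancreas",
    9: "duodenum", 10: "colon", 14: "urinary_bladder",
    15: "femur_left", 16: "femur_right",
}


def per_class_dice(pred, gt, selected):
    """Return {class_name: DSC} for the selected label IDs only.

    Assumes `pred` and `gt` are integer label maps of identical shape that
    already use the same (dataset) label IDs. A class that is empty in both
    maps gets NaN so it can be skipped when averaging (assumed convention).
    """
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    scores = {}
    for label_id, name in selected.items():
        p = pred == label_id
        g = gt == label_id
        denom = int(p.sum()) + int(g.sum())
        if denom == 0:
            scores[name] = float("nan")
        else:
            scores[name] = 2.0 * int(np.logical_and(p, g).sum()) / denom
    return scores


if __name__ == "__main__":
    # Tiny synthetic example; label 11 is present but ignored by the selection.
    gt = np.array([[1, 1, 0], [2, 2, 11]])
    pred = np.array([[1, 0, 0], [2, 2, 11]])
    scores = per_class_dice(pred, gt, WORD_SELECTED)
    print({k: v for k, v in scores.items() if not np.isnan(v)})
    print("mean DSC over present classes:", np.nanmean(list(scores.values())))
```

The same function would work for FLARE22 by passing a dict with all 13 of its label IDs; averaging with `np.nanmean` simply skips classes that do not occur in a given case.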

Hope my answer can help you.

OeslleLucena commented 1 year ago

Hi @blueyo0, thank you loads for the details. Looking forward to the code for direct inference. Best!

uni-medical / STU-Net

Can you provide the metrics evaluation scripts for the WORDS and FLARE datasets? #14