wasserth / TotalSegmentator

Tool for robust segmentation of >100 important anatomical structures in CT and MR images
Apache License 2.0

TotalSegmentator evaluation pipeline #143

Open denbonte opened 1 year ago

denbonte commented 1 year ago

Dear Jakob,

Good day and thanks again for all the work and efforts you are devoting to the project. Also, congrats on the recent Radiology AI publication!

Regarding that - and I'm sorry for posting this as an issue on GitHub, but I thought it might help others as well, e.g., people who are trying to replicate the training process or re-run it on in-house data - I wanted to ask a couple of simple questions.

I'm splitting these into sections to (hopefully!) make them more transparent and readable and save you some time 🙃

Test Data Resampling

Was the evaluation of the pipeline (i.e., the calculation of the Dice Coefficient and the Normalized Surface Distance) computed on the resampled dataset?

In other words, was the ground truth resampled to 1.5mm with the CT data and then compared to the output of TotalSegmentator - or was the pipeline's output resampled back to the original resolution and then compared with the ground truth? Of course, I'm assuming the native resolution of the scans is not 1.5mm isotropic (personally, I've never seen 1.5mm isotropic CT data, but it might be just a lack of past experience on my side)!

I think this is relevant because (I guess) most people are/will be running TotalSegmentator on data at native resolution and therefore comparing the model's output to labels (ground truth, but also other models' outputs) at native resolution. As you can imagine, this could skew the comparisons quite a bit: a model outputting a segmentation at 1.5mm isotropic won't score as well against labels computed at, e.g., 0.7x0.7x1.5mm as it would if the data were resampled first.
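
Just to make the second scenario concrete, resampling the 1.5mm output back onto the native-resolution grid before comparing could look roughly like the sketch below (this assumes SimpleITK; the file names are placeholders):

```python
import SimpleITK as sitk

pred_15mm = sitk.ReadImage("prediction_1.5mm_iso.nii.gz")   # model output at 1.5mm isotropic
gt_native = sitk.ReadImage("ground_truth_native.nii.gz")    # labels at native resolution

pred_native = sitk.Resample(
    pred_15mm,
    gt_native,                  # reference grid: size, spacing, origin, direction
    sitk.Transform(),           # identity transform
    sitk.sitkNearestNeighbor,   # nearest neighbour keeps label values intact
    0,                          # background value
    pred_15mm.GetPixelID(),
)
sitk.WriteImage(pred_native, "prediction_native.nii.gz")
```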

Dice Score Aggregation

How did you aggregate the Dice Score among all of the 104 (or 13) structures? E.g., did you average the Dice Scores for every structure, or weight each structure by the number of voxels, or...?

Again, I think this might be useful for people trying to replicate, extend, or compare models to the TotalSegmentator pipeline.


P.S. - I did not find the answers to these questions in the paper - and I very much apologize in advance if I overlooked them somehow. I hope these questions (and others that people might ask!) can help drive forward transparency, reproducibility, and the adoption of tools like TotalSegmentator in the community 😄

fedorov commented 1 year ago

I was not sure if this was worth a separate issue, but I also wanted to ask about the training dataset. The Zenodo dataset does not have any metadata accompanying the CT nifti files (unless I missed it), while the paper does mention the below:

The dataset contained a high variety of CT images, with differences in slice thickness, resolution, and contrast phase (native, arterial, portal venous, late phase, and others). Dual-energy CT images obtained using different tube voltages were also included. Different kernels (soft tissue kernel, bone kernel), as well as CT images from 8 different sites and 16 different scanners were included in the dataset; however, most images were acquired using a Siemens manufacturer. A total of 404 patients showed no signs of pathology, whereas 645 showed different types of pathology (tumor, vascular, trauma, inflammation, bleeding, other). Information regarding presence of pathologies was not available for 155 patients due to missing radiologic reports

Would it be possible to share the per-scan acquisition metadata and information about pathology?

wasserth commented 1 year ago

Good questions! The data was resampled to 1.5mm, the manual annotation was done on these 1.5mm images, and the evaluation was also done on these 1.5mm images. For each structure we calculated the Dice score, then we took the mean across all structures and subjects; no weighting was involved. If the ground truth mask is empty for a structure (e.g. an image of the abdomen contains an empty ground truth brain mask), then the Dice/Normalised Surface Distance score is set to NaN, and these NaN values are ignored when averaging. The Normalised Surface Distance is calculated using the function compute_surface_dice_at_tolerance from this package: https://github.com/deepmind/surface-distance, with tolerance_mm set to 3mm.
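
For reference, a minimal sketch of that aggregation (this is not the official evaluation script; it assumes binary NumPy masks and the surface-distance package linked above) could look like:

```python
import numpy as np
import surface_distance  # pip install git+https://github.com/deepmind/surface-distance.git


def dice_score(gt, pred):
    """Binary Dice; NaN if the ground truth mask is empty (structure not in the image)."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    if gt.sum() == 0:
        return np.nan
    return 2.0 * np.logical_and(gt, pred).sum() / (gt.sum() + pred.sum())


def nsd_score(gt, pred, spacing_mm, tolerance_mm=3.0):
    """Normalised Surface Distance at 3mm tolerance; NaN if the ground truth mask is empty."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    if gt.sum() == 0:
        return np.nan
    dists = surface_distance.compute_surface_distances(gt, pred, spacing_mm)  # spacing_mm: (z, y, x) voxel spacing
    return surface_distance.compute_surface_dice_at_tolerance(dists, tolerance_mm)


# With dice[subject, structure] filled from the per-structure scores above,
# the reported value is an unweighted mean that ignores the NaN entries:
# mean_dice = np.nanmean(dice)
```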

Does this answer your questions? Otherwise let me know!

wasserth commented 1 year ago

I was not sure if this was worth a separate issue, but I also wanted to ask about the training dataset. The Zenodo dataset does not have any metadata accompanying the CT nifti files (unless I missed it), while the paper does mention the below:

The dataset contained a high variety of CT images, with differences in slice thickness, resolution, and contrast phase (native, arterial, portal venous, late phase, and others). Dual-energy CT images obtained using different tube voltages were also included. Different kernels (soft tissue kernel, bone kernel), as well as CT images from 8 different sites and 16 different scanners were included in the dataset; however, most images were acquired using a Siemens manufacturer. A total of 404 patients showed no signs of pathology, whereas 645 showed different types of pathology (tumor, vascular, trauma, inflammation, bleeding, other). Information regarding presence of pathologies was not available for 155 patients due to missing radiologic reports

Would it be possible to share the per-scan acquisition metadata and information about pathology?

So far this is not available. I will add it in the next release of the dataset.

denbonte commented 1 year ago

Hey Jakob,

Many thanks for the swift reply, as always!

Does this answer your questions? Otherwise let me know!

It does for me - thanks!

fedorov commented 1 year ago

So far this is not available. I will add it in the next release of the dataset.

Thank you!

We at IDC would be rather interested to host this dataset. This would provide significant benefits to its users, I believe, since:

  • it will become possible to download individual scans/subsets as needed
  • download speeds should be much faster with IDC data in Google/Amazon buckets
  • it will be possible to visualize individual scans and segmentations in-browser

Only DICOM data can be hosted in IDC, and so we were considering converting the NIfTI CTs to DICOM, but a major part of DICOM's value lies in the metadata. Right now, the only metadata we have is Modality=CT and the image geometry (and, as I understand it, those are the resampled images). Whatever metadata you are able to share will hopefully be possible to inject into the DICOMs we would create. Ideally, we would like to have the original de-identified DICOM, but I understand sharing those is probably difficult due to de-identification challenges.

denbonte commented 1 year ago

As a developer, I think this is very interesting.

Zenodo can be very slow and it's impossible to pull just a subset of the dataset, or visualize the data before pulling dozens of GBs.

For instance, before opening this issue, I wanted to double-check whether the data Jakob uploaded to Zenodo was resampled to 1.5mm or not. Well, via browser or wget it was giving me 7 hours give or take, so I gave up 😅

wasserth commented 1 year ago

I see your points. Unfortunately, Zenodo seems to be quite slow sometimes. I agree that uploading to an AWS bucket would be better; then downloading the 28GB should also be possible in a reasonable time.

I think the dataset is mainly used by researchers doing image analysis. In this case you do not really work with DICOM images, but you typically convert them to NIfTI or NRRD to really work with them in Python. But this conversion process can already be a big pain. At least that is my experience in a lot of other projects. I am always happy if I can download the data as NIfTI and not as DICOMs. Therefore my plan would be to host the NIfTIs on a faster server and provide all the metadata as a CSV file. And with a size of 28GB the download of the entire dataset is still manageable.
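
If the metadata ends up as a CSV, selecting a subset of scans could then be as simple as the sketch below (the file name and column names are hypothetical, just to illustrate the idea):

```python
import pandas as pd

# Hypothetical metadata file and columns - not a released format.
meta = pd.read_csv("totalsegmentator_metadata.csv")

# e.g. keep only portal-venous scans with a known pathology label
subset = meta[
    (meta["contrast_phase"] == "portal venous") & meta["pathology"].notna()
]
print(subset["image_id"].tolist())
```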

At the moment I also cannot provide the original DICOMs publicly. I can only provide the resampled 1.5mm images.

denbonte commented 1 year ago

Hey Jakob,

I think the dataset is mainly used by researchers doing image analysis. In this case you do not really work with DICOM images, but you typically convert them to NIfTI or NRRD to really work with them in Python. But this conversion process can already be a big pain. At least that is my experience in a lot of other projects. I am always happy if I can download the data as NIfTI and not as DICOMs.

I see your point, of course - everyone is working with either NIfTI, NRRD, or MHA under the hood. Still, I personally don't love starting from data that has already been converted if I or some of my colleagues are not the ones who did the conversion.

I have seen so many cases of straightforward steps (as simple as clipping, resampling, and rotating a volume) that result in something unexpected and therefore yield results that are not reproducible (which is very detrimental for science). Furthermore, I personally prefer to have the segmentation saved in a format like DICOM SEG, linked back to the original data - so that no matter what other researchers want to do with the images and labels (resampling, cropping, resampling after cropping, or whatever), they have no constraints on how they use the data (which IMHO is crucial to enable the translation of some of these tools into the clinic, or in general to enable better science).

That's not a critique of your work, of course. What you guys have done is fantastic, and the tool does not really suffer from these problems (as TotalSegmentator/nnU-Net always resamples back to the original resolution anyway). I just wanted to provide a different point of view on the data question - because, as crazy as it sounds, over the last 10 years the number of large, high-quality datasets with segmentations shared publicly for research purposes has not increased much, if it ever did (and this can significantly slow down research).

At the moment I also cannot provide the original DICOMs publicly. I can only provide the resampled 1.5mm images.

I understand the challenges related to PHI handling and whatnot (and I can imagine it could be even more difficult, or simply different, in Switzerland as opposed to an EU country). Best of luck with that 🙃 if there is one person in the community who might be able to give you advice on procedures (like DICOM anonymization) that can ease the process, it is precisely @fedorov - so the moment you figure out what needs to be figured out, you'll be in excellent hands 😄

Thanks again for getting back and engaging in the conversation, Jakob!

ibro45 commented 1 year ago

Hi, thank you for your work!

To ensure reproducibility and fair comparison, would you please share the metric calculation code? It is not necessarily too difficult to reproduce it, but it is easy to omit a step - e.g., accounting for empty ground truth with NaNs or doing micro instead of macro averaging for Dice and Surface Distance.
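
Just to make the macro/micro distinction concrete (this is only an illustrative sketch with made-up numbers, not the official code):

```python
import numpy as np

# Hypothetical per-subject, per-structure Dice scores; NaN marks an empty
# ground truth mask (structure not present in that image).
dice = np.array([
    [0.95, 0.80, np.nan],
    [0.90, np.nan, 0.70],
])

# Macro averaging: every structure/subject score counts equally, NaNs ignored.
macro_dice = np.nanmean(dice)  # 0.8375

# Micro averaging would instead pool overlaps and volumes across structures
# before forming a single ratio, so large structures dominate, e.g.:
# micro_dice = 2 * intersections.sum() / (gt_volumes.sum() + pred_volumes.sum())
```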

wasserth commented 1 year ago

This is a good point. I will try to add this in the future.

wasserth commented 11 months ago

I added some instructions on how to train an nnU-Net from the public dataset and how to run the evaluation on it. This should help make the evaluation standardized between people: https://github.com/wasserth/TotalSegmentator/blob/master/resources/train_nnunet.md

fedorov commented 9 months ago

I think the dataset is mainly used by researchers doing image analysis. In this case you do not really work with DICOM images, but you typically convert them to NIfTI or NRRD to really work with them in Python. But this conversion process can already be a big pain. At least that is my experience in a lot of other projects. I am always happy if I can download the data as NIfTI and not as DICOMs. Therefore my plan would be to host the NIfTIs on a faster server and provide all the metadata as a CSV file. And with a size of 28GB the download of the entire dataset is still manageable.

@wasserth you may want to read the prominent paper below and the recommendations developed by its authors. Maybe this statement from a Nature journal is more convincing than my arguments above - and if not, at least I tried, and the recommendation is documented clearly here.

Many public datasets containing images taken from preprints receive these images in low-resolution or compressed formats (for example, JPEG and PNG), rather than their original DICOM format. This loss of resolution is a serious concern for traditional machine learning models if the loss of resolution is not uniform across classes, and the lack of DICOM metadata does not allow exploration of model dependence on image acquisition parameters (for example, scanner manufacturer, slice thickness and so on).

Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., Aviles-Rivero, A. I., Etmann, C., McCague, C., Beer, L., Weir-McCall, J. R., Teng, Z., Gkrania-Klotsas, E., Ruggiero, A., Korhonen, A., Jefferson, E., Ako, E., Langs, G., Gozaliasl, G., Yang, G., Prosch, H., Preller, J., Stanczuk, J., Tang, J., Hofmanninger, J., Babar, J., Sánchez, L. E., Thillai, M., Gonzalez, P. M., Teare, P., Zhu, X., Patel, M., Cafolla, C., Azadbakht, H., Jacob, J., Lowe, J., Zhang, K., Bradley, K., Wassin, M., Holzer, M., Ji, K., Ortet, M. D., Ai, T., Walton, N., Lio, P., Stranks, S., Shadbahr, T., Lin, W., Zha, Y., Niu, Z., Rudd, J. H. F., Sala, E., Schönlieb, C.-B. & AIX-COVNET. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0.