Suggestion for an improvement of the weights download process

fedorov commented 1 year ago

First, thank you again for developing this wonderful tool!

Our analysis workflow, of course, needs to download the weights. And, as everyone knows, downloading them from Zenodo is absolutely not practical. We did find these lines in your code:

https://github.com/wasserth/TotalSegmentator/blob/2226a051b999f7ccb42be6398d6f7da31dbf93ba/totalsegmentator/libs.py#L103-L132

and initially replicated the same workaround. However, it is highly sub-optimal, since download is done via HTTP, from an unknown server, the content of that server cannot be tracked or verified, and this capability can be pulled back any time.

To improve this, we experimented with attaching weights files as binary attachments to a Github release (see documentation here: https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas - there is a limit of 2 GB per attachment, which is sufficient to accommodate current weights). Based on those experiments, on the same connection, download of the same weights file is orders of magnitude faster from GitHub release:

Download from HTTP server:

vs download from Github release:

We can definitely do this in our own repo by attaching the weights and downloading them from our GitHub repo, since the weights are distributed under a permissive CC-BY license, but it would be ideal if they were distributed in this repository instead.

Would you consider doing this? For now, we will use our own repo (which we will not advertise), but we would be happy to switch to using this repo instead, whenever it is ready.

cc: @vkt1414 @denbonte

wasserth commented 1 year ago

This is a great idea. I did not know that you can upload unlimited number of files up to 2GB of size to github for free. I uploaded the files to a release and this will be used in the next version. This next version (v2.0.0) will contain several changes i.e. several new classes. You can see the new classes here: https://github.com/wasserth/TotalSegmentator/blob/add_nnunetv2/resources/improvements_in_v2.md and here https://github.com/wasserth/TotalSegmentator/blob/add_nnunetv2/totalsegmentator/map_to_binary.py

The new version is on the branch: add_nnunetv2

If you have some time to have a look at the SNOMED mapping for the new classes that would be great! This is the location of the mapping. I did not change it so far. https://github.com/wasserth/TotalSegmentator/blob/add_nnunetv2/resources/totalsegmentator_snomed_mapping.csv

fedorov commented 1 year ago

@wasserth thank you for the quick response and adoption of this for v2!

BUT we are using v1 and we need to have access to persistent and fast hosting of the v1 models.

Could you please make a v1 release with the corresponding weights? Otherwise, we would need to host them under our own repository, which would be extremely undesirable (but unfortunately unavoidable).

wasserth commented 1 year ago

Sure. I am currently uploading them. Should be done in 10min.

denbonte commented 1 year ago

Hey Jakob,

Thank you for your help on this - we are also using TotalSegmentator in a bunch of projects and this will help :-)

Also - can I ask what are the plans for the pip release? Were you thinking about updating the pip-installable package so that pip install totalsegmentator installs v2 by default, or about having a pip-installable package of v2? Something like pip install totalsegmentator_v2?

For reproducibility (and not to break all the projects of people that are working with v1 and maybe will want to do a fresh install of it on another machine, for consistency) I think the second (so pip install totalsegmentator_v2 or something, pretty much as the nnunet team did) would be way better!

Happy to hear your thoughts about this!

wasserth commented 1 year ago

Probably next week I will merge the v2 branch with the master branch and update the pip package. It will remain pip install totalsegmentator. If you want the old version you can simply do pip install totalsegmentator==1.5.6. I think this is the cleanest approach. From a user perspective v2 is actually pretty similar. The only "breaking" change is the order of the classes in the multilabel output.

denbonte commented 1 year ago

Hey Jakob,

I see your point. But, besides the order of the classes (obviously different), didn't you drop some of the segmentation classes (e.g., heart chambers, vetrebrae - but I might have misunderstood the v2 README!)

Assuming this is the case, say I want to use TotalSegmentator v1 to segment the heart chambers and then v2 to segment everything else. The downside of having one pip-installable package only, if the dependencies are the same, is that the user would need two conda environments (or similar) to run both.

If there were both TotalSegmentatorV1 and V2 as pip-installable packages, this would make users' life easier in my opinion.

I'm fairly sure some of the classes that got changed are being used by someone that would also love to try the new TotalSegmentator without setting up a fresh environment (e.g., to run both in a single pipeline).

Of course, this is just my dev point of view - feel free to disregard it, but I think it can be useful to have these sort of discussions with the community before pushing such a big change 🙃

kbressem commented 1 year ago

I agree with @denbonte that users might want to use V1 and V2 in parallel (at least I do 😄).

Regarding hosting the weights, maybe HuggingFace could also be another option? It's free, fast and with no size limit. Connection is HTTPS and they also add safe tensors to the repo, so no malware can be hidden in the pickled weights. For example, Microsoft hosts a pretrained ResNet50 at HF (https://huggingface.co/microsoft/resnet-50).

fedorov commented 1 year ago

@wasserth thank you for uploading model weights as release attachment - I consider the issue I raised as resolved!

fedorov commented 1 year ago

@wasserth I realized that I missed one key point in the suggestion of this improvement.

I think it is absolutely critical that weights continue to be maintained in Zenodo, and the attachments in the release match the weights available in Zenodo.

GitHub releases, unfortunately, cannot replace Zenodo, since, once the release has been published, it is possible to remove or replace those binary attachments, while in Zenodo once a file is published, it is by design very difficult to remove it and it is impossible to tamper with any file without triggering a new Zenodo dataset version.

I strongly encourage you to keep publishing the updated weights as part of the existing Zenodo record, or in new records, and pointing to those Zenodo descriptors from the release to document the provenance of those binary attachments.

denbonte commented 1 year ago

Dear All,

GitHub releases, unfortunately, cannot replace Zenodo, since, once the release has been published, it is possible to remove or replace those binary attachments, while in Zenodo once a file is published, it is by design very difficult to remove it and it is impossible to tamper with any file without triggering a new Zenodo dataset version.

I strongly encourage you to keep publishing the updated weights as part of the existing Zenodo record, or in new records, and pointing to those Zenodo descriptors from the release to document the provenance of those binary attachments.

Plus, Zenodo DOI will guarantee a potential citation everytime someone uses the weights for further modelling or in a third-party pipeline - which I think it's only fair.

wasserth / TotalSegmentator

Suggestion for an improvement of the weights download process #172