pyplati / platipy

Processing Library and Analysis Toolkit for Medical Imaging in Python
https://pyplati.github.io/platipy/
Apache License 2.0

The last update of Platipy produces slightly different results #224

Open denbonte opened 11 months ago

denbonte commented 11 months ago

Hey Phil,

Good day! I hope you are doing great 🙃

As always, thanks so much for developing and maintaining Platipy - you know we are big fans!

As part of MHub we now have an automated pipeline we implemented to make sure every container is up to date and produces the expected results after we push potentially breaking changes to the core libraries, the model Docker images, and so on.

The automated pipeline just spotted slight differences between all of the segmentation masks we have as a reference (exported before the last Platipy update - I guess the last merge https://github.com/pyplati/platipy/commit/847b3f3c62368e5256571ccef5e870bd54578b04) and the segmentation masks the current version of Platipy produces.

Some screenshots from Slicer:

[screenshot]

[screenshot]

Although these are not big changes, I just wanted to point them out here because I'm not sure you guys noticed them. Looking into the commit, it seems like switching from np.product to np.prod yields different results (although I'm not sure it should? I thought one was an alias for the other, and it was kept around for legacy reasons).
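
For what it's worth, here is a quick sanity check of the alias assumption (on a NumPy version that still exposes np.product; the array is just an arbitrary example):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# np.product has historically been a plain alias for np.prod (deprecated in NumPy 1.25
# and removed in 2.0), so both calls should return exactly the same value here.
print(np.prod(a))      # 24.0
print(np.product(a))   # 24.0, on versions that still provide the alias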

Also, as you can imagine, on smaller structures this yields a Dice score of 0.5 or so (between last week's Platipy and the current version). CC'ing @LennyN95 since he'll surely be interested in this!

*edit: just to be clear, we deduced the change is down to Platipy simply because you guys had just merged that PR and we did not change anything on our end that should cause these differences!

pchlap commented 11 months ago

Hi @denbonte, thanks a lot for alerting us about this. I think it's really cool that your automated pipelines pick up things like this, super useful!

I am very curious about the cause here. I will certainly investigate further, but my initial reaction is that surely there must be something else going on other than the latest merge updating np.product to np.prod.

One thing I have observed in the past is that when certain 3rd party libraries are updated there can be knock-on effects on things like this. I remember once seeing that different versions of scikit-image would cause platipy to render RTSTRUCT contours into binary masks with slight differences at the boundary of the contours. Not yet exactly sure which library might be having such an effect in this scenario, but perhaps something similar is going on...

I will let you know what I find. I'll tag @rnfinnegan here; maybe he has some ideas as to the cause?

denbonte commented 11 months ago

Hey Phil,

I think it's really cool that your automated pipelines pick up things like this, super useful!

Thanks 🙃

my initial reaction is that surely there must be something else going on other than the latest merge updating np.product to np.prod.

Yeah, that was also what Leo and I immediately thought - but the coincidence was too strange not to mention (as I said, we currently rebuild and test containers very often, and this happened basically overnight).

One thing I have observed in the past is that when certain 3rd party libraries are updated there can be knock-on effects on things like this. I remember once seeing that different versions of scikit-image would cause platipy to render RTSTRUCT contours into binary masks with slight differences at the boundary of the contours. Not yet exactly sure which library might be having such an effect in this scenario, but perhaps something similar is going on...

Very good guess!

Once again, it's not something major (as far as we tested). We just wanted to let you know before somebody else using Platipy extensively comes to you and mentions this out of the blue!

Feel free to continue the discussion over e-mail if you prefer (you have Leo's e-mail already I believe - otherwise, we have our addresses in the contact page at mhub.ai!)

denbonte commented 9 months ago

Hello again @pchlap @rnfinnegan :-)

We found some more bizarre behaviors of Platipy when testing a freshly built container using our automated pipeline.

The difference this time is that every run now produces slightly different results! For instance, we got this by running the same container three times in a row (we got an alert since the Dice coefficient of most of the structures is off by anywhere from an almost negligible 1-2% to something more significant).

Run 1: [screenshot]

Run 2: [screenshot]

Run 3: [screenshot]

The three runs overlapped: [screenshot]

For your interest, we are running platipy==0.6.1.

Here's a link to a Google Drive folder with the segmentation masks displayed above (and the CT scan they were derived from, the Chest CT public sample from the 3DSlicer collection) saved as DICOM SEG (.seg.dcm) for transparency purposes:

https://drive.google.com/drive/folders/1LAb9zEph2wFoRoNwjI0PlsOeFH8esZ2P?usp=sharing

Since we wanted to make sure the problem was not on our end or in something we were doing inside the MHub container, we also ran Platipy from the CLI a bunch of times and again got different results on each run:

platipy segmentation cardiac image.nii.gz -o test1
platipy segmentation cardiac image.nii.gz -o test2
platipy segmentation cardiac image.nii.gz -o test3

I have uploaded these files to Google Drive as well.
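
For reference, a minimal way to quantify the run-to-run differences with SimpleITK looks something like this (the file names are hypothetical placeholders for the same structure taken from two of the runs above):

import SimpleITK as sitk

# Load the same structure produced by two separate runs and binarise it.
mask_a = sitk.ReadImage("test1/Heart.nii.gz") > 0
mask_b = sitk.ReadImage("test2/Heart.nii.gz") > 0

# A Dice coefficient below 1.0 means the two runs did not produce identical masks.
overlap = sitk.LabelOverlapMeasuresImageFilter()
overlap.Execute(mask_a, mask_b)
print(f"Dice: {overlap.GetDiceCoefficient():.4f}")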

As you know by now, we are very much Platipy fans - and we would love to have the model in our collection :-) For this to happen, we need to ensure the output is consistent for a given version of the pip installation (between different versions this is not a huge issue, as we will be versioning the Docker images as well).

LennyN95 commented 9 months ago

Hi Phil,

I hope you're doing well!

I've prepared a Colab notebook to demonstrate the problem. We install Platipy via !pip install platipy[cardiac] and run it three times on a public example patient from IDC.

https://colab.research.google.com/drive/1drxR5pGQuleKxPQIZjKdKnHZAkIc3l8I?usp=sharing

I hope this helps to reproduce (and hopefully resolve) the issue ;)

After all, Platipy was one of the very first PoC models we implemented on MHub and we're very interested in maintaining it on our platform.

pchlap commented 9 months ago

Hi @denbonte & @LennyN95,

thanks so much for your support in debugging here. I have a suspicion about what could be causing this, but will have to discuss with @rnfinnegan to confirm and figure out a solution.

The cause of the small differences we used to see was that the demons deformable registration was non-deterministic when using multiple cores to compute the registration. We have a setting in our extensive config which we set to 8 cores, and this would result in the small differences we previously saw.

However, when we set ncores to 1, this issue used to be resolved (even if the algorithm was slow to run). This ncores setting is propagated to the SetNumberOfThreads function of the SimpleITK FastSymmetricForcesDemonsRegistrationFilter object. While that function still exists, it seems it is no longer being respected in terms of how many cores are actually used. On a system with lots of CPUs, all cores are used, so lots of multithreading is happening, and perhaps this is the reason for the larger differences between runs.
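
For illustration, the single-core idea boils down to something like this (just a sketch, not platipy's actual code path; whether the per-filter call is honoured is exactly what is in question here):

import SimpleITK as sitk

# Per-filter limit, as the ncores setting is described to do:
demons = sitk.FastSymmetricForcesDemonsRegistrationFilter()
demons.SetNumberOfThreads(1)

# A global default can also be tried, in case the per-filter setting is ignored
# (assuming a SimpleITK version that exposes this static method):
sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(1)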

I still need to confirm this; there must be a way to force the use of a single core and confirm that the results are indeed consistent between runs... This will be slow, and none of this is ideal. I will talk to Rob about how we might be able to ensure stable deformable registration and solve this once and for all (I'm thinking of switching up the DIR method).

I'll chat to Rob tomorrow and will be in touch soon with hopefully a proposal for a solution :)

denbonte commented 9 months ago

Thank you Phil for the swift reply and for looking into this!

If this helps: the machine we ran this on (i.e., the one that yielded the results in the Drive folder) is equipped with a 64-core Ryzen - so what you wrote makes a lot of sense. However, the Colab instances are usually equipped with a single CPU (i.e., 2 vCPUs, basically logical cores) or at most a couple.

It might depend on the configuration @LennyN95 ran the notebook on, but I'm pretty convinced it's still at most a couple of physical cores.

I can also try to test it later, forcing the Docker container to run on one core only (it should be feasible, as Docker provides good built-in resource management tools).
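
Something along these lines should do the trick (the image name and trailing arguments below are placeholders, not our actual invocation):

docker run --cpuset-cpus=0 --gpus all <platipy-mhub-image> ...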

I will keep this issue updated and let's see how to move forward!

denbonte commented 9 months ago

However, when we set ncores to 1, this issue used to be resolved (even if the algorithm was slow to run). This ncores setting is propagated to the SetNumberOfThreads function of the SimpleITK FastSymmetricForcesDemonsRegistrationFilter object. While that function still exists, it seems it is no longer being respected in terms of how many cores are actually used. On a system with lots of CPUs, all cores are used, so lots of multithreading is happening, and perhaps this is the reason for the larger differences between runs.

I tested the Platipy MHub container, restricting the CPU resources to one logical core using --cpuset-cpus=42 (I checked with htop that everything was working as intended and can confirm the whole pipeline used just that logical core!), and I still got results that are not consistent between runs!

The processing time does not look that bad BTW (around 10 minutes overall, on a single core + GPU!)


For the sake of completeness, following up on what I wrote above regarding the Colab notebooks, this is the CPU HW on the standard (free) Colab instances:

lscpu

>> Architecture:            x86_64
>>   CPU op-mode(s):        32-bit, 64-bit
>>   Address sizes:         46 bits physical, 48 bits virtual
>>   Byte Order:            Little Endian
>> CPU(s):                  2
>>   On-line CPU(s) list:   0,1
>> Vendor ID:               GenuineIntel
>>   Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
>>     CPU family:          6
>>     Model:               85
>>     Thread(s) per core:  2
>>     Core(s) per socket:  1
>> ...

While this is the CPU HW on the most premium (non-free) Colab instances (with 80+ gigs of RAM and a NVIDIA A100):

lscpu

>> Architecture:            x86_64
>>   CPU op-mode(s):        32-bit, 64-bit
>>   Address sizes:         46 bits physical, 48 bits virtual
>>   Byte Order:            Little Endian
>> CPU(s):                  12
>>   On-line CPU(s) list:   0-11
>> Vendor ID:               GenuineIntel
>>   Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
>>     CPU family:          6
>>     Model:               85
>>     Thread(s) per core:  2
>>     Core(s) per socket:  6

pchlap commented 9 months ago

Thanks for the investigation @denbonte! OK, so we can rule out the multiprocessing issue... But there is something going on during the atlas registration phase... I will investigate the DIR method used and some others available and will run a few experiments.

pchlap commented 9 months ago

I think I have found the culprit! It actually wasn't in the DIR, but in the rigid registration that is performed prior to the DIR. It turns out it is the random metric sampling that introduces the randomness: by specifying a seed in the SetMetricSamplingPercentage function, the rigid registration is consistent across runs.
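
For context, the mechanism looks roughly like this in a generic SimpleITK registration setup (a sketch, not platipy's exact code; the metric choice and sampling percentage are arbitrary):

import SimpleITK as sitk

registration = sitk.ImageRegistrationMethod()
registration.SetMetricAsMattesMutualInformation(50)
registration.SetMetricSamplingStrategy(registration.RANDOM)

# Without an explicit seed, the random metric sampling is seeded from the wall clock,
# so each run samples different voxels and converges to a slightly different transform.
# Fixing the seed makes the rigid registration reproducible across runs.
registration.SetMetricSamplingPercentage(0.25, seed=42)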

What I still can't answer is why this issue has presented itself (or at least become more severe) so recently. I am testing this with previous versions of SimpleITK, and this behaviour appears to be present since SimpleITK >= 2.0.0...

I am just running a few more tests and will prepare an update. I plan to merge this together with a branch that @rnfinnegan has been working on to optimize some of the DIR (#198). This would have very slightly modified the resulting contours anyway so I think it's a good time to add this and hopefully have a stable Platipy cardiac tool from version 0.7.0 onwards.

I'll let you know how it goes, let me know if there are any concerns around any of this.

pchlap commented 9 months ago

OK, I think this is looking good: the results are now exactly the same between runs and the segmentation quality is looking good. We can run with as many cores as we want too, and that's not affecting the results. So I will look at merging #198 and creating release 0.7.0 in the next couple of days. Let me know in case I have missed anything; you can test out the rectified version using:

pip install git+https://github.com/pyplati/platipy.git@visualisation-and-registration-updates

Again a massive thanks to @denbonte and @LennyN95 for identifying this issue and helping debug. Really appreciated.

pchlap commented 9 months ago

And for those interested, this was the change that solved everything: https://github.com/pyplati/platipy/pull/198/commits/f05a5f7c73a8975af91014cdcbc421f18434cc2b

denbonte commented 9 months ago

Hi again @pchlap - and thank you and @rnfinnegan so much for looking into this so fast :-D

So I will look at merging https://github.com/pyplati/platipy/pull/198 and creating release 0.7.0 in the next couple of days. Let me know in case I have missed anything; you can test out the rectified version using:

Amazing 🎉

What I still can't answer is why this issue has presented itself (or at least become more severe) so recently. I am testing this with previous versions of SimpleITK, and this behaviour appears to be present since SimpleITK >= 2.0.0...

That's a good point, indeed. For your interest, I found these relevant issues while surfing on GitHub and Google a bit, looking for similar errors.

https://github.com/InsightSoftwareConsortium/SimpleITK-Notebooks/issues/308 https://github.com/SimpleITK/SimpleITK/issues/342

This also looks like it could be related (it's ITK-based after all): https://github.com/ANTsX/ANTs/wiki/antsRegistration-reproducibility-issues

I guess we can blame this on SimpleITK's non-intuitive behaviour, its poor (or non-existent) Python documentation, and its barely navigable C++ documentation - although it remains to be understood why this behaviour popped up out of nowhere (we have been using platipy for MHub testing for months now and have never noticed anything similar).

I feel like we should probably report this issue somewhere, as this could be huge for people trying to deploy reproducible pipelines based on SimpleITK/ITK (i.e., most of the pipelines doing registration using open software, I guess) - although I don't know what the best place would be.

Also,

And for those interested, this was the change that solved everything: https://github.com/pyplati/platipy/commit/f05a5f7c73a8975af91014cdcbc421f18434cc2b

registration.SetMetricSamplingPercentage(sampling_rate, seed=42)

silent appreciation for the 42.

P.S. - CC'ing @fedorov as I think this could be interesting for him to at least know!

pchlap commented 9 months ago

PlatiPy 0.7.0 is now released and available on PyPI 🚀

That's a good point, indeed. For your interest, I found these relevant issues while surfing on GitHub and Google a bit, looking for similar errors.

InsightSoftwareConsortium/SimpleITK-Notebooks#308 SimpleITK/SimpleITK#342

This also looks like it could be related (it's ITK-based after all): https://github.com/ANTsX/ANTs/wiki/antsRegistration-reproducibility-issues

Thanks for digging out these links @denbonte; they do indeed seem to be related to the issue we discovered. However, as you say, they don't indicate why it happened all of a sudden.

I feel like we should probably report this issue somewhere, as this could be huge for people trying to deploy reproducible pipelines based on SimpleITK/ITK (i.e., most of the pipelines doing registration using open software, I guess) - although I don't know what the best place would be.

Yeah, I also don't really know the best place. Perhaps the best we can do is rename the title of this issue to something like: "SimpleITK image registration random sampling causing inconsistent results across runs in cardiac segmentation". This might allow those who Google this problem in the future to find our discussion here and ultimately the solution...

fedorov commented 9 months ago

I wonder if Matt McCormick aka @thewtex has any thoughts about this, or might be interested to at least know about this finding!

denbonte commented 9 months ago

Dear @pchlap

Sorry for yet another message 😅

I don't think this is down to Platipy, but I believe you guys might find this helpful as well.

We noticed that platipy==0.7.0 (and I assume other versions as well, though I haven't tested; I believe this could be related to SimpleITK, again) with SimpleITK==2.2.1 produces different results when run with or without a GPU, both in a Docker container (MHub) and in a Conda environment (I made the GPU unusable by simply running export CUDA_VISIBLE_DEVICES=""). Note that the heart segmentation (via nnU-Net) is identical, but all the other substructures are not (the Dice varies from 0.95 down to as low as 0.3 sometimes).
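
For completeness, the CPU-only runs were obtained simply by hiding the GPU before invoking the tool, along these lines (the output folder name is just an example):

export CUDA_VISIBLE_DEVICES=""
platipy segmentation cardiac image.nii.gz -o test_cpu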

You can find the results of all the runs here: https://drive.google.com/drive/folders/1cbnzuR0VYBkdkKBLP-_BY8CzB7W1zN55

pchlap commented 9 months ago

Thanks for these further insights @denbonte, always appreciated!

It is disappointing to see that there are still varying results under different conditions. I agree it is most likely due to SimpleITK. I will try to investigate a bit, but I fear there might not be a straightforward solution to this one...