giotto-ai / giotto-deep

Deep learning made topological.
Other
76 stars 11 forks source link

Parallel training #133

Open yorickbrunet opened 9 months ago

yorickbrunet commented 9 months ago

Types of changes

Description This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: pytorch's FSDP and pipeline tools.

The version of pytorch is increased to 1.13.1 to support the necessary features of FSDP.

To allow the parallelisation to be efficiently run, a new sampler GiottoSampler is defined that combines DistributedSampler and SubsetRandomSampler.

A benchmark tool allows running a model with different batch sizes on different GPUs and number of GPUs to compare the parallelised and not-parallelised runs. A generator of Kubernetes pods takes some GKE details as input and outputs the pod configurations, allowing a user to build its own configurations for its own cluster.

Any other comments?

Checklist

review-notebook-app[bot] commented 9 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

CLAassistant commented 9 months ago

CLA assistant check
All committers have signed the CLA.

raphaelreinauer commented 9 months ago

@yorickbrunet Thanks for your contribution. Please let me know when the PR is ready for review.

yorickbrunet commented 9 months ago

@raphaelreinauer All the tests pass on linux containers. I suppose you can start your review. Thanks!

raphaelreinauer commented 9 months ago

@yorickbrunet Thanks for your contribution. Please let me know when the PR is ready for review.

Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

matteocao commented 9 months ago

@raphaelreinauer I see that @yorickbrunet answered to all your comments. Are you satisfied or is there anything else you would like to discuss? If there is nothing more, I'll merge the PR.

raphaelreinauer commented 9 months ago

@matteocao thanks for checking in and thanks @yorickbrunet for your answers so far.

However, there's one key aspect that still needs attention - the PR description:

Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

@yorickbrunet Could you please add a detailed description of all the features you added, then I can do a detailed PR review.

yorickbrunet commented 8 months ago

@raphaelreinauer we answered or fixed all comments. Can you have another look, close the issues that can be closed, and maybe approve the PR? Thanks

raphaelreinauer commented 8 months ago

Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

yorickbrunet commented 8 months ago

Hi @yorickbrunet, thank you so much for your hard work on this! 😊 I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.

Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks

Hi @raphaelreinauer. I improved the general description of the PR, but the important description was there: This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: pytorch's FSDP and pipeline tools. Even though there are 52 files modified, the modifications are quite small in many of them. The most interesting file is gdeep/trainer/trainer.py, where most of the modifications were done.

raphaelreinauer commented 8 months ago

Thanks, @yorickbrunet, for the changes. I'll review the changes by next week Friday.

raphaelreinauer commented 7 months ago

Unfortunately, I didn't have time to review the PR last weekend, but I'll do it this weekend. Sorry for the delay.

raphaelreinauer commented 6 months ago

Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.

yorickbrunet commented 6 months ago

Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.

Hi @raphaelreinauer I answered all comments with either a modification of the code or some justification why it cannot/won't be modified. Basically, the project is closed and we cannot spend two more weeks adding tests, retesting the setup after modification, etc.

yorickbrunet commented 6 months ago

Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.

@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

VascoSch92 commented 6 months ago

Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.

@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.

@yorickbrunet for me is good. You can resolve the discussions ;-) (i cannot)