maxvanspengler / hyperbolic_learning_library

An extension of the PyTorch library containing various tools for performing deep learning in hyperbolic space.
MIT License

Torch DDP support #58

Open jhindel opened 7 months ago

jhindel commented 7 months ago

Hi, very nice package! I was wondering if you are planning to extend the modules to support DDP (distributed data parallel training). For me, a model that includes hyperbolic Conv2D layers runs on a single GPU, but it cannot be converted to a multi-GPU setting: the run freezes during the conversion. When I replace the hyperbolic Conv2D layers with the standard PyTorch Conv2D, the problem doesn't occur.

jhindel commented 7 months ago

I guess this could be the problem:

TypeError: Attempting to apply the torch function <method 'float' of 'torch._C._TensorBase' objects> on a ManifoldParameter. Use ManifoldParameter.tensor as argument to <method 'float' of 'torch._C._TensorBase' objects> instead.

maxvanspengler commented 7 months ago

Thanks! Yes, having the library work with the torch.distributed package is definitely something we want, so thanks for raising the issue :)

The error you found is indeed a good indication of what's going wrong, so thanks for adding it. We don't allow calling all of the usual torch.Tensor methods directly on the ManifoldTensor, ManifoldParameter and TangentTensor classes, as many of them can lead to mathematically undefined behaviour. Instead, we define only a few allowed methods ourselves and have a catch-all that throws the error you are seeing whenever any other torch.Tensor method is used. The list of torch.Tensor operations is rather long, so we decided to add new ones only when we have an explicit need for them. That is clearly the case here, as DDP relies on some that we don't allow. So to get this to work, we will need to add the methods that DDP needs to our own tensor classes.
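For readers unfamiliar with this pattern, here is a minimal sketch of an allow-list with a catch-all, built on PyTorch's `__torch_function__` protocol. It is an illustration only, not the library's actual code: `GuardedTensor`, `ALLOWED`, `allow` and `_clone` are hypothetical names, and choosing `torch.clone` as the one permitted operation is an assumption made for the example.

```python
import torch

# Hypothetical registry mapping torch functions to explicitly allowed handlers.
ALLOWED = {}


def allow(torch_function):
    """Register a handler for one explicitly allowed torch function."""
    def decorator(handler):
        ALLOWED[torch_function] = handler
        return handler
    return decorator


class GuardedTensor:
    """Illustrative stand-in for a manifold tensor class (not the library's code)."""

    def __init__(self, tensor):
        self.tensor = tensor  # the underlying plain torch.Tensor

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = {} if kwargs is None else kwargs
        if func not in ALLOWED:
            # Catch-all: refuse every operation that was not explicitly allowed.
            raise TypeError(
                f"Attempting to apply the torch function {func} on a "
                f"{cls.__name__}. Use {cls.__name__}.tensor as argument instead."
            )
        return ALLOWED[func](*args, **kwargs)


@allow(torch.clone)
def _clone(x):
    # Assumed to be safe for the example: copying keeps the wrapper intact.
    return GuardedTensor(x.tensor.clone())


x = GuardedTensor(torch.ones(3))
y = torch.clone(x)           # allowed: dispatched to _clone
total = torch.sum(x.tensor)  # fine: operates on the plain tensor
try:
    torch.sum(x)             # not allowed: hits the catch-all
except TypeError as err:
    print(err)
```

With this design, supporting a new caller such as DDP comes down to registering a handler for each torch operation it invokes on the wrapped objects. Which operations those are would still need to be determined.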

I'll have to find some time to dive into torch.distributed to see what we'll need to add and to find out how difficult this will be.