lezcano / geotorch

Constrained optimization toolkit for PyTorch
https://geotorch.readthedocs.io
MIT License
657 stars, 34 forks

seems to fail with large tensors #30

Closed renjithravindran closed 2 years ago

renjithravindran commented 2 years ago

Hi, I was testing geotorch to do some SVD. Unfortunately, registering an orthogonal parametrization on a large embedding layer (30000x50) takes around 20 minutes, and the process gets killed when training starts. FYI: this is on PyTorch 1.10.1, CPU only.

Is there anything that can change this?

Thanks

lezcano commented 2 years ago

As it stands, alas, that is expected. It should be possible to get around it by using the class behind https://pytorch.org/docs/master/generated/torch.nn.utils.parametrizations.orthogonal.html That one has the orthogonal_map="householder" option, which should be notably more efficient for very tall matrices like yours. In particular, you should use the _Orthogonal class within https://pytorch.org/docs/master/_modules/torch/nn/utils/parametrizations.html#orthogonal in place of geotorch.SO, together with the option use_trivialization=False. That may work much better for your problem, as all the other options would instantiate a matrix of size 30000 x 30000 and that one wouldn't.

I'd recommend you monkey-patch your way to victory here. This would mean: take the LowRank class, for example, and overwrite the static method def manifolds(n, k, rank, tensorial_size, triv): with one that returns two torch.nn.utils.parametrizations._Orthogonal instances rather than Stiefel. I'd reckon that should do (modulo perhaps monkey-patching the _Orthogonal class with a couple of extra methods).

renjithravindran commented 2 years ago

Again, thanks for your lightning-fast responses! So to make this work, you suggest I use the _Orthogonal class from torch.nn.utils.parametrizations instead of geotorch.SO, with orthogonal_map="householder" and use_trivialization=False. This much is clear!

But I don't understand the next steps: how is the LowRank class associated with what I am trying to do?

As of now I have a fair intuition about the math behind these, so I think I can monkey-patch as required. However, do you think that with the right modifications I should be able to get reasonable computational costs for the size of matrices I am interested in?

thanks a lot!

renjithravindran commented 2 years ago

Also with these modifications, will things more or less look like the technique described here?

lezcano commented 2 years ago

I meant LowRank as an example of a class that does some SVD-like factorisation. I figured you were using one from geotorch.

If you're using the Stiefel class within your own class, then things are much simpler. Simply replace it with the _Orthogonal class from PyTorch and you should be good.
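For readers who build their own SVD-like module, here is one way the replacement could look. This is a hedged sketch under several assumptions: the class name LowRankFactor, the sizes, the random target, and the Adam settings are all hypothetical, and it uses the public orthogonal function (which wraps _Orthogonal) on bias-free nn.Linear layers instead of touching the private class or geotorch internals.

```python
import torch
from torch import nn
from torch.nn.utils.parametrizations import orthogonal


class LowRankFactor(nn.Module):
    """Hypothetical U diag(s) V^T factorization with orthonormal columns."""

    def __init__(self, n, m, rank):
        super().__init__()
        # nn.Linear(rank, n) stores a weight of shape (n, rank).
        self.U = orthogonal(nn.Linear(rank, n, bias=False), "weight",
                            orthogonal_map="householder",
                            use_trivialization=False)
        self.V = orthogonal(nn.Linear(rank, m, bias=False), "weight",
                            orthogonal_map="householder",
                            use_trivialization=False)
        self.s = nn.Parameter(torch.ones(rank))  # unconstrained scales

    def forward(self):
        U, V = self.U.weight, self.V.weight
        return (U * self.s) @ V.T  # shape (n, m)


torch.manual_seed(0)
n, m, rank = 2000, 300, 8
target = torch.randn(n, rank) @ torch.randn(rank, m)  # made-up data

model = LowRankFactor(n, m, rank)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for _ in range(20):  # a few gradient steps, only to show the plumbing
    opt.zero_grad()
    loss = (model() - target).pow(2).mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())

U, V = model.U.weight.detach(), model.V.weight.detach()
u_err = (U.T @ U - torch.eye(rank)).abs().max().item()
v_err = (V.T @ V - torch.eye(rank)).abs().max().item()
print(u_err, v_err)
```

The factors stay orthonormal after every optimizer step because the constraint is enforced by the parametrization rather than by a projection; the number of steps here is far too small to say anything about convergence.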

lezcano commented 2 years ago

About that paper: in some sense, yes. That paper is a number of years old, and I'm not 100% sure that their implementation at the time was correct. Now, I'm pretty certain that the implementation in _Orthogonal is correct, and it should be fairly efficient, as it uses cuBLAS behind the scenes.

renjithravindran commented 2 years ago

Okay, actually I am trying to do SVD with gradient descent. But what I am really trying to do is a Tucker decomposition; SVD is only a first step.

Let me try out your suggestions.

Thanks!!

lezcano commented 2 years ago

Any news?

renjithravindran commented 2 years ago

Hi Lezcano, glad you asked. I haven't gotten around to trying what you suggested. I broke the work I was doing into two parts: one that could use a more classical way (HOOI) of doing Tucker, and the other using gradients. I have been busy with the first part. Meanwhile, I did try an implementation of the Householder matrix technique for orthogonality, but that also gave me an OOM.

Thanks

lezcano commented 2 years ago

That technique is not the same as the one implemented in parametrizations.orthogonal in core PyTorch. I very much encourage you to use the one in core PyTorch and see if it works for your use case.

renjithravindran commented 2 years ago

Yes, I do intend to try it out! Thanks


lezcano commented 2 years ago

Any news on this end?

lezcano commented 2 years ago

Closing for now as this is expected. We should consider adding the householder parametrisation from core and just roll with that one.