mlverse / torch

R Interface to Torch
https://torch.mlverse.org

optimizers wishlist #147

Open skeydan opened 4 years ago

skeydan commented 4 years ago

Adam (unsurprisingly I guess)

thanks!

skeydan commented 4 years ago

adding LBFGS just out of enthusiasm ;-) (not using anywhere yet but suspect it could be real good for some types of data)

dirkschumacher commented 4 years ago

Can I add LBFGS or is anyone already working on this?

dirkschumacher commented 4 years ago

Why don't you use the C++ implementations of the optimizers?

dfalbel commented 4 years ago

I am not working on LBFGS. Pinging @krzjoa as he worked on many other implementations.

The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions if using the C++ class. Other reasons are:

many of them are simple enough to be reimplemented.
the C++ implementation doesn't support parameter groups.

That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.
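For context, here is a minimal sketch (not code from this thread) of the kind of R-side extensibility meant here, using only the public torch::optimizer() interface; the gradient handling is simplified:

```r
library(torch)

# A plain gradient-descent optimizer written entirely in R via torch::optimizer().
# Illustrative sketch only; gradient/state handling is simplified.
optim_plain_sgd <- optimizer(
  "optim_plain_sgd",
  initialize = function(params, lr = 0.01) {
    defaults <- list(lr = lr)
    super$initialize(params, defaults)
  },
  step = function() {
    with_no_grad({
      for (group in self$param_groups) {
        for (param in group$params) {
          # skip parameters that received no gradient
          if (is.null(param$grad)) next
          # in-place update: param <- param - lr * grad
          param$add_(param$grad, alpha = -group$lr)
        }
      }
    })
  }
)

# usage: opt <- optim_plain_sgd(model$parameters, lr = 0.1)
```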

krzjoa commented 4 years ago

I haven't started working on LBFGS yet, so feel free to implement it, @dirkschumacher :wink:

dirkschumacher commented 4 years ago

Ok, given the complexity of LBFGS I would rather use an existing implementation than rewrite it in R. I will take a look and come back with questions :)

krzjoa commented 3 years ago

Here I would like to propose an initial road map for torch optimizers. In my opinion, our first goal could be to implement all optimizers that are currently present in PyTorch and Keras.

PyTorch optimizers:

Keras optimizers (not present in PyTorch):

Additionally, there are a couple of MXNet optimizers which don't appear above.

Some other fancier optimizers could be implemented in a separate library like pytorch-optimizer.

dirkschumacher commented 3 years ago

Maybe it would make sense to extract the optimizers into a different package anyway and re-export them in the torch package.

krzjoa commented 3 years ago

I understand your intention, but personally I would not recommend that (i.e. extracting and re-exporting the whole optim module). Optimizers cannot work without torch, so we'd create a cyclic graph of dependencies. 😉

dirkschumacher commented 3 years ago

Yeah, I agree. Cyclic dependencies are considered harmful :)

dfalbel commented 3 years ago

Sounds good to me! +1 for a new package for fancier optimizers, that would be really cool.

egillax commented 2 years ago

Hi @dfalbel ,

The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions if using the C++ class. Other reasons are:

many of them are simple enough to be reimplemented.
the C++ implementation doesn't support parameter groups.

That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.

Would it be possible to have optimizers that use the C++ objects? For example Adam or AdamW which are both available in libtorch. I believe this could speed up my training quite a bit since it seems that the optimizer$step() is a bottleneck compared to pytorch. Also if I'm reading the cpp docs correctly it seems C++ now supports parameter groups.
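(For reference, a hypothetical sketch of what parameter groups look like on the R side, assuming the PyTorch-style list-of-groups input; the module names and values below are made up:)

```r
library(torch)

# Hypothetical sketch: each parameter group is a list with a `params` element
# plus per-group options that override the optimizer-level defaults.
backbone   <- nn_linear(10, 32)
classifier <- nn_linear(32, 1)

opt <- optim_sgd(
  list(
    list(params = backbone$parameters),             # uses the default lr
    list(params = classifier$parameters, lr = 1e-3) # per-group learning rate
  ),
  lr = 1e-2
)
```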

dfalbel commented 2 years ago

Yes, in theory that's possible. I'll draft something in that direction and post it here.

Ideally I'd really want to figure out what's slowing down optimizers in R compared to PyTorch, because being able to inherit from other optimizers, etc., is really useful for research and verification. Still, I don't see what could be causing this; maybe the way we keep state? Anyway, I think having those C++-based optimizers is good.

dfalbel commented 2 years ago

@egillax Here's a POC binding directly to the C++ LibTorch optimizers:

https://github.com/dfalbel/torchoptx

I didn't benchmark it at all, but I'm curious to see if it's much faster than the R-based optimizers. Currently only SGD and Adam are supported, but it shouldn't be a lot of work to add the others. We would also need to figure out how to support serialization.
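A rough usage sketch of the drop-in idea; the constructor name below assumes torchoptx mirrors torch's optim_adam(), which may not match the actual exports:

```r
# remotes::install_github("dfalbel/torchoptx")
library(torch)

model <- nn_linear(10, 1)
x <- torch_randn(64, 10)
y <- torch_randn(64, 1)

# R-based optimizer:        opt <- optim_adam(model$parameters, lr = 1e-3)
# LibTorch-backed optimizer (assumed name, mirroring the torch API):
opt <- torchoptx::optim_adam(model$parameters, lr = 1e-3)

opt$zero_grad()
loss <- nnf_mse_loss(model(x), y)
loss$backward()
opt$step()
```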

egillax commented 2 years ago

Hi @dfalbel,

That was quick! But I can't open the repo, is it by any chance private?

dfalbel commented 2 years ago

Ohh sorry! Just made it public.

egillax commented 2 years ago

@dfalbel some preliminary results for a small ResNet run for 20 epochs on the same random data:

With the torch R Adam optimizer:

Average time per epoch was: 0.854 secs

With the torchoptx C++ Adam:

Average time per epoch was: 0.513 secs

And with PyTorch:

Average time per epoch was: 0.403 secs

So it's quite a bit faster than before and closer to PyTorch! And it's a drop-in replacement.

I'll test it more tomorrow; I have a Transformer model which showed larger differences between PyTorch and torch in R. I'll also post the code I use for testing if anyone is curious.
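(Not the actual test code; a minimal sketch of how per-epoch averages like the ones above could be measured. The model, data, and batch size are placeholders:)

```r
library(torch)

# Placeholder model and random data; swap in the ResNet/Transformer under test.
model <- nn_sequential(nn_linear(100, 256), nn_relu(), nn_linear(256, 1))
opt   <- optim_adam(model$parameters, lr = 1e-3)  # or the torchoptx equivalent

n <- 2048
batch_size <- 64
x <- torch_randn(n, 100)
y <- torch_randn(n, 1)

n_epochs <- 20
epoch_times <- numeric(n_epochs)

for (epoch in seq_len(n_epochs)) {
  t0 <- Sys.time()
  for (i in seq(1, n, by = batch_size)) {
    idx <- i:min(i + batch_size - 1, n)
    opt$zero_grad()
    loss <- nnf_mse_loss(model(x[idx, ]), y[idx, ])
    loss$backward()
    opt$step()
  }
  epoch_times[epoch] <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
}

cat("Average time per epoch:", round(mean(epoch_times), 3), "secs\n")
```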

dfalbel commented 2 years ago

@egillax Nice, this sounds promising! I think we should be able to achieve something very similar from R, especially on GPU where the operations are non-blocking.

This will probably make more of a difference the more parameters the network has, though.