Open · skeydan opened this issue 4 years ago
adding LBFGS
just out of enthusiasm ;-)
(not using anywhere yet but suspect it could be real good for some types of data)
Can I add LBFGS or is anyone already working on this?
Why don't you use the C++ implementations of the optimizers?
I am not working on LBFGS. Pinging @krzjoa as he worked on many other implementations.
The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:
- many of them are simple enough to be reimplemented;
- the C++ implementation doesn't support parameter groups.
That said, we can have optimizers that instantiate the C++ object when initializing and have the step method call the C++ step method.
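To illustrate the "simple enough to be reimplemented" point, here is a minimal, dependency-free sketch of a plain-R SGD update on ordinary numeric vectors. The name `sgd_step` and the list-of-vectors representation are mine for illustration only; they are not the torch package's API, which works on tensors.

```r
# Hypothetical sketch: a bare SGD step in base R. Parameters and
# gradients are plain numeric vectors held in named lists, just to
# show how small many optimizer update rules are.
sgd_step <- function(params, grads, lr = 0.1) {
  # p <- p - lr * g for every parameter in the list
  mapply(function(p, g) p - lr * g, params, grads, SIMPLIFY = FALSE)
}

params <- list(w = c(1, 2), b = 0.5)
grads  <- list(w = c(0.5, 0.5), b = 1)
params <- sgd_step(params, grads)
params$w  # 0.95 1.95
params$b  # 0.4
```

A real implementation would also carry state (e.g. momentum buffers), which is where keeping everything on the R side pays off for extensibility.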
I haven't started working on LBFGS yet, so feel free to implement it, @dirkschumacher :wink:
Ok, given the complexity of LBFGS I would rather use an existing implementation than rewrite it in R. I'll take a look and come back with questions :)
Here I would like to propose an initial road map for torch optimizers.
In my opinion, our first goal could be to implement all optimizers that are currently present in PyTorch and Keras.
PyTorch optimizers:
Keras optimizers (not occurring in PyTorch):
Additionally, there are a couple of MXNet optimizers that don't appear above.
Some other fancier optimizers could be implemented in a separate library like pytorch-optimizer.
Maybe it would make sense to extract the optimizers into a different package anyway and re-export them in the torch package.
I understand your intention, but personally I would not recommend that (i.e. extracting and re-exporting the whole optim module). Optimizers cannot work without torch, so we'd create a cyclic graph of dependencies. 😉
Yeah, I agree. Cyclic dependencies are considered harmful :)
Sounds good to me! +1 for a new package for fancier optimizers! That would be really cool.
Hi @dfalbel ,
The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:
- many of them are simple enough to be reimplemented;
- the C++ implementation doesn't support parameter groups.

That said, we can have optimizers that instantiate the C++ object when initializing and have the step method call the C++ step method.
Would it be possible to have optimizers that use the C++ objects? For example Adam or AdamW, which are both available in libtorch. I believe this could speed up my training quite a bit, since optimizer$step() seems to be a bottleneck compared to PyTorch. Also, if I'm reading the C++ docs correctly, it seems C++ now supports parameter groups.
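For context on what parameter groups buy you, here is a hedged plain-R sketch of the idea: each group carries its own parameters plus its own hyperparameters (here just a learning rate). All names (`make_groups`, `step_groups`) are mine for illustration, not torch or libtorch API.

```r
# Illustrative sketch: parameter groups as a list of lists, where each
# group has its own hyperparameters. Not the actual torch/libtorch API.
make_groups <- function(...) list(...)

step_groups <- function(groups) {
  lapply(groups, function(g) {
    # apply a plain SGD update using this group's own learning rate
    g$params <- mapply(function(p, grad) p - g$lr * grad,
                       g$params, g$grads, SIMPLIFY = FALSE)
    g
  })
}

groups <- make_groups(
  list(params = list(c(1, 1)), grads = list(c(1, 1)), lr = 0.1),
  list(params = list(c(1, 1)), grads = list(c(1, 1)), lr = 0.01)
)
groups <- step_groups(groups)
groups[[1]]$params[[1]]  # 0.9 0.9
groups[[2]]$params[[1]]  # 0.99 0.99
```

The typical use case is giving different layers different learning rates or weight decay, which is why supporting groups in a C++-backed optimizer matters.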
Yes, in theory that's possible. I'll draft something in that direction and post it here.
Ideally I'd really like to figure out what's slowing down optimizers in R compared to PyTorch, because being able to inherit from other optimizers etc. is really useful for research and verification. Still, I don't see what could be causing this; maybe the way we keep state? Anyway, I think having those C++-based optimizers is good.
@egillax Here's a POC binding directly to the C++ LibTorch optimizers:
https://github.com/dfalbel/torchoptx
I didn't benchmark at all, but I'm curious to see if it's much faster than the R-based optimizers. Currently only SGD and Adam are supported, but it shouldn't be a lot of work to add the others. We would also need to figure out how to support serialization.
Hi @dfalbel,
That was quick! But I can't open the repo, is it by any chance private?
Ohh sorry! Just made it public.
@dfalbel some preliminary results for a small ResNet run for 20 epochs on the same random data:
- torch R Adam optimizer: average time per epoch was 0.854 secs
- torchoptx C++ Adam: average time per epoch was 0.513 secs
- PyTorch: average time per epoch was 0.403 secs

So it's quite a bit faster than before and closer to PyTorch! And it's a drop-in replacement.
I'll test it more tomorrow, I have a Transformer model which showed larger differences between pytorch and torch in R. I'll also post the code I use for testing if anyone is curious.
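Until that code is posted, a minimal timing harness for this kind of per-epoch comparison could look like the sketch below. All names (`time_epochs`, the dummy `step_fn`) are mine and hypothetical; this is not the benchmark code used above.

```r
# Hypothetical timing harness: run `step_fn` once per "epoch" and
# return the mean wall-clock time in seconds. A real benchmark would
# call the model's full train loop (forward, backward, optimizer step).
time_epochs <- function(step_fn, n_epochs = 3) {
  times <- vapply(seq_len(n_epochs), function(i) {
    t0 <- Sys.time()
    step_fn()
    as.numeric(difftime(Sys.time(), t0, units = "secs"))
  }, numeric(1))
  mean(times)
}

# stand-in for an epoch of training
avg <- time_epochs(function() Sys.sleep(0.01))
avg > 0  # TRUE
```

For GPU runs you'd also want to synchronize before reading the clock, since CUDA operations are queued asynchronously.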
@egillax Nice, this sounds promising! I think we should be able to achieve something very similar from R, especially on the GPU, where the operations are non-blocking.
This will probably make more of a difference the more parameters the network has, though, for Adam (unsurprisingly, I guess). Thanks!