jxbz / signSGD

Code for the signSGD paper
https://arxiv.org/abs/1802.04434

Where is SignSGD performed? #1

Open manishadubey91 opened 6 years ago

manishadubey91 commented 6 years ago

I am unable to figure out where exactly the sign of the gradient is being taken (except in the toy example).

jxbz commented 6 years ago

Hi @manishadubey91, sorry this is unclear. You have to pass in the optimiser as a command line argument. For example:

python train_resnet.py --optim signum --lr 0.0001 --wd 0.00001

This works because Signum is implemented in the MXNet deep learning framework (see this page). I can also share PyTorch code for the optimiser if that would help.
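In the meantime, a minimal PyTorch sketch of a Signum-style update (signSGD with momentum) might look like the following. The class name and hyperparameter defaults are illustrative, and this is not the exact code used for the paper's experiments:

import torch
from torch.optim import Optimizer

class Signum(Optimizer):
    """Sketch of signSGD with momentum: m_t = beta*m_{t-1} + (1-beta)*g_t, w -= lr*sign(m_t)."""
    def __init__(self, params, lr=1e-4, momentum=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta = group['momentum']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                if group['weight_decay'] != 0:
                    # add L2 weight decay directly to the gradient
                    grad = grad.add(p, alpha=group['weight_decay'])
                state = self.state[p]
                if 'momentum_buffer' not in state:
                    state['momentum_buffer'] = torch.zeros_like(p)
                buf = state['momentum_buffer']
                buf.mul_(beta).add_(grad, alpha=1 - beta)
                # torch.sign maps 0 -> 0, i.e. the ternary variant discussed below
                p.add_(torch.sign(buf), alpha=-group['lr'])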

amitport commented 4 years ago

This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303

It seems to be using 2 bits: (-1, 0, 1).

jxbz commented 4 years ago

> This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303
>
> It seems to be using 2 bits: (-1, 0, 1).

Hi @amitport, you're right and thanks for pointing this out. In this paper, we used an implementation of the sign function that quantised positive gradients to +1, negative gradients to -1, and zero gradients to 0. I think this was done at the time under the (naïve) assumption that a gradient component being exactly zero was unlikely to occur in practice. I'm planning to run some experiments to test if/how much this makes a difference to convergence, and will report back.
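For concreteness, that ternary behaviour is just what the standard sign function in most frameworks already does (shown here with torch.sign purely as an illustration, not the MXNet code linked above):

import torch

g = torch.tensor([-2.5, 0.0, 3.1])
print(torch.sign(g))  # tensor([-1.,  0.,  1.]) -- exact zeros stay at 0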

jxbz commented 3 years ago

Hi @amitport, I tested the difference between the version that sends sign(0) --> 0 and the version that sends sign(0) --> ±1 at random. The tests and results are in this Jupyter notebook. At least for training ResNet-18 on CIFAR-10, there was little difference between the two implementations.

That being said, in the distributed experiments in the ICLR 2019 paper, we used an implementation of the sign function that maps sign(0) --> +1 deterministically. So if this issue still bothers you (it bothers me) then it's safer to look at the experimental results in that paper. The compression in that paper is carried out in bit2byte.cpp which gets called by compressor.py.
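As an illustration only (this is not the actual code in bit2byte.cpp or the notebook), the two one-bit variants being compared could be sketched like this:

import torch

def sign_random_zero(g):
    # sign(0) -> ±1 uniformly at random; nonzero entries keep their sign
    s = torch.sign(g)
    coin = torch.where(torch.rand_like(g) < 0.5,
                       torch.ones_like(g), -torch.ones_like(g))
    return torch.where(s == 0, coin, s)

def sign_deterministic_plus(g):
    # sign(0) -> +1 deterministically, as in the ICLR 2019 distributed experiments
    return torch.where(g >= 0, torch.ones_like(g), -torch.ones_like(g))

g = torch.tensor([-0.3, 0.0, 1.7])
print(sign_random_zero(g), sign_deterministic_plus(g))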

amitport commented 3 years ago

@jxbz thank you. I just wanted to make sure I understand what was used in the graphs, which I guess is the one-bit sign.

In any case, we can probably agree that the ternary sign {-1, 0, 1} is significantly better than the one-bit sign, so the distinction is meaningful. (And also that randomizing sign(0) is a big improvement over the simple one-bit sign.)