Closed · Dueschen closed this issue 2 years ago
Hi @Dueschen! Nice to meet you. Thank you for your comments and experiment.
I was in a hurry when I wrote this code, so now I have to guess myself why I wrote it that way. 😄
Well, I think at first I did not divide by `numItems` because I worried that if there were too many items in one minibatch, everything might become zero: if `numItems` is too big, the divisor `(1 << k) * numItems` truncates every pre-activation sum to zero under integer division (for example, with k = 8 and numItems = 1000 the divisor is 256 × 1000 = 256,000, so any partial sum smaller than that becomes 0). But in the usual cases, as in the MNIST and Fashion-MNIST examples, `numItems` is less than 100, so `divisor = (1 << k) * numItems` works well. I think that is what happened with `pocketTanh`.
If new logic could be added to check whether `numItems` is too big or not, I think it would improve the code!
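For illustration, such a check might look roughly like the sketch below. The names (`chooseDivisor`, `preActSums`) are invented for this sketch and are not part of the actual PocketNN code; it only shows the idea of falling back to a smaller divisor when `(1 << k) * numItems` would zero out every pre-activation sum.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not the actual PocketNN API): pick a divisor for the
// integer pre-activation sums, falling back to a smaller one when
// (1 << k) * numItems would truncate every value to zero.
int64_t chooseDivisor(const std::vector<int64_t>& preActSums, int k, int numItems) {
    const int64_t fullDivisor = (static_cast<int64_t>(1) << k) * numItems;

    // Largest absolute pre-activation value in this minibatch.
    int64_t maxAbs = 0;
    for (int64_t s : preActSums) {
        maxAbs = std::max(maxAbs, s < 0 ? -s : s);
    }

    // If even the largest value would be divided down to zero, numItems is
    // "too big" for this divisor; keep only the (1 << k) part instead.
    if (maxAbs < fullDivisor) {
        return static_cast<int64_t>(1) << k;
    }
    return fullDivisor;
}
```

The same idea could be folded into wherever the divisor is currently computed before the activation call.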
Well, this is a first draft of my thoughts and I might have missed something. Please let me know about anything I missed! Thank you again for your valuable comments.
Hi dear authors, thank you for uploading the code for your great paper!
I noticed that the divisors before some activation functions are different: there is no divisor before `PocketReLU8bit`, `PocketSigmoid` uses `(1 << k)`, and `PocketTanh` uses `(1 << k) * numItems`. It seems that `(1 << k) * numItems` matches the scale of the partial sum better, so I tried modifying the others. With the divisor `(1 << k) * numItems`, `PocketReLU8bit` works much better (11% → 95%) and `PocketSigmoid` is also somewhat improved (91% → 96%). (I didn't try all activation functions; a sketch of the change I made is after the experiment list.)
Exp 1: PocketReLU8bit (layer 1 & 2) + PocketTanh (last layer)
Exp 2: PocketSigmoid (layer 1 & 2) + PocketTanh (last layer)
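Roughly, the change amounts to the following (the function name and signature here are illustrative only, not the actual ones in the repository):

```cpp
#include <cstdint>
#include <vector>

// Illustrative only (not PocketNN's real signatures): scale the accumulated
// integer pre-activation sums by (1 << k) * numItems before *every*
// activation function, not just before PocketTanh.
void scaleBeforeActivation(std::vector<int64_t>& preActSums, int k, int numItems) {
    const int64_t divisor = (static_cast<int64_t>(1) << k) * numItems;
    for (int64_t& s : preActSums) {
        s /= divisor;  // integer division back down to the activation's input scale
    }
}
```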