jaewoosong / pocketnn

The official, proof-of-concept C++ implementation of PocketNN.
MIT License

PocketReLU8bit & PocketSigmoid work better with modified divisors #1

Closed Dueschen closed 2 years ago

Dueschen commented 2 years ago

Hi dear authors, thank you for uploading the code for your great paper!

I noticed that the divisors applied before the activation functions differ. For example, PocketReLU8bit uses no divisor, PocketSigmoid uses (1 << k), and PocketTanh uses (1 << k) * numItems. Since (1 << k) * numItems seems to match the scale of the partial sum better, I tried applying it to the others.

With the divisor (1 << k) * numItems, PocketReLU8bit seems to work much better (11%→95%), and PocketSigmoid also improves somewhat (91%→96%). (I didn't try all activation functions.)
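To make the scale argument concrete, here is a minimal sketch of the arithmetic. It assumes the partial sum is a sum of numItems products whose operands are each bounded in magnitude by maxAbs; that bound, the helper names, and the example values (k = 7, numItems = 64, maxAbs = 127) are illustrative assumptions, not taken from the PocketNN source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of PocketNN: worst-case magnitude of a sum of
// numItems products, assuming every operand is bounded in magnitude by maxAbs.
inline std::int64_t worstCasePartialSum(int numItems, int maxAbs) {
    assert(numItems > 0 && maxAbs >= 0);
    return static_cast<std::int64_t>(numItems) * maxAbs * maxAbs;
}

// Divisor without the numItems factor, i.e. (1 << k).
inline std::int64_t divisorWithoutItems(int k) {
    assert(k >= 0 && k < 31);
    return std::int64_t{1} << k;
}

// Divisor with the numItems factor, i.e. (1 << k) * numItems.
inline std::int64_t divisorWithItems(int k, int numItems) {
    assert(k >= 0 && k < 31 && numItems > 0);
    return (std::int64_t{1} << k) * numItems;
}
```

Under these assumed values, dividing the worst-case sum by (1 << k) leaves a result that still grows with numItems, while dividing by (1 << k) * numItems brings it back to roughly the magnitude of a single operand.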

Exp 1: PocketReLU8bit (layer 1 & 2) + PocketTanh (last layer)

void pktactv::pocketReLU8Bit(pktmat& matOut, pktmat& matIn, pktmat& matActvGradInv, int k, int numItems) {
    ...
    const int divisor = (1 << k) * numItems; // Modified

    for (int r = 0; r < matOut.rows(); ++r) {
        for (int c = 0; c < matOut.cols(); ++c) {
            int currElem = matIn.getElem(r, c) / divisor; // Modified
            if (currElem < minVal) {
                ...

Exp 2: PocketSigmoid (layer 1 & 2) + PocketTanh (last layer)

void pktactv::pocketSigmoid(pktmat& matOut, pktmat& matIn, pktmat& matActvGradInv, int k, int numItems) {
    ...
    const int divisor = (1 << k) * numItems; // Modified
    const int slopesInv[7] = { PKT_MAX, 8, 2, 1, 2, 8, PKT_MAX };
    ...

jaewoosong commented 2 years ago

Hi @Dueschen ! Nice to meet you. Thank you for your comments and experiment.

I was in a hurry when I wrote the code, so I have to guess why I wrote it that way. 😄

Well, I think I initially did not divide by numItems because I worried that if there were too many items in one minibatch, the divisor would become so large that everything would be divided down to zero.
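This concern follows from C++ integer division, which truncates toward zero: any value whose magnitude is smaller than the divisor becomes 0. A minimal sketch, using a hypothetical helper and made-up values rather than anything from the PocketNN code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of PocketNN: returns true when every element
// becomes 0 after integer division by divisor (division truncates toward zero).
inline bool allBecomeZero(const std::vector<int>& values, int divisor) {
    assert(divisor > 0);
    for (int v : values) {
        if (v / divisor != 0) {
            return false;  // at least one element survives the division
        }
    }
    return true;
}
```

For example, with values such as {5000, -12000, 30000} and k = 7, a divisor of (1 << 7) * 64 still leaves nonzero results, whereas (1 << 7) * 1000 exceeds every magnitude and zeroes them all.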

But in the usual cases, as in the MNIST and Fashion-MNIST examples, numItems is less than 100, so divisor = (1 << k) * numItems will work well. I think this is what happened with pocketTanh.

I think adding logic to check whether numItems is too big would improve the code!
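One possible shape for such a check, as a sketch only: cap the divisor at a caller-chosen upper bound so that a very large numItems cannot push it arbitrarily high. The function name guardedDivisor, the maxDivisor parameter, and the capping policy itself are assumptions made for illustration; the thread does not specify how the check should work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical guard, not part of PocketNN: computes (1 << k) * numItems but
// never returns more than maxDivisor, so a large numItems cannot make the
// divisor grow without bound.
inline std::int64_t guardedDivisor(int k, int numItems, std::int64_t maxDivisor) {
    assert(k >= 0 && k < 31);
    assert(numItems > 0 && maxDivisor > 0);
    const std::int64_t full = (std::int64_t{1} << k) * numItems;
    return std::min(full, maxDivisor);  // capping is an assumed policy
}
```

With an assumed cap of 1 << 15, k = 7 and numItems = 64 give the full divisor 8192, while numItems = 1000 is capped at 32768. A suitable cap would depend on the range of values actually fed to the activation.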

Well, this is a first draft of my thoughts, and I might have missed something. Please let me know if I did! Thank you again for your valuable comments.