huawei-noah / AdderNet

Code for the paper "AdderNet: Do We Really Need Multiplications in Deep Learning?"
BSD 3-Clause "New" or "Revised" License

Using too much GPU memory while training AdderNet on ImageNet? #23

Closed Tsings04 closed 3 years ago

Tsings04 commented 4 years ago

I am trying to train AdderNet with ResNet-18 on ImageNet from scratch, using 4 1080 Ti cards, but it occupies so much memory that I can only set the batch_size to 16, and training is also far too slow.

For comparison, I tried replacing the adder filters with normal conv filters, and the same 4 GPU cards could handle a batch size of 128. Did I set something up wrong, or is this currently the normal case for AdderNet?

Have you tried training on ImageNet?
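
For context, my understanding is that the adder layer is implemented by unfolding input patches and computing an L1 distance over the resulting tensor, roughly like the sketch below (my own simplification, not the repository's exact adder.py), which is where the extra memory goes:

```python
import torch
import torch.nn.functional as F

def adder2d_unfold(x, w, stride=1, padding=0):
    # Simplified, unfold-based adder "convolution" (illustrative only).
    # It materializes every input patch, so memory grows much faster
    # than for a cuDNN convolution of the same size.
    n, c_in, h_in, w_in = x.shape
    c_out, _, k, _ = w.shape
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1

    # (N, C_in*k*k, L) with L = H_out*W_out: the first big intermediate.
    cols = F.unfold(x, k, padding=padding, stride=stride)
    w_flat = w.view(c_out, -1)

    # Broadcasting to (N, C_out, C_in*k*k, L) for the L1 distance is where
    # memory really explodes compared with a matmul-based convolution.
    out = -(cols.unsqueeze(1) - w_flat.unsqueeze(0).unsqueeze(-1)).abs().sum(2)
    return out.view(n, c_out, h_out, w_out)
```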

deepdarkfans commented 4 years ago

I only have one 1660 Ti card and set the batch_size to 64, and it runs successfully.

HantingChen commented 4 years ago

It is currently the normal case for AdderNet.

Tsings04 commented 4 years ago

Hi Hanting, I am trying to reproduce the experiments. When I train the BNN networks on the CIFAR dataset using the setup in the paper (SGD, lr 0.1, momentum 0.9, weight decay 0.0005; batch size 256, 400 epochs), none of the BNN networks (VGG, ResNet-20, ResNet-32) reach the accuracy reported in the paper. Do you have any advice on training the BNNs? I have never tried BNNs before :p

HantingChen commented 4 years ago

We use DoReFa-Net for training the BNNs. Which method do you use?
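
For reference, a DoReFa-style 1-bit weight quantizer in PyTorch usually looks something like the sketch below (a generic illustration with my own names, not the exact code used for the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeWeightSTE(torch.autograd.Function):
    """DoReFa-style 1-bit weights: sign(w) scaled by E[|w|],
    with a straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().mean()
        return torch.sign(w) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient to the latent full-precision weights.
        return grad_output

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeWeightSTE.apply(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```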

Tsings04 commented 4 years ago

I see! I just followed the original binary networks to build those models, but they were hard to train. I will try DoReFa-Net instead. Thanks!

Tsings04 commented 4 years ago

I have tested the training and validation time of LeNet with different filters on MNIST, but the LeNet with adder filters lags far behind, even for validation on CPU with no backward pass. It seems that the actual performance of the adder filter does not match its theoretical speed improvement. :(

Have you also seen this result in your experiments?

[image: Compare_gpu_cpu]
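
For anyone who wants to reproduce a similar forward-only comparison, something along these lines works (a sketch; the adder2d import is my guess at this repository's adder.py, and the layer sizes are illustrative):

```python
import time
import torch
import torch.nn as nn
# from adder import adder2d  # adder layer from this repository (uncomment if available)

@torch.no_grad()
def time_forward(layer, x, warmup=5, iters=50):
    """Average forward time in seconds on CPU, no backward pass."""
    for _ in range(warmup):
        layer(x)
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(64, 1, 28, 28)           # an MNIST-sized batch
conv = nn.Conv2d(1, 20, kernel_size=5)   # LeNet-style first layer
print("conv forward:", time_forward(conv, x))
# adder = adder2d(1, 20, kernel_size=5)  # adder counterpart with the same shape
# print("adder forward:", time_forward(adder, x))
```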

HantingChen commented 4 years ago

Yes. The implementation of convolution is accelerated by several techniques, so the adder filter cannot achieve a comparable acceleration without such techniques for now.

Tsings04 commented 4 years ago

Thank you for your reply! This confused me in my AdderNet experiments, but now I get it :)

Tsings04 commented 4 years ago

[image: Loss_acc_on_cifar10]

These are the experiments on CIFAR-10. What is strange is that AdderNet seems to overfit the training set and performs worse on the validation set during the early period, and only starts to learn more general features after about 300 epochs of training. Do you know why the training curves look like that?

HantingChen commented 4 years ago

The magnitude of the outputs in AdderNets is large, so the variance and mean computed by BN are inaccurate when the learning rate is not small.
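
One quick way to see this is to log the BatchNorm running statistics during the first few epochs; a rough helper along these lines (names are mine, purely illustrative):

```python
import torch.nn as nn

def log_bn_stats(model, tag=""):
    """Print the magnitude of each BatchNorm layer's running statistics;
    unusually large values early in training reflect the effect described above."""
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            print(f"{tag} {name}: |mean|={m.running_mean.abs().mean().item():.3f} "
                  f"var={m.running_var.mean().item():.3f}")
```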

ranery commented 4 years ago

Could you also show your training trajectory and the test accuracies you achieved on the CIFAR-100 dataset? Thanks!

HantingChen commented 4 years ago

It is not easy for me to release any code or documents externally, since that would require a long auditing process.

You can modify the training code to train on CIFAR-100 yourself, and you can ask me if you run into any problems.
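
Roughly, the change amounts to swapping the dataset class and the number of output classes, e.g. (illustrative only; the normalization values are commonly used CIFAR-100 statistics, not taken from this repository):

```python
import torchvision
import torchvision.transforms as transforms

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform_train)
# ...and build the model with num_classes=100 instead of 10.
```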

Tsings04 commented 4 years ago

[image: Loss_acc_on_cifar100]

Yes, here you go. This is our reproduction, but I would note that the VGG model we use is different from the original VGG_small in the paper: we reduced the number of filters to shorten the training time, since otherwise we would have to train VGG_small with adder filters for about 9 days on our server.
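
For example, the reduction is just a smaller channel configuration, along these lines (the numbers below are illustrative, not our exact config):

```python
# Illustrative VGG_small-style channel configuration and a halved variant;
# fewer filters per layer means shorter training at the cost of capacity.
VGG_SMALL = [128, 128, 256, 256, 512, 512]
VGG_SMALL_REDUCED = [c // 2 for c in VGG_SMALL]
```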