huqinghao / PalQuant


Loss nan for w1a1g3 #3

Open ChuanjunLAN opened 1 year ago

ChuanjunLAN commented 1 year ago

Thanks for your work again. I tried your default config for w4a4g2 quantization and it works well for ResNet-18 on ImageNet (top-1 acc ~71%). So I wanted to see whether it can also work for w1a1 (a.k.a. BNN). I used the same configuration as W4A4G2 and adjusted the bit level to w1a1g3, but the training loss is NaN at the very first step. Have you tried this method at 1 bit? Can you give me some advice? Thanks in advance.
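A minimal debugging sketch for this kind of first-step NaN (not from the PalQuant code base; `model` and one input batch `x` are assumed to already exist): it registers a forward hook on every module and prints which layers produce non-finite outputs, which helps narrow down where the NaN starts.

```python
import torch

def add_nan_hooks(model):
    """Attach a forward hook to every submodule that reports non-finite outputs."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in: {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle when done

# usage: handles = add_nan_hooks(model); model(x)
```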

huqinghao commented 1 year ago

We didn't try binary quantization, since it requires many modifications to the network structure. You could try the BN-BA-Conv structure, leave the downsampling layers in fp32, use Adam as the network optimizer, and apply other tricks. You could check this repo: https://github.com/HolmesShuan/Training-Tricks-for-Binarized-Neural-Networks
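For reference, a rough PyTorch sketch of what those tricks could look like; the `BNBAConv` block and `binarize` helper below are hypothetical illustrations, not PalQuant code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(x):
    # sign() in the forward pass, identity gradient in the backward pass
    # (a simple straight-through estimator)
    return x + (torch.sign(x) - x).detach()

class BNBAConv(nn.Module):
    """BN -> binary activation -> conv with binarized weights."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)

    def forward(self, x):
        x = binarize(self.bn(x))        # binary activations
        w = binarize(self.conv.weight)  # binary weights
        return F.conv2d(x, w, stride=self.conv.stride, padding=self.conv.padding)

# The 1x1 downsampling convs would stay as ordinary fp32 nn.Conv2d, and
# torch.optim.Adam would replace SGD for the network weights.
```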

ChuanjunLAN commented 1 year ago


Thanks for your reply. I have also tried w2a2g2, but I only get ~52% acc for ResNet-18 on ImageNet. Can you give me some advice?

huqinghao commented 1 year ago

The default config (w2a2g2) is supposed to reproduce ~71% top-1 accuracy on ImageNet. I have retrained the model, and the results are consistent with the paper. Maybe you could post your training log and args.

ChuanjunLAN commented 1 year ago


configs:

```
data: ../dataset/imagenet-1k/
arch: resnet18_quant
workers: 20
epochs: 90
start_epoch: 0
batch_size: 256
optimizer_m: SGD
optimizer_q: Adam
lr_scheduler: cosine
lr_m: 0.1
lr_q: 0.0001
lr_m_end: 0
lr_q_end: 0
decay_schedule: 40-80
gamma: 0.1
momentum: 0.9
nesterov: False
weight_decay: 0.0001
pretrained: True
model: None
groups: 2
print_freq: 10
resume:
evaluate: False
world_size: 1
rank: 0
dist_url: tcp://127.0.0.1:23456
dist_backend: nccl
seed: None
gpu: None
multiprocessing_distributed: True
QWeightFlag: True
QActFlag: True
weight_levels: 2
act_levels: 2
bkwd_scaling_factorW: 0.0
bkwd_scaling_factorA: 0.0
visible_gpus: 0,1,2,3,4,5,6,7
log_dir: ./results/resnet-18/W2A2G2/
```

Here are the args I used for w2a2g2 training. For some reason I can't post my logs here; the training loss is ~1.96, training top-1 acc ~53.5%, and val top-1 acc ~51.5%.

huqinghao commented 1 year ago

:sweat_smile: 'weight_levels: 2 and act_levels: 2'... this means you are using binary quantization, i.e. quantizing activations and weights to 2 values. I explained the weight levels and activation levels in the README. As I mentioned above, I didn't try binary quantization. BTW, the default training config uses 4 GPUs to train the model; you'd better scale up your learning rate if you use 8 or more GPUs.
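To make the levels convention concrete (a sketch assuming `levels = 2**bits`, i.e. uniform quantization, and linear learning-rate scaling with GPU count; check the README for the exact rule PalQuant uses):

```python
def levels(bits: int) -> int:
    return 2 ** bits  # 1-bit -> 2 levels (binary), 2-bit -> 4 levels

def scaled_lr(base_lr: float, num_gpus: int, base_gpus: int = 4) -> float:
    return base_lr * num_gpus / base_gpus  # linear scaling relative to the 4-GPU default

print(levels(2))          # 4   -> weight_levels / act_levels for a real w2a2 run
print(scaled_lr(0.1, 8))  # 0.2 -> lr_m when training on 8 GPUs
```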