tychovdo / lila

Code for "Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations"

Problems when reproducing results of CIFAR experiments #1

Closed: QYQYQYQYQYQ closed this issue 10 months ago

QYQYQYQYQYQ commented 1 year ago

Hi authors,

I think your paper is very interesting, and its clear and compact presentation inspired me a lot. However, I ran into some trouble when running the experiments on the CIFAR-10 dataset. With a single GPU the run goes out of memory, and when I use multiple GPUs the script fails with the errors shown in the screenshot below. If I instead decrease the batch size, training does not converge after the burn-in epochs. How can I reproduce the results of the CIFAR experiments?

I would appreciate any help you can offer. Many thanks! (screenshot attached)

tychovdo commented 1 year ago

Hi,

Sure, glad to help! At the moment, I don't think our code supports multi-GPU. However, you should still be able to run the CIFAR-10 experiments, assuming you have access to a reasonably sized single GPU.

Could you provide a bit more detail on how you are using the code?

In particular: 1) the exact script, settings, and arguments you used, and 2) how large is your GPU?

Also, please take a look at the experimentscripts folder, which contains the scripts we used for the experiments in the paper. For instance, it might be important to turn on the --use_jvp flag, which enables a more memory-efficient gradient computation.

QYQYQYQYQYQ commented 1 year ago

Thanks for your reply!

The arguments I used are shown below; this is the first command in /experimentscripts/cifar_wrn.sh in this repository:

CUDA_VISIBLE_DEVICES=0 python classification_image.py --subset_size 50000 --dataset cifar10 --model wrn --approx ggn_kron --n_epochs 200 --batch_size 128 --marglik_batch_size 120 --partial_batch_size 40 --lr 0.1 --n_epochs_burnin 10 --n_hypersteps 140 --n_hypersteps_prior 4 --seed 711 --lr_aug 0.05 --lr_aug_min 0.005 --lr_hyp 0.1 --method avgfunc --n_samples_aug 20 --optimize_aug --use_jvp

And I ran it on both an A100 (40GB) and a 3090 (24GB), which I thought would be enough for this script.

By the way, the argument parser in classification_image.py does not define lr_hyp_min, so I removed that flag from the command.

QYQYQYQYQYQ commented 1 year ago

Hi authors,

We found that GPU memory usage keeps rising, so could there be a memory leak in the hyperparameter optimization that differentiates through the marginal likelihood? The warning message also suggests a possible memory leak when building the computation graph, as shown below.

Does this help narrow down the problem? Thanks for your patience in looking into it!

(screenshot of the warning attached)

tychovdo commented 1 year ago

@QYQYQYQYQYQ Do the smaller resnet_8_8 or resnet_8_16 networks run for you? You might want to check out the experiment script files used in the paper: https://github.com/tychovdo/lila/blob/main/experimentscripts/cifar_resnet_8_8.sh https://github.com/tychovdo/lila/blob/main/experimentscripts/cifar_resnet_8_16.sh

It might be that cifar_wrn.sh uses the largest wide resnet we tried and requires an 80GB A100 (with those settings/batch sizes).

@AlexImmer Do you recall the memory requirements for cifar_wrn.sh?

QYQYQYQYQYQ commented 1 year ago

@tychovdo Thanks for your reply!

The smaller resnet_8_8 and resnet_8_16 can run on 3090-24G.

By the way, I tried the translated MNIST script shown below. It reported "RuntimeError: no valid convolution algorithms available in CuDNN" (screenshot attached below). But when I decreased the batch size from 1000 to 10, it reported CUDA out of memory instead.

CUDA_VISIBLE_DEVICES=8 python classification_image.py --method avgfunc --dataset translated_mnist --n_epochs 500 --device cuda --subset_size 20000 --n_samples_aug 31 --save --optimize_aug --approx ggn_kron --batch_size 1000 --seed 1 --model cnn --n_epochs_burnin 10 --marglik_batch_size 256 --n_epochs_burnin 10 --download

So could you offer some advice, for example how much to decrease the batch size, to make the WRN experiments run on a smaller GPU?

(screenshot attached)

tychovdo commented 1 year ago

That last error never happened for us, but it might be due to limited GPU memory. Could you add the --use_jvp flag?

This should enable a more memory-efficient implementation of the (costly) backpropagation of the marginal likelihood approximation.
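To give a rough idea of the mechanism (this is just a generic sketch using torch.func in recent PyTorch, not the code from this repo), a forward-mode Jacobian-vector product computes a directional derivative during the forward pass instead of building and storing the full reverse-mode graph:

import torch
from torch.func import functional_call, jvp

model = torch.nn.Linear(10, 2)
# torch.func works on "functionalized" models: parameters are passed in explicitly.
params = {k: v.detach() for k, v in model.named_parameters()}
x = torch.randn(4, 10)

def forward(p):
    return functional_call(model, p, (x,))

# A direction in parameter space with the same structure as params.
tangents = {k: torch.randn_like(v) for k, v in params.items()}

# Forward mode: outputs and their directional derivative come from a single
# forward pass, without storing activations for a separate backward pass.
outputs, directional_derivative = jvp(forward, (params,), (tangents,))
print(outputs.shape, directional_derivative.shape)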

aleximmer commented 1 year ago

Usually, when cuDNN fails to find a valid convolution, it also hints at an OOM error, so I believe that is in fact the problem. The WRN experiments were indeed run on A100s with 80GB, and the settings are chosen to max out the memory in order to reduce the number of steps. If you still want to run the experiment, you need to reduce --marglik_batch_size and --partial_batch_size and correspondingly increase the number of hypersteps --n_hypersteps to account for the smaller accumulation batch size given by the partial_batch_size argument. jvp is already enabled by default in the cifar10 bash script.
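For instance (illustrative numbers only, not a configuration we tested), starting from the first cifar_wrn.sh command you could roughly halve the two marginal-likelihood batch sizes and double the hypersteps:

CUDA_VISIBLE_DEVICES=0 python classification_image.py --subset_size 50000 --dataset cifar10 --model wrn --approx ggn_kron --n_epochs 200 --batch_size 128 --marglik_batch_size 60 --partial_batch_size 20 --lr 0.1 --n_epochs_burnin 10 --n_hypersteps 280 --n_hypersteps_prior 4 --seed 711 --lr_aug 0.05 --lr_aug_min 0.005 --lr_hyp 0.1 --method avgfunc --n_samples_aug 20 --optimize_aug --use_jvp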

Alternatively, I would recommend using our latest method, which is much faster and easier to run on smaller GPUs. See the code and paper. There you can find the script with the CIFAR commands; change the corresponding config line to use a relatively small batch size of maybe 30-50 so it fits in 24GB. Otherwise, you can also decrease the n_samples_aug hyperparameter. Note that these changes might come at a cost in performance, though.