[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
I'm trying to run pretraining with ResNet-50 on my own data, and I'm running into out-of-memory issues.
Initially I was using two V100s (32 GB each), and the maximum batch size I could reach was 256. However, I can't go higher even with larger-memory GPUs: I tried an A100 in both its 40 GB and 80 GB variants, and the maximum batch size I could use without hitting out-of-memory errors was still 256.
I'm a bit confused and wondering whether there's a gap in my understanding; let me know if I'm missing anything!
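For reference, here is roughly how I've been sanity-checking where the memory wall sits. It's a generic single-GPU sketch with a plain torchvision ResNet-50 forward/backward, not the actual SparK pretraining step (no sparse masking or decoder), so absolute numbers will differ; I'm only using it to see how peak memory scales with batch size:

```python
import torch
import torchvision

# Rough probe: peak GPU memory for one forward/backward pass of a plain
# torchvision ResNet-50 at a given batch size. NOT the SparK pretraining
# step, so absolute numbers will differ from the real pipeline.
def peak_mem_gb(batch_size, image_size=224, device="cuda"):
    model = torchvision.models.resnet50().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    y = torch.randint(0, 1000, (batch_size,), device=device)

    torch.cuda.reset_peak_memory_stats(device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    return torch.cuda.max_memory_allocated(device) / 1024**3

if __name__ == "__main__":
    for bs in (64, 128, 256):
        print(f"batch {bs}: ~{peak_mem_gb(bs):.1f} GB peak")
```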
Hi @knightron0, if a batch size of 256 already maxes out a 32 GB V100, then a 40 GB A100 topping out at a similar batch size is expected.
FYI: we used 32 × 80 GB A100s for ResNet-50 pretraining, with a per-GPU batch size of 128, and that was fine.
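If a larger effective batch is the goal, gradient accumulation gets you there without holding more samples in memory at once. Below is a minimal, self-contained sketch in plain PyTorch (not this repo's training loop; the toy model and random data are placeholders) showing an effective batch of 256 built from micro-batches of 128:

```python
import torch
import torch.nn as nn

# Gradient-accumulation sketch (generic PyTorch, not SparK's loop): reach an
# effective batch of 256 while only keeping micro-batches of 128 in memory.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

micro_bs, accum_steps = 128, 2           # 128 x 2 = effective batch of 256
optimizer.zero_grad()
for step in range(4):                    # stand-in for iterating a DataLoader
    x = torch.randn(micro_bs, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (micro_bs,), device=device)
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps  # average grads
    loss.backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one update per effective batch
        optimizer.zero_grad()
```

The per-step statistics (e.g. BatchNorm) still only see 128 samples at a time, so it is not numerically identical to a true batch of 256, but the gradient update is averaged over the full effective batch.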