mkshing / e4t-diffusion

Implementation of Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models
https://arxiv.org/abs/2302.12228
MIT License
317 stars 24 forks

Need help... OOM with 2 RTX3090 (bs=2) #25

Open CHR-ray opened 1 year ago

CHR-ray commented 1 year ago

Here is my accelerate YAML config:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp8
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

And my tuning command:

```shell
accelerate launch tuning_e4t.py \
  --pretrained_model_name_or_path e4t-diffusion-ffhq-celebahq-v1 \
  --prompt_template "a photo of {placeholder_token}" \
  --reg_lambda 0.1 \
  --output_dir tune_yann-lecun \
  --train_image_path "https://engineering.nyu.edu/sites/default/files/styles/square_large_default_1x/public/2018-06/yann-lecun.jpg?h=65172a10&itok=NItwgG8z" \
  --resolution 512 \
  --train_batch_size 2 \
  --learning_rate 1e-6 --scale_lr \
  --max_train_steps 30
```

I would think 48 GB of VRAM should be enough, since the paper uses only a single A100. Why do I still get OOM even with batch size = 2?

zhanjiahui commented 10 months ago

I was able to save a lot of memory by using the bitsandbytes package, and could fine-tune with a batch size of 16 on a single RTX 3090. Just add `--use_8bit_adam` at the end of the command.
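For context, this kind of flag usually just swaps the optimizer for bitsandbytes' 8-bit AdamW, which keeps the optimizer's momentum/variance state quantized to 8 bits instead of 32, cutting optimizer memory roughly 4x. A minimal sketch of that selection logic (the helper name `create_optimizer` is mine, not from this repo's code):

```python
import torch
import torch.nn as nn


def create_optimizer(params, lr, use_8bit_adam=False):
    """Return AdamW, optionally the memory-saving 8-bit variant."""
    if use_8bit_adam:
        # Requires `pip install bitsandbytes` and a CUDA GPU.
        import bitsandbytes as bnb
        # Same update rule as AdamW, but optimizer state is stored in 8 bits.
        return bnb.optim.AdamW8bit(params, lr=lr)
    # Standard 32-bit AdamW fallback.
    return torch.optim.AdamW(params, lr=lr)


# Example: plain AdamW fallback (no bitsandbytes needed).
model = nn.Linear(4, 4)
optimizer = create_optimizer(model.parameters(), lr=1e-6)
```

Note the savings apply only to optimizer state; activations and model weights are unaffected, so combining this with gradient checkpointing may still be needed at higher batch sizes.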