Open wanghao-cst opened 1 year ago
Did you solve this? My 80GB GPU won't support training with a batch size of 1 either. Something doesn't seem right.
Facing the same issue. Any comments on this will be very helpful.
For training, 8×40GB A100s with batchsize_per_device=8 is OK; for evaluation, 1×32GB is enough. 8×32GB V100s need a smaller batch size, e.g. 2.
I don't know if this error is related to the package versions. Maybe you can try the following configuration.
The torch and torchvision versions used in our environment are torch-1.13.1+cu117 and torchvision-0.14.1+cu117, and you can download the transformers version we currently use here if you fail to install it from git.
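For what it's worth, installing those pinned versions typically looks like the command below; the cu117 extra index URL is the standard PyTorch wheel index, so adjust it if your CUDA setup differs:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117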
@kq-chen Thanks for the information on memory requirements. I think torch-1.13.1 does not support FSDP; we would need torch 2.0.1, since your code uses the FSDP implementation. Could you please cross-check your environment again? Thanks.
@ShramanPramanick Yep, I double-checked the torch version. It's torch-1.13.1, and it supports FSDP for full-parameter tuning.
When using torch 1.13.1, I receive the following error message: ValueError: FSDP requires PyTorch >= 2.0.1.
With torch 2.0.1, the code runs fine with 8 samples per 40GB GPU. So, I think the main concern of this issue is addressed.
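If you want to confirm which requirement your environment satisfies before launching a run, a quick sketch of a version check (the 2.0.1 threshold is taken from the ValueError quoted above) is:
import torch
from packaging import version
# The FSDP code path reported above requires PyTorch >= 2.0.1.
assert version.parse(torch.__version__) >= version.parse("2.0.1"), torch.__version__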
Hi all, I am facing the same issue. I have 8 A100 GPUs with 40GB memory each. Even with 2 images on a single GPU I am getting a memory error. My torch version is 2.0.1. I am using the following command:
CUDA_VISIBLE_DEVICES=7 accelerate launch --num_processes 1 --main_process_port 23786 mllm/pipeline/finetune.py config/shikra_pretrain_concat3_stage0.py --output_dir test --cfg-options model_args.model_name_or_path=/path/to/publicly/available/shikhra/model --dataloader_num_workers 0 --overwrite_output_dir --per_device_train_batch_size 2
Please let me know if I am making any mistakes, or whether there are any special configs to deal with the memory issue.
The issue is solved by downgrading torch.
@ShramanPramanick So did you solve this problem? I also use torch 2.0; however, the code still runs out of memory on 8×A100 40GB GPUs.
@nicedoctor Sorry for the late response. This out-of-memory issue happens due to the long sequence length of some datasets (typically those with many bounding boxes in the output). If you exclude such datasets from mix_pretrain_concat8.py, it should work. According to my study, these datasets are: {{_base_.DEFAULT_TRAIN_DATASET.VCR_q_ra}}, {{_base_.DEFAULT_TRAIN_DATASET.VCR_qc_rac}}, {{_base_.DEFAULT_TRAIN_DATASET.VCR_qac_r}}, {{_base_.DEFAULT_TRAIN_DATASET.VQAE_train}}, and {{_base_.DEFAULT_TRAIN_DATASET.VQAX_train}}.
However, this is only my own study; I am not sure how the authors managed to train the entire stage one on 8×40GB cards.
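For illustration only, the exclusion might look roughly like the fragment below inside mix_pretrain_concat8.py, reusing the {{_base_....}} reference syntax from the list above; the list name and the kept entries are placeholders, not the real config:
cfgs=[
    # ... keep the other {{_base_.DEFAULT_TRAIN_DATASET.*}} entries unchanged ...
    # Excluded: outputs with many bounding boxes lead to very long sequences
    # {{_base_.DEFAULT_TRAIN_DATASET.VCR_q_ra}},
    # {{_base_.DEFAULT_TRAIN_DATASET.VCR_qc_rac}},
    # {{_base_.DEFAULT_TRAIN_DATASET.VCR_qac_r}},
    # {{_base_.DEFAULT_TRAIN_DATASET.VQAE_train}},
    # {{_base_.DEFAULT_TRAIN_DATASET.VQAX_train}},
],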
Encountered the same OOM issue on my side with batchsize_per_device=8 on 8×A100 40GB GPUs as well. @kq-chen It would be great if the authors could clarify.
Q1: What is the minimum CUDA memory requirement for training? Q2: Does the raw training script support DeepSpeed? It seems 24GB of CUDA memory is not enough even for training with a batch size of 1.