shikras / shikra


CUDA memory requirement #21

Open wanghao-cst opened 1 year ago

wanghao-cst commented 1 year ago

Q1: What is the minimum CUDA memory requirement for training? Q2: Does the raw training script support DeepSpeed? It seems 24 GB of CUDA memory is not enough even for training with a batch size of 1.

WizardMx commented 1 year ago

Did you solve this? My 80 GB GPU can't handle training with a batch size of 1 either. Something seems wrong.

ShramanPramanick commented 1 year ago

Facing the same issue. Any comments on this will be very helpful.

kq-chen commented 1 year ago

For training, 8×40GB A100s with batchsize_per_device=8 are OK; for evaluation, 1×32GB is enough. 8×32GB V100s need a smaller batch size, e.g. 2.

I don't know whether this error is related to the package versions. Maybe you can try the following configuration.

The torch and torchvision versions used in our environment are torch-1.13.1+cu117 and torchvision-0.14.1+cu117, and you can download the transformers version we currently use here if you fail to install it from git.
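As a minimal sketch, one way to pin the torch/torchvision versions mentioned above (assuming the cu117 wheels are still hosted on the standard PyTorch wheel index; the transformers package still comes from the link above):

# Pin torch/torchvision to the cu117 builds listed in this comment.
# Assumption: the standard PyTorch cu117 wheel index still hosts these wheels.
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117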

ShramanPramanick commented 1 year ago

@kq-chen Thanks for the memory requirement information. I think torch-1.13.1 does not support FSDP; we would need torch 2.0.1, since your code uses the FSDP implementation. Could you please cross-check your environment again? Thanks.

kq-chen commented 1 year ago

@ShramanPramanick Yep, I double-checked the torch version. It's torch-1.13.1, and it supports FSDP for full-parameter tuning.

ShramanPramanick commented 1 year ago

When using torch 1.13.1, I receive the following error message: ValueError: FSDP requires PyTorch >= 2.0.1. With torch 2.0.1, the code runs fine with 8 samples per 40GB GPU. So, I think the main concern of this issue is addressed.

ASMIftekhar commented 1 year ago

Hi all, I am facing the same issue. I have 8 A100 GPUs with 40 GB of memory each. Even with 2 images on a single GPU I get an out-of-memory error. My torch version is 2.0.1. I am using the following command:

CUDA_VISIBLE_DEVICES=7 accelerate launch --num_processes 1 --main_process_port 23786 \
    mllm/pipeline/finetune.py config/shikra_pretrain_concat3_stage0.py \
    --output_dir test \
    --cfg-options model_args.model_name_or_path=/path/to/publicly/available/shikhra/model \
    --dataloader_num_workers 0 \
    --overwrite_output_dir \
    --per_device_train_batch_size 2

Please let me know if I am making a mistake, or whether there are any special configs for dealing with the memory issue.

ASMIftekhar commented 1 year ago

The issue is solved by downgrading torch.

nicedoctor commented 1 year ago

@ShramanPramanick So did you address this problem? I also use torch 2.0; however, the code still runs out of memory on 8×A100 40GB GPUs.

ShramanPramanick commented 1 year ago

@nicedoctor Sorry for the late response. This out-of-memory issue happens due to the long sequence length of some datasets (typically those with many bounding boxes in the output). If you exclude such datasets from mix_pretrain_concat8.py, it should work. According to my study, these datasets are: {{_base_.DEFAULT_TRAIN_DATASET.VCR_q_ra}}, {{_base_.DEFAULT_TRAIN_DATASET.VCR_qc_rac}}, {{_base_.DEFAULT_TRAIN_DATASET.VCR_qac_r}}, {{_base_.DEFAULT_TRAIN_DATASET.VQAE_train}}, {{_base_.DEFAULT_TRAIN_DATASET.VQAX_train}}.

However, this is based on my own experiments; I am not sure how the authors were able to train the entire stage one on 8×40GB cards.
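For reference, a rough sketch of the edit described above. Only the excluded entry names come from the list in the previous comment; the wrapper name and the surrounding structure of mix_pretrain_concat8.py are assumptions, so keep them exactly as they appear in the repo and only drop the listed entries:

# mix_pretrain_concat8.py -- sketch only; wrapper name and remaining entries
# are assumptions, keep them as in the original file.
train=dict(
    type='ConcatDataset',  # assumed name of the concat wrapper
    cfgs=[
        # ... keep the remaining DEFAULT_TRAIN_DATASET entries unchanged ...
        # Excluded because their outputs contain many bounding boxes
        # (long sequences), per the comment above:
        # {{_base_.DEFAULT_TRAIN_DATASET.VCR_q_ra}},
        # {{_base_.DEFAULT_TRAIN_DATASET.VCR_qc_rac}},
        # {{_base_.DEFAULT_TRAIN_DATASET.VCR_qac_r}},
        # {{_base_.DEFAULT_TRAIN_DATASET.VQAE_train}},
        # {{_base_.DEFAULT_TRAIN_DATASET.VQAX_train}},
    ],
)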

xfgao commented 1 year ago

I encountered the same OOM issue on my side with batchsize_per_device=8 on 8×A100 40GB GPUs as well. @kq-chen It would be great if the authors could clarify.