Hi, please try with the latest DeepSpeed version, thanks.
Have you solved this problem? I encountered the same problem. In my case, it happened after 2444 training steps. The dataset contains about 540k custom samples.
I tried to fine-tune it on 4×4 A100-40G GPUs (4 nodes, 4 GPUs each) through this script:

```bash
python -m torch.distributed.run --nnodes=4 \
    --node_rank=$RANK \
    --nproc_per_node=4 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    ${MODLE_DIR}/train_mem.py \
    --deepspeed ${MODLE_DIR}/scripts/zero3.json \
    --model_name_or_path /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/ckpt/llava-v1.5-7b \
    --version v1 \
    --data_path /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/data_filter/mix_540k_ocr_translate_new.json \
    --image_folder ${IMAGE_DIR} \
    --vision_tower /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/ckpt/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/LLava_workspace/checkpoints/checkpoints/llava-v1.5-7b-ocr_translate_task_44card_new \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none
```
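For reference, with this launch configuration the effective global batch size works out to per_device_train_batch_size × nproc_per_node × nnodes × gradient_accumulation_steps = 8 × 4 × 4 × 1 = 128.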
I met the same issue and solved it by updating DeepSpeed to the latest version with:
pip install -U deepspeed
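If the hang persists after upgrading, it may be worth confirming that the training environment actually imports the new version, for example:

```python
import deepspeed
print(deepspeed.__version__)  # check this matches the version pip just installed
```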
Thanks to @haotian-liu for the constructive suggestion!
I encountered the same issue on my own data. With a batch size of 16 there are no problems, but with a batch size of 8 and gradient_accumulation_steps set to 2 it hangs. It also hangs when I add new data. I tried updating torch, deepspeed, and accelerate, but that did not resolve the issue.
@Echo0125 Maybe check the input sequence length after the image placeholder token is replaced with the real image embeddings.
My problem was solved by checking the input sequence length after the image placeholder token is replaced with the real image embeddings. In my dataset, some prompts are long enough that the total input exceeds the max sequence length once the placeholder is expanded into image embeddings, which leads to the error:

```
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
```
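In case it helps others, below is a minimal sketch of such a pre-flight length check. It assumes the LLaVA-style conversation JSON format and that each `<image>` placeholder expands to 576 vision tokens (CLIP ViT-L/14-336 with the MLP projector); the constants and helper names are illustrative, not the repo's actual API.

```python
# Hypothetical pre-flight check: flag samples whose token count, after each
# <image> placeholder is expanded into image embeddings, exceeds model_max_length.
NUM_IMAGE_TOKENS = 576        # assumption: CLIP ViT-L/14-336 -> 24x24 = 576 patch embeddings
MODEL_MAX_LENGTH = 2048       # e.g. --model_max_length 2048 as in the script earlier in this thread
IMAGE_PLACEHOLDER = "<image>"

def effective_length(sample, tokenizer):
    """Approximate input length once <image> placeholders become image embeddings."""
    text = "\n".join(turn["value"] for turn in sample["conversations"])
    n_images = text.count(IMAGE_PLACEHOLDER)
    n_text_tokens = len(tokenizer(text).input_ids)
    # each placeholder token is replaced by NUM_IMAGE_TOKENS embeddings
    return n_text_tokens + n_images * (NUM_IMAGE_TOKENS - 1)

def find_overlong_samples(samples, tokenizer, limit=MODEL_MAX_LENGTH):
    """Return the ids of samples that would overflow the model's max sequence length."""
    return [s.get("id") for s in samples if effective_length(s, tokenizer) > limit]
```

Filtering out (or truncating) the samples this reports should avoid the out-of-bounds index during the embedding lookup.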
@Echo0125 Did you manage to solve this issue? I get the same problem where gradient accumulation leads to an error.
Have you solved it? I met the same problem.
Describe the issue
Hi, when I use my own dataset (roughly 500k samples) for DDP training with 8 A100 80G GPUs, the training hangs and gives the following error:
At first I thought some corrupt images were causing the error, because there is a CUDA index error in the message above and the traceback points into the Swin Transformer. But I checked every image with PIL `Image.open` (roughly as in the sketch after the next paragraph) and deleted all images that produced a warning; no problem was found, and the training still got stuck. I also checked the input image tensor sizes and they are correct. I searched the community for suggestions, such as setting the following environment variable:
It still didn't work. Then I tried training with 2 GPUs and a per-device batch size of 1, and printed the image path to find the sample that got stuck. But that data turned out to be fine: I built a dataset containing only those 2 images, and training on it ran without getting stuck.
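For what it's worth, the per-image sanity check mentioned above (opening every image with PIL and treating warnings as failures) can be scripted roughly as follows; the JSON schema and path layout are assumptions based on LLaVA-style data, not code from the repo.

```python
import json
import os
import warnings

from PIL import Image

def scan_images(data_json, image_folder):
    """Return (path, error) pairs for images that fail to decode cleanly."""
    with open(data_json) as f:
        samples = json.load(f)
    bad = []
    for sample in samples:
        if "image" not in sample:           # text-only samples have no image field
            continue
        path = os.path.join(image_folder, sample["image"])
        with warnings.catch_warnings():
            warnings.simplefilter("error")  # promote PIL warnings (e.g. truncated files) to errors
            try:
                with Image.open(path) as img:
                    img.convert("RGB").load()  # force a full decode, not just a header read
            except Exception as e:
                bad.append((path, repr(e)))
    return bad
```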
However, when I train on a single GPU it works fine, and when I train on other datasets in DDP mode it also works fine. So I think the code is OK, which points to a problem in the dataset; but since single-GPU training works and this dataset was used to train another model before, the dataset also seems fine.
I also added the following code at the beginning of train.py:
and still just get the error message:
I'm quite confused and don't know what to do next.