huggingface / optimum-habana

Easy and lightning fast training of đŸ¤— Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

RuntimeError: shape '[-1, 0]' is invalid for input of size 134152192 for Mistral-7B finetune #1311

Closed · aitss2017 closed this issue 1 week ago

aitss2017 commented 1 week ago

System Info

HL-SMI Version: hl-1.17.0-fw-51.3.0        
Driver Version: 1.17.0-28a11ca  
Docker image: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1
deepspeed                   0.14.0+hpu.synapse.v1.17.0
habana_gpu_migration        1.17.0.495
habana-media-loader         1.17.0.495
habana-pyhlml               1.17.0.495
habana_quantization_toolkit 1.17.0.495
habana-torch-dataloader     1.17.0.495
habana-torch-plugin         1.17.0.495
optimum-habana              1.14.0.dev0
torch                       2.3.1a0+git4989238
torch_tb_profiler           0.4.0
torchaudio                  2.3.0+952ea74
torchdata                   0.7.1+5e6f7b7
torchmetrics                1.4.0.post0
torchtext                   0.18.0a0+9bed85d
torchvision                 0.18.1a0+fe70bc8

Information

Tasks

Reproduction

PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
    --use_deepspeed --world_size 8 run_clm.py \
    --model_name_or_path /DISK0/Mistral-7B-v0.3/ \
    --dataset_name /DISK0/alpaca \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/mistral_7b \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --deepspeed ./llama2_ds_zero3_config.json \
    --gaudi_config_name gaudi_config.json \
    --trust_remote_code True \
    --overwrite_output_dir \
    --block_size 2048
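The llama2_ds_zero3_config.json referenced above is not included in the issue; a minimal ZeRO-3 sketch in the spirit of the DeepSpeed configs shipped with the optimum-habana language-modeling example could look like the following (all field values here are assumptions, not the actual file):

# Hypothetical stand-in for llama2_ds_zero3_config.json; the real file in the
# optimum-habana examples may differ. "auto" lets the Trainer fill in batch
# sizes and accumulation steps from its own arguments.
cat > llama2_ds_zero3_config.json << 'EOF'
{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": false
  }
}
EOF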

Expected behavior

Fine-tuning should complete without raising RuntimeError: shape '[-1, 0]' is invalid for input of size 134152192.

regisss commented 1 week ago

I can reproduce this error; I'm going to investigate it.

regisss commented 1 week ago

@aitss2017 It seems Transformers v4.43.3 is already installed in the Docker image provided by Habana: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1

Thus, when installing optimum-habana, it is not upgraded. However, Transformers v4.43.4 was released precisely to fix this issue (see the release notes here).

Can you try to run

pip install transformers==4.43.4

before running your script?
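For reference, a minimal sketch to confirm which Transformers version the environment actually resolves before and after the upgrade:

# Show the Transformers version currently visible to Python
python -c "import transformers; print(transformers.__version__)"

# Install the release that contains the fix, then re-check
pip install transformers==4.43.4
python -c "import transformers; print(transformers.__version__)"  # expected: 4.43.4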

aitss2017 commented 1 week ago

Thanks, it is working now! By the way, I also tried the original Transformers v4.43.3 with DeepSpeed ZeRO-2, and that works fine.

aitss2017 commented 1 week ago

Another issue: when running evaluation after training, I hit the error below:

[INFO|trainer.py:1832] 2024-09-06 10:32:28,801 >> Num examples = 72
[INFO|trainer.py:1835] 2024-09-06 10:32:28,801 >> Batch size = 2
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault
Internal Error: Received signal - Segmentation fault

regisss commented 1 week ago

I see

RuntimeError: collective nonSFG is not supported during hpu graph capturing

I'm not sure if HPU graphs are compatible with ZeRO-3. It does work without --use_hpu_graphs_for_inference.

Another option is to call the script once for training with --do_train and DeepSpeed ZeRO-3 but without --do_eval, and a second time for evaluation with --do_eval, without --do_train and without DeepSpeed (it is a 7B-parameter model, so it fits on a single device at inference), as sketched below.
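A minimal sketch of that two-phase approach, reusing the flags from the reproduction command above; the checkpoint path for the evaluation run and the second output directory are assumptions, not values from the thread:

# Phase 1: training only, DeepSpeed ZeRO-3 on 8 HPUs, no --do_eval
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
    --use_deepspeed --world_size 8 run_clm.py \
    --model_name_or_path /DISK0/Mistral-7B-v0.3/ \
    --dataset_name /DISK0/alpaca \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --do_train \
    --output_dir /tmp/mistral_7b \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --deepspeed ./llama2_ds_zero3_config.json \
    --gaudi_config_name gaudi_config.json \
    --trust_remote_code True \
    --overwrite_output_dir \
    --block_size 2048

# Phase 2: evaluation only, single device, no DeepSpeed, no --do_train
# (assumes the trained checkpoint was saved to /tmp/mistral_7b above)
python run_clm.py \
    --model_name_or_path /tmp/mistral_7b \
    --dataset_name /DISK0/alpaca \
    --do_eval \
    --per_device_eval_batch_size 2 \
    --output_dir /tmp/mistral_7b_eval \
    --use_habana \
    --use_lazy_mode \
    --gaudi_config_name gaudi_config.json \
    --trust_remote_code True \
    --block_size 2048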

aitss2017 commented 1 week ago

Got it, thanks!