aitss2017 closed this issue 1 week ago.
I can reproduce this error, I'm going to investigate this.
@aitss2017 It seems Transformers v4.43.3 is already installed in the Docker image provided by Habana: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1
Thus, installing optimum-habana doesn't upgrade it. But Transformers v4.43.4 was released precisely to fix this issue (see the release notes here).
Can you try to run
pip install transformers==4.43.4
before running your script?
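Since the Docker image ships a pinned Transformers, one way to catch this early is to assert the installed version at the top of the training script. A minimal sketch, assuming only that 4.43.4 is the required minimum (as discussed above); the helper names are illustrative, not part of optimum-habana:

```python
from importlib import metadata

MIN_TRANSFORMERS = (4, 43, 4)  # first release containing the fix discussed above

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '4.43.3' into a comparable int tuple."""
    parts = []
    for piece in v.split("."):
        # keep only leading digits so suffixes like '4.43.4.dev0' don't crash
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def check_transformers() -> None:
    """Raise early if the preinstalled Transformers is too old."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        raise RuntimeError("transformers is not installed")
    if parse_version(installed) < MIN_TRANSFORMERS:
        raise RuntimeError(
            f"transformers {installed} found; please run "
            "`pip install transformers==4.43.4` first"
        )
```

Tuple comparison handles the pre-release case naturally: `(4, 43, 3)` sorts below `(4, 43, 4)`.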
Thanks, it is working now! By the way, I tried the original Transformers v4.43.3 with DeepSpeed ZeRO-2, and that works fine.
Another issue: when running evaluation after training, I hit the error below:

[INFO|trainer.py:1832] 2024-09-06 10:32:28,801 >> Num examples = 72
[INFO|trainer.py:1835] 2024-09-06 10:32:28,801 >> Batch size = 2
Internal Error: Received signal - Segmentation fault
(the line above is printed once per worker, 8 times in total)
I see
RuntimeError: collective nonSFG is not supported during hpu graph capturing
I'm not sure whether HPU graphs are compatible with ZeRO-3. It does work without --use_hpu_graphs_for_inference.
Another option is to call the script twice: once for training with --do_train and DeepSpeed ZeRO-3 and without --do_eval, and a second time for evaluation with --do_eval, without --do_train, and without DeepSpeed (it's a 7B-parameter model, so it fits on a single device at inference).
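Adapted from the reproduction command later in this thread, the two-pass workaround might look like the sketch below. Paths, config names, and environment variables are copied from that command; some flags are trimmed for brevity, and whether run_clm.py needs additional arguments when run standalone is an assumption:

```shell
# Pass 1: training only, multi-card with DeepSpeed ZeRO-3, no evaluation
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
    --use_deepspeed --world_size 8 run_clm.py \
    --model_name_or_path /DISK0/Mistral-7B-v0.3/ \
    --dataset_name /DISK0/alpaca \
    --do_train \
    --output_dir /tmp/mistral_7b \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --deepspeed ./llama2_ds_zero3_config.json \
    --gaudi_config_name gaudi_config.json \
    --overwrite_output_dir

# Pass 2: evaluation only, single device, no DeepSpeed,
# loading the checkpoint written by pass 1
python run_clm.py \
    --model_name_or_path /tmp/mistral_7b \
    --dataset_name /DISK0/alpaca \
    --do_eval \
    --output_dir /tmp/mistral_7b_eval \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name gaudi_config.json
```

Note that pass 2 keeps --use_hpu_graphs_for_inference, since the incompatibility above only arises in combination with ZeRO-3.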
Got it, thanks!
System Info

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
    --use_deepspeed --world_size 8 run_clm.py \
    --model_name_or_path /DISK0/Mistral-7B-v0.3/ \
    --dataset_name /DISK0/alpaca \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/mistral_7b \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --deepspeed ./llama2_ds_zero3_config.json \
    --gaudi_config_name gaudi_config.json \
    --trust_remote_code True \
    --overwrite_output_dir \
    --block_size 2048
Expected behavior
Fix the error: RuntimeError: shape '[-1, 0]' is invalid for input of size 134152192