intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Apache License 2.0

An error occurred during DPO on NVIDIA GPU #901

Closed · yoyo20010808 closed this issue 4 months ago

yoyo20010808 commented 11 months ago

I changed some parameters in the training code as instructed, but when I run DPO on 8×A6000 GPUs I get the errors below. If I understand correctly, Habana is only used for HPU training.

```
Traceback (most recent call last):
  File "/data1/yoyo/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/dpo_pipeline/dpo_clm.py", line 219, in <module>
    model_args, data_args, training_args, finetune_args = parser.parse_args_into_dataclasses()
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 132, in __init__
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/transformers/training_args.py", line 522, in __post_init__
    device_is_hpu = self.device.type == "hpu"
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/training_args.py", line 1901, in device
    return self._setup_devices
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/transformers/training_args.py", line 679, in _setup_devices
    self.distributed_state = GaudiPartialState(cpu=False, backend=self.ddp_backend)
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/accelerate/state.py", line 83, in __init__
    self.device = torch.device("cpu") if cpu else self.default_device
  File "/root/anaconda3/envs/intel_eft/lib/python3.10/site-packages/optimum/habana/accelerate/state.py", line 123, in default_device
    import habana_frameworks.torch.hpu as hthpu
ModuleNotFoundError: No module named 'habana_frameworks'
```
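
The failing frame is optimum-habana's GaudiPartialState unconditionally importing the Habana SDK. As a minimal diagnostic sketch (not part of the repo), one can check which of the two packages are actually visible in the environment:

```python
# Minimal diagnostic sketch (not from the repo): optimum.habana is a pip
# package, but it needs the habana_frameworks SDK, which only exists on
# Gaudi/HPU machines.
import importlib.util

for name in ("optimum.habana", "habana_frameworks"):
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        spec = None  # the parent package itself is missing
    print(name, "->", "found" if spec else "not found")

# On an NVIDIA-only box this typically prints:
#   optimum.habana -> found        (pip-installed, but unusable)
#   habana_frameworks -> not found (hence the ModuleNotFoundError above)
```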

This is the training script (I don't know how to set --device; I just added that parameter):

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dpo_clm.py \
    --model_name_or_path "/data1/yoyo/intel-extension-for-transformers/data/Mistral-7B-v0.1" \
    --output_dir "/data1/yoyo/intel-extension-for-transformers/out/dpo_test" \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-4 \
    --max_steps 1000 \
    --save_steps 10 \
    --lora_alpha 16 \
    --lora_rank 16 \
    --lora_dropout 0.05 \
    --dataset_name Intel/orca_dpo_pairs \
    --bf16 \
    --use_auth_token True \
    --use_habana False \
    --use_lazy_mode False \
    --device "auto"
```

Also, when I run SFT (finetune_neuralchat_v3.py), the accelerator is automatically set to CPU:

```
[INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)
"No device has been set. Use either --use_habana to run on HPU or --no_cuda to run on CPU."
```

Operating system: CentOS 7
Python: 3.10
torch: 2.1.0
CUDA: 12.2
optimum-habana: 1.9.0
transformers: 4.34.1
accelerate: 0.25.0

lkk12014402 commented 11 months ago

hi,

  1. For NVIDIA GPUs you don't need to install optimum-habana, because the code checks 'is_optimum_habana_available()' before taking the Habana device path. So you can uninstall that package, and then you don't need to set "--use_habana" or "--use_lazy_mode" (an adjusted launch command is sketched after this comment).
  2. The "DPOTrainer" inherits from the huggingface/transformers "Trainer", so device selection works the same way as there: if the environment has a GPU, the code detects and uses it; if "--use_cpu" is set, the code runs on CPU (see the sketch after this list).

Thanks~
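
For reference, this is roughly how the reporter's launch command might look after applying the advice above, with optimum-habana uninstalled and the --use_habana, --use_lazy_mode, and --device flags removed (an untested sketch; all other arguments are kept from the original script):

```
pip uninstall optimum-habana

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dpo_clm.py \
    --model_name_or_path "/data1/yoyo/intel-extension-for-transformers/data/Mistral-7B-v0.1" \
    --output_dir "/data1/yoyo/intel-extension-for-transformers/out/dpo_test" \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-4 \
    --max_steps 1000 \
    --save_steps 10 \
    --lora_alpha 16 \
    --lora_rank 16 \
    --lora_dropout 0.05 \
    --dataset_name Intel/orca_dpo_pairs \
    --bf16 \
    --use_auth_token True
```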

kevinintel commented 5 months ago

Hi, I will close this issue if there are no further concerns.