Closed nnethercott closed 6 months ago
Should also add this I guess
- `Accelerate` version: 0.27.2
- Platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.9.2
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 188.71 GB
- GPU type: NVIDIA L4
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'deepspeed_config_file': '/home/nathaniel/llava/dpo-slerp/zero2.json', 'zero3_init_flag': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
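The `zero2.json` file referenced in `deepspeed_config_file` above isn't shown in the thread. For context, a minimal ZeRO stage-2 DeepSpeed config typically looks something like the following; the field names come from the DeepSpeed config schema, but the values here are assumptions, not the reporter's actual file:

```json
{
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```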
Other package versions:
- transformers==4.38.1
- peft==0.8.2
- trl==0.7.11
I don't have experience with DeepSpeed, so I can't really help you here. But I wanted to mention that we're currently adding a PEFT + DS guide to the PEFT docs, maybe you can find something useful in there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Was this issue solved?
Couldn't find any similar issues in accelerate, peft, or trl, so I'm opening one here. When using the DPOTrainer on a single GPU with QLoRA I have no issues, but when I try to run the script with accelerate + deepspeed I keep getting `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!`.

Attached files:
- main.py
- zero2.json
- accelerate config.yaml
When I comment out `deepspeed = "./zero2.json"` in the TrainingArgs and execute the command below, I have no issues. If I instead run the script with either the accelerate CLI or the deepspeed CLI, both give me the same error and the following stack trace:
Stack Trace
Based on the Accelerate DeepSpeed integration guides and other tutorials I've seen, I expected the switch to DeepSpeed above to run without this error.
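Not a fix confirmed in this thread, but for anyone hitting the same trace: this error is often caused by loading the model with `device_map="auto"`, which shards layers across GPUs and makes tensors from different layers land on different devices. The commonly recommended placement for multi-GPU QLoRA without DeepSpeed is to map the whole model onto each process's own GPU; with DeepSpeed, placement is handled by the engine and `device_map` should be left unset. A minimal stdlib sketch (the helper name `qlora_device_map` is mine, not from any library):

```python
import os

def qlora_device_map():
    """Per-process device placement for multi-GPU QLoRA without DeepSpeed.

    LOCAL_RANK is set by torchrun / `accelerate launch` for each worker.
    Mapping "" (the root module) to one index puts the entire model on
    that process's GPU, so no op ever sees tensors on two devices.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return {"": local_rank}

# This dict would be passed as `device_map=` to `from_pretrained(...)`;
# when training through DeepSpeed, pass no device_map at all instead.
print(qlora_device_map())
```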