pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Non-zero CUDA device does not appear respected #393

Closed rohan-varma closed 4 months ago

rohan-varma commented 6 months ago

Launching a fine-tuning run with

torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 recipes/memory_efficient_finetune.py --config recipes/configs/alpaca_llama_mem_efficient_ft.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model batch_size=1 optimizer=SGD optim_in_bwd=False metric_logger_type=wandb --device=cuda:2 &> out2 &

and then dropping into pdb after the call to get_model and inspecting self._device yields the following:

(Pdb) print(self._device)
cuda:0

This seems to be because torchtune.utils.device.get_device does not handle non-zero CUDA ordinals.
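For illustration, a minimal sketch of what respecting an explicit ordinal could look like. This is a hypothetical helper, not the actual torchtune.utils.device.get_device implementation; the function name and fallback-to-LOCAL_RANK behavior are assumptions for the example.

```python
import os

def get_device(device_str: str = "cuda") -> str:
    """Hypothetical sketch: resolve a device string, honoring an
    explicit ordinal like "cuda:2" instead of discarding it."""
    if ":" in device_str:
        # An explicit ordinal was passed -- respect it as-is.
        return device_str
    # No ordinal given: fall back to the local rank torchrun assigns.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"{device_str}:{local_rank}"

print(get_device("cuda:2"))  # -> cuda:2 (ordinal preserved)
print(get_device("cuda"))    # -> cuda:0 when LOCAL_RANK is unset
```

Under this sketch, passing --device=cuda:2 would pin the run to ordinal 2 rather than silently falling back to cuda:0.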

kartikayk commented 6 months ago

@rohan-varma I spent some time trying to debug this last week as well. What I found was that irrespective of what CUDA_VISIBLE_DEVICES is set to, local_rank is always 0, so get_device actually handles this the right way, i.e. the run is kicked off on the right device. I do agree that this is very confusing, though. Something that we need to look at while we revamp tune run, @joecummings
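The remapping described above can be sketched without a GPU. With CUDA_VISIBLE_DEVICES masking, the process-local ordinal 0 maps to whichever physical GPU is listed first, so a run pinned via the environment variable lands on the right device even though local_rank is 0. The helper below is purely illustrative, not torchtune or torch API:

```python
import os

def physical_gpu(local_rank: int) -> int:
    """Illustrative sketch: map a process-local CUDA ordinal back to the
    physical GPU id, assuming CUDA_VISIBLE_DEVICES lists physical ids."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: local ordinals are already physical ids.
        return local_rank
    return int(visible.split(",")[local_rank])

# Pinning the process to physical GPU 2: inside the process that GPU
# appears as ordinal 0, so local_rank 0 is already the right device.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
print(physical_gpu(0))  # -> 2
```

This is why launching with CUDA_VISIBLE_DEVICES=2 works today even though get_device reports cuda:0.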