pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Non-zero CUDA device does not appear respected #393

Closed rohan-varma closed 4 months ago

rohan-varma commented 6 months ago

Launching a fine-tuning run with

torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 recipes/memory_efficient_finetune.py --config recipes/configs/alpaca_llama_mem_efficient_ft.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model batch_size=1 optimizer=SGD optim_in_bwd=False metric_logger_type=wandb --device=cuda:2 &> out2 &

and then dropping into pdb after the call to get_model and inspecting self._device yields the following:

(Pdb) print(self._device)
cuda:0

This seems to be because torchtune.utils.device.get_device does not handle non-zero CUDA ordinals.
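For illustration, a minimal sketch of what respecting an explicit ordinal could look like. This is a hypothetical helper, not the actual torchtune.utils.device.get_device implementation; the function name and fallback-to-LOCAL_RANK behavior are assumptions for the example.

```python
import os

def get_device(device_str: str = "cuda") -> str:
    """Hypothetical sketch: resolve a device string, honoring an
    explicit ordinal like "cuda:2" instead of discarding it."""
    if ":" in device_str:
        # An explicit ordinal was passed -- respect it as-is.
        return device_str
    # No ordinal given: fall back to the local rank torchrun assigns.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"{device_str}:{local_rank}"

print(get_device("cuda:2"))  # -> cuda:2 (ordinal preserved)
print(get_device("cuda"))    # -> cuda:0 when LOCAL_RANK is unset
```

Under this sketch, passing --device=cuda:2 would pin the run to ordinal 2 rather than silently falling back to cuda:0.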

kartikayk commented 6 months ago

@rohan-varma I spent some time trying to debug this last week as well. What I found was that irrespective of what CUDA_VISIBLE_DEVICES is set to, local_rank is always 0, so get_device actually handles this the right way, i.e. the run is kicked off on the right device. I do agree that this is very confusing, though. Something that we need to look at while we revamp tune run, @joecummings
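The remapping described above can be sketched without a GPU. With CUDA_VISIBLE_DEVICES masking, the process-local ordinal 0 maps to whichever physical GPU is listed first, so a run pinned via the environment variable lands on the right device even though local_rank is 0. The helper below is purely illustrative, not torchtune or torch API:

```python
import os

def physical_gpu(local_rank: int) -> int:
    """Illustrative sketch: map a process-local CUDA ordinal back to the
    physical GPU id, assuming CUDA_VISIBLE_DEVICES lists physical ids."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: local ordinals are already physical ids.
        return local_rank
    return int(visible.split(",")[local_rank])

# Pinning the process to physical GPU 2: inside the process that GPU
# appears as ordinal 0, so local_rank 0 is already the right device.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
print(physical_gpu(0))  # -> 2
```

This is why launching with CUDA_VISIBLE_DEVICES=2 works today even though get_device reports cuda:0.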