Did you try --nproc_per_node 1? Setting CUDA_VISIBLE_DEVICES=0 means the process only sees one GPU, but --nproc_per_node 4 tries to run on 4.
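A quick way to see the mismatch (just a sanity-check sketch, assuming GPU 0 exists on the machine):

```bash
# With CUDA_VISIBLE_DEVICES=0 only one device is visible to PyTorch,
# so launching 4 processes per node cannot map onto 4 GPUs.
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.device_count())"  # prints 1
```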
Another thing you'd want to change: in order to run on 2 machines, each with 4 GPUs, you'd have to pass the arguments:
--wrapyfi_device_idx 7 --wrapyfi_total_devices 8
instead of 1 and 2. This does not follow the fairscale distribution style with ranks and world sizes. Here, each node knows nothing about the others; it only knows how large the world is (in this case --wrapyfi_total_devices) and where to listen. Check out the 4-machine variant in the README and use the same approach for larger models/more machines: simply change the model weights location, set --wrapyfi_device_idx 7 and --wrapyfi_total_devices 8, and then for each instance running in a terminal, run the same command, reducing --wrapyfi_device_idx by 1 until you reach 0.
Remember to change CUDA_VISIBLE_DEVICES according to the GPU you'd like each partial LLaMA instance to run on. So if you run the last instance (idx = 7), you can set CUDA_VISIBLE_DEVICES to 0. On the other machine, you'd run, say, idx = 3 also on device 0 (here meaning the CUDA device), idx = 2 on CUDA device 1, and so on.
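Concretely, a sketch of all 8 launches could look like the following. The idx-to-GPU mapping, the loops, and setting WRAPYFI_ZEROMQ_SOCKET_IP to the same broker address on both machines are assumptions for illustration; in practice you can run each command in its own terminal as described above.

```bash
# Machine A (e.g. 192.168.2.14): instances idx 7..4, one per GPU
for IDX in 7 6 5 4; do
  GPU=$((7 - IDX))   # idx 7 -> cuda 0, idx 6 -> cuda 1, ...
  CUDA_VISIBLE_DEVICES="$GPU" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' \
    torchrun --nproc_per_node 1 example.py \
      --ckpt_dir /storage/workplace/share/llama_models/65B \
      --tokenizer_path ./tokenizer.model \
      --wrapyfi_device_idx "$IDX" --wrapyfi_total_devices 8 &
done

# Machine B: instances idx 3..0, mapped the same way (idx 3 -> cuda 0, idx 2 -> cuda 1, ...)
for IDX in 3 2 1 0; do
  GPU=$((3 - IDX))
  CUDA_VISIBLE_DEVICES="$GPU" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' \
    torchrun --nproc_per_node 1 example.py \
      --ckpt_dir /storage/workplace/share/llama_models/65B \
      --tokenizer_path ./tokenizer.model \
      --wrapyfi_device_idx "$IDX" --wrapyfi_total_devices 8 &
done
```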
I ran this on the 192.168.2.14 machine, but still got the same assertion error... If I comment out the assert line, I get a checkpoint loading error instead.
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 7 --wrapyfi_total_devices 8
I have 2 machines with 4 GPUs each. On 192.168.2.14, I ran
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 4 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
and got this report: