Open WencongY opened 1 year ago
Have you fixed this? Thx!
This is caused by the missing model-parallel group initialization for torch.distributed, which requires at least something like:
import os
from typing import Tuple

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

def setup_model_parallel(rank, master_addr, master_port, world_size, backend='nccl') -> Tuple[int, int]:
    '''
    Initialize torch.distributed and the model-parallel groups.
    Note: this will not work with LightningModule.
    '''
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    # under mp.spawn LOCAL_RANK isn't set, so fall back to the spawned rank (single node)
    local_rank = int(os.environ.get("LOCAL_RANK", str(rank)))
    print("local_rank:", local_rank, "world_size:", world_size)
    torch.distributed.init_process_group(backend, rank=rank, world_size=world_size)
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)
    # seed must be the same in all processes
    torch.manual_seed(1)
    return local_rank, world_size
and a launcher that spawns one process per worker, for example:

import torch.multiprocessing as mp

def worker(rank, master_addr, master_port, world_size):
    local_rank, world_size = setup_model_parallel(rank, master_addr, master_port, world_size)
    model = Llama3()  # build the model only after the process groups exist

# master_addr, master_port and world_size must be defined for your setup
mp.spawn(worker, args=(master_addr, master_port, world_size), nprocs=world_size)
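If you launch with torchrun instead (as in the command in the original post), mp.spawn isn't needed: torchrun already starts the worker processes and exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK. Here is a minimal sketch of an env-based setup; the helper name setup_model_parallel_from_env is my own, and the fairscale import is an assumption rather than the repo's exact code:

import os
from typing import Tuple

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

def setup_model_parallel_from_env() -> Tuple[int, int]:
    # torchrun exports these for every worker it launches
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # env:// rendezvous: RANK, MASTER_ADDR and MASTER_PORT are also read from the environment
    torch.distributed.init_process_group("nccl")
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)
    torch.manual_seed(1)  # seed must be the same in all processes
    return local_rank, world_size

With something like this in place, the torchrun command below should work as-is, since --nproc_per_node 1 gives WORLD_SIZE=1 for the single-GPU 7B case.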
Hello team,
I'm trying to run the example.py file with 7B on a single GPU with this command:
torchrun --nproc_per_node 1 example.py --ckpt_dir ./llama_model/7B --tokenizer_path ./llama_model/tokenizer.model
but I've got the following error. Can you please advise how to handle this?
Thanks!