E Traceback (most recent call last):
E File "/home/fanli/workspace/accelerate/src/accelerate/test_utils/scripts/external_deps/test_zero3_integration.py", line 52, in <module>
E main()
E File "/home/fanli/workspace/accelerate/src/accelerate/test_utils/scripts/external_deps/test_zero3_integration.py", line 48, in main
E init_torch_dist_then_launch_deepspeed()
E File "/home/fanli/workspace/accelerate/src/accelerate/test_utils/scripts/external_deps/test_zero3_integration.py", line 30, in init_torch_dist_then_launch_deepspeed
E torch.distributed.init_process_group(backend="nccl")
E File "/home/fanli/miniforge3/envs/acc-ut-ww23/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
E return func(*args, **kwargs)
E File "/home/fanli/miniforge3/envs/acc-ut-ww23/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
E func_return = func(*args, **kwargs)
E File "/home/fanli/miniforge3/envs/acc-ut-ww23/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
E default_pg, _ = _new_process_group_helper(
E File "/home/fanli/miniforge3/envs/acc-ut-ww23/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1513, in _new_process_group_helper
E raise RuntimeError("Distributed package doesn't have NCCL built in")
E RuntimeError: Distributed package doesn't have NCCL built in
E /home/fanli/miniforge3/envs/acc-ut-ww23/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:613: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
E warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
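The traceback shows the test script calling `torch.distributed.init_process_group(backend="nccl")` on a PyTorch build without NCCL. As a rough sketch of one way to avoid hard-coding the backend (not necessarily what this PR implements), the backend could be chosen from what the build reports as available. `pick_backend` and `detect_backend` are hypothetical helper names used only for illustration; `is_available()`, `is_nccl_available()` and `is_gloo_available()` are existing `torch.distributed` functions. The fallback to `gloo` is an assumption; a device-specific backend may be the better choice on other accelerators.

```python
def pick_backend(nccl_available: bool, gloo_available: bool) -> str:
    """Hypothetical helper: prefer NCCL, fall back to Gloo (assumed fallback)."""
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    raise RuntimeError("No supported torch.distributed backend is available")


def detect_backend() -> str:
    """Hypothetical helper: query the installed PyTorch build for its backends."""
    # Imported lazily so pick_backend stays usable without PyTorch installed.
    import torch.distributed as dist

    if not dist.is_available():
        raise RuntimeError("torch.distributed is not available in this build")
    return pick_backend(dist.is_nccl_available(), dist.is_gloo_available())
```

With this in place, the failing call could become `torch.distributed.init_process_group(backend=detect_backend())`, assuming the usual launcher environment variables are set.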
Please let me know if I need to add support for other devices as well. @muellerzr @SunMarc
What does this PR do?
The command above gives the following error: