Closed: ptits closed this 2 weeks ago.
According to the log line `| distributed init (rank 0): env://, gpu 0`, inference is running only on the first GPU. You might prefix the command with `CUDA_VISIBLE_DEVICES=0,1` and see how it goes.
Closed as the issue seems to be solved.
If I run

```shell
torchrun --nproc_per_node 2 inference_multigpu.py --temp 5 --model_path "/home/jovyan/Pyramid-Flow/pyramid_flow_model" --sp_group_size 2
```

I get:
```
[2024-10-13 00:16:18,649] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://, gpu 0
Traceback (most recent call last):
  File "inference_multigpu.py", line 121, in <module>
    main()
  File "inference_multigpu.py", line 33, in main
    init_distributed_mode(args)
  File "/home/jovyan/Pyramid-Flow/trainer_misc/utils.py", line 90, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-10-13 00:16:23,675] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 91207 closing signal SIGTERM
[2024-10-13 00:16:23,839] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 91208) of binary: /opt/conda/envs/pyramid/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/pyramid/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
inference_multigpu.py FAILED
```
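The `invalid device ordinal` in the traceback means the local rank passed to `torch.cuda.set_device(args.gpu)` (here rank 1, from the second worker) is larger than the highest device index visible to the process. A stdlib-only sketch of that failure mode, using a hypothetical `pick_device` helper rather than the repository's actual `init_distributed_mode`:

```python
import os

def pick_device(num_visible_gpus: int) -> int:
    """Map this worker's LOCAL_RANK (set by torchrun) to a device index,
    failing early with a clearer message than 'invalid device ordinal'.
    Hypothetical helper; the result would be passed to torch.cuda.set_device.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank >= num_visible_gpus:
        raise RuntimeError(
            f"local rank {local_rank} needs GPU {local_rank}, but only "
            f"{num_visible_gpus} GPU(s) are visible; check CUDA_VISIBLE_DEVICES"
        )
    return local_rank

# torchrun sets LOCAL_RANK=1 for the second of two workers.
os.environ["LOCAL_RANK"] = "1"
print(pick_device(2))  # 1 -- fine when both GPUs are visible
# pick_device(1) would raise: rank 1 with only one visible GPU
# reproduces the failure reported above.
```

With only one GPU visible (the situation before adding `CUDA_VISIBLE_DEVICES=0,1`), worker 1 has no device 1 to bind to, which is exactly why the suggested prefix resolves the error.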