Closed: ptits closed this 2 weeks ago.
According to the log line `| distributed init (rank 0): env://, gpu 0`, inference is running only on the first GPU. You might prefix the command with `CUDA_VISIBLE_DEVICES=0,1` and see how it goes.
Closed as the issue seems to be solved.
If I run

```shell
torchrun --nproc_per_node 2 inference_multigpu.py --temp 5 --model_path "/home/jovyan/Pyramid-Flow/pyramid_flow_model" --sp_group_size 2
```

I get:
```
[2024-10-13 00:16:18,649] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://, gpu 0
Traceback (most recent call last):
  File "inference_multigpu.py", line 121, in <module>
    main()
  File "inference_multigpu.py", line 33, in main
    init_distributed_mode(args)
  File "/home/jovyan/Pyramid-Flow/trainer_misc/utils.py", line 90, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-10-13 00:16:23,675] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 91207 closing signal SIGTERM
[2024-10-13 00:16:23,839] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 91208) of binary: /opt/conda/envs/pyramid/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/pyramid/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pyramid/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
inference_multigpu.py FAILED
```
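The `invalid device ordinal` in the traceback means the local rank passed to `torch.cuda.set_device(args.gpu)` (here rank 1, from the second worker) is larger than the highest device index visible to the process. A stdlib-only sketch of that failure mode, using a hypothetical `pick_device` helper rather than the repository's actual `init_distributed_mode`:

```python
import os

def pick_device(num_visible_gpus: int) -> int:
    """Map this worker's LOCAL_RANK (set by torchrun) to a device index,
    failing early with a clearer message than 'invalid device ordinal'.
    Hypothetical helper; the result would be passed to torch.cuda.set_device.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank >= num_visible_gpus:
        raise RuntimeError(
            f"local rank {local_rank} needs GPU {local_rank}, but only "
            f"{num_visible_gpus} GPU(s) are visible; check CUDA_VISIBLE_DEVICES"
        )
    return local_rank

# torchrun sets LOCAL_RANK=1 for the second of two workers.
os.environ["LOCAL_RANK"] = "1"
print(pick_device(2))  # 1 -- fine when both GPUs are visible
# pick_device(1) would raise: rank 1 with only one visible GPU
# reproduces the failure reported above.
```

With only one GPU visible (the situation before adding `CUDA_VISIBLE_DEVICES=0,1`), worker 1 has no device 1 to bind to, which is exactly why the suggested prefix resolves the error.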