RuntimeError: CUDA error: invalid device ordinal

BitCalSaul commented 6 months ago

Thank you very much for your work. I have used RunIt on two servers and found it really useful. However, I ran into an issue with the commit from two years ago in your repo. When I run RunIt, it fails with "RuntimeError: CUDA error: invalid device ordinal". No matter how I change the script in config.txt, I still get this error. When I do the same thing on my other server, it runs properly.

This is my command:

python /home/user/RunIt/run_it.py --interpreter python --verbose --gpu-pool 0 1 --max-workers 2 --cmd-pool /home/user/RunIt/ProjCompressor/config.txt

This is the output:

Namespace(cmd_pool='/home/user/RunIt/ProjCompressor/config.txt', gpu_pool=[0], interpreter='python', max_used_ratio=0.5, max_workers=2, verbose=True)
[YOUR CMDS]
/home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[CREATE PROCESS OBJECTS]
[ID 0 INFO] NEW PROCESS SLOT ON GPU 0 IS CREATED!
[ID 0 INFO] /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[NEW TASK PID: 16243] CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6
[ID: 1/1 GPU: 0] Error executing job with overrides: ['epochs=50', 'dividor_value=100000', 'dgroup_id=0', 'dgroups=2', 'model.n_channels=80', 'model.n_blocks=21', 'batch_size=6']
[ID: 1/1 GPU: 0] Traceback (most recent call last):
[ID: 1/1 GPU: 0]   File "/home/user/Compressor/main.py", line 33, in main
[ID: 1/1 GPU: 0]     torch.cuda.set_device(f'cuda:{list(cfg.DDP.gpu)[0]}')
[ID: 1/1 GPU: 0]   File "/data/user/miniconda3/envs/compressor/lib/python3.8/site-packages/torch/cuda/__init__.py", line 404, in set_device
[ID: 1/1 GPU: 0]     torch._C._cuda_setDevice(device)
[ID: 1/1 GPU: 0] RuntimeError: CUDA error: invalid device ordinal
[ID: 1/1 GPU: 0] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[ID: 1/1 GPU: 0] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[ID: 1/1 GPU: 0] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] 
[ID: 1/1 GPU: 0] Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NO MORE COMMANDS, DELETE THE PROCESS SLOT!]
[ALL COMMANDS HAVE BEEN COMPLETED!]

lartpang commented 6 months ago

@BitCalSaul

CUDA_VISIBLE_DEVICES=0 python -u /home/user/Compressor/main.py epochs=50 dividor_value=100000 dgroup_id=0 dgroups=2 model.n_channels=80 model.n_blocks=21 batch_size=6

This is the actual command that is executed.

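Perhaps your code manually specifies a GPU with a non-zero index. With CUDA_VISIBLE_DEVICES=0, the process sees only one GPU, and it is exposed as cuda:0, so any other index is an invalid device ordinal.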
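To illustrate why a non-zero index fails here, the following is a minimal sketch of how the entries in CUDA_VISIBLE_DEVICES are renumbered inside the launched process. The helper name `visible_device_map` is hypothetical (it is not part of RunIt or PyTorch), and the sketch assumes the variable holds a plain comma-separated list of integer indices.

```python
from typing import Dict, Optional


def visible_device_map(env_value: Optional[str]) -> Optional[Dict[int, str]]:
    """Map in-process CUDA ordinals to the entries of CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only. Returns None when the
    variable is unset, meaning no restriction is applied.
    """
    if env_value is None:
        return None
    entries = [e.strip() for e in env_value.split(",") if e.strip()]
    # Inside the process, the listed devices are renumbered starting from 0.
    return dict(enumerate(entries))


if __name__ == "__main__":
    # The value used in the command shown above.
    mapping = visible_device_map("0")
    print(mapping)          # the only valid ordinal is 0
    print(1 in mapping)     # ordinal 1 does not exist in this process
```

With a value such as "2,3", the same mapping would give ordinals 0 and 1, which refer to physical GPUs 2 and 3. In other words, the index used in the code is always relative to the visible list, not to the machine.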

BitCalSaul commented 6 months ago

Yeah, it turns out I specified the GPU in my code. When I change the index from [1] to [0], it runs properly. Thank you.
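One way to catch this kind of mismatch earlier is to validate the configured index against the number of devices the process can actually see before calling `torch.cuda.set_device`. The sketch below is only an assumption about how that could look: `resolve_device` and `set_device_from_config` are hypothetical names, not functions from RunIt or from the training script in this issue.

```python
def resolve_device(requested: int, visible_count: int) -> str:
    """Return a 'cuda:N' string that is valid inside this process.

    Hypothetical helper: `requested` is the index from the config and
    `visible_count` is the number of GPUs visible to the process.
    """
    if visible_count < 1:
        raise RuntimeError("No CUDA device is visible to this process.")
    if not 0 <= requested < visible_count:
        raise ValueError(
            f"GPU index {requested} is out of range: only {visible_count} "
            f"device(s) visible, valid indices are 0..{visible_count - 1}. "
            "Check CUDA_VISIBLE_DEVICES."
        )
    return f"cuda:{requested}"


def set_device_from_config(requested: int) -> None:
    """Validate the configured index, then select the device."""
    # Imported lazily so resolve_device can be used without PyTorch installed.
    import torch

    device = resolve_device(requested, torch.cuda.device_count())
    torch.cuda.set_device(device)
```

Under a launcher that sets CUDA_VISIBLE_DEVICES to a single GPU per task, `torch.cuda.device_count()` returns 1, so a configured index of 1 would raise a readable `ValueError` instead of the CUDA error above.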

lartpang / RunIt, issue #2