RuntimeError: CUDA error: invalid device ordinal

Bailey-24 commented 1 year ago

I ran command python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center

why this happened? and how to solve?

Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 266, in inference_worker
    agent = agent_class(**agent_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/src/models/agent_fbe_owl.py", line 74, in __init__
    center_only=center_only)
  File "/home/pi/Desktop/RL_learning/cow/src/models/localization/clip_owl.py", line 104, in __init__
    self.model = MyOwlViTForObjectDetection.from_pretrained(owl_from_pretrained).eval().to(device)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

pi@pi:~$ nvidia-smi
Mon May 22 14:44:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 41%   50C    P0    50W / 150W |   2613MiB /  8192MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 41%   46C    P8    13W / 150W |     16MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                501MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               73MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       45MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess      150MiB |
|    0   N/A  N/A    711196      G   ...d-files --enable-crashpad       21MiB |
|    0   N/A  N/A    931804      G   ...mviewer/tv_bin/TeamViewer        4MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess       76MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144      146MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072       87MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144      116MiB |
|    0   N/A  N/A   3544121      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+

(cow) pi@pi:~/Desktop/RL_learning/cow$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

(cow) pi@pi:~/Desktop/RL_learning/cow$ python scripts/test_torch_download.py
torch.cuda.is_available(): True
torch.tensor([1]).to(0): tensor([1], device='cuda:0')
Looks good.

I have followed the solution in StackOverflow or GitHub, but it also has the same problem. Is that the cuda vision is not correct? I'm eager to use both two GPUs to run.

sagadre commented 1 year ago

Hi! Thanks for the question and the interest in the work. When developing this code, I was using a machine with 8 GPUs. I just pushed a change to make the code compatible with more machines. See here: https://github.com/columbia-ai-robotics/cow/commit/833f421978c378c1f7b1196ace39e6d226c71090

Note: for a 2 GPU machine, you may also want to try running with -n 2 or -n 4 if you find -n 8 is running into CPU or memory bottlenecks.

Let me know if you are still running into problems and thanks for the issue!

Bailey-24 commented 1 year ago

I have the same problem as issue4 I ran command python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center, after having the same problem as issue4, I change the timeout from 1000 to 10000, but the result is same.

here is the log after I enter ctrl + c

Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll                                                                                                                 
Process Process-1:                                                                                                                                                                                                 
Process Process-3:                                                                                                                                                                                                 
Process Process-7:                                                                                                                                                                                                 
Process Process-5:                                                                                                                                                                                                 
Process Process-2:                                                                                                                                                                                                 
Process Process-6:                                                                                                                                                                                                 
Process Process-4:                                                                                                                                                                                                 
Process Process-8:                                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
KeyboardInterrupt
KeyboardInterrupt
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

my computer isn't out of memory

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 42%   47C    P5    27W [/](https://file+.vscode-resource.vscode-cdn.net/) 150W |   6109MiB [/](https://file+.vscode-resource.vscode-cdn.net/)  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 39%   43C    P8    12W [/](https://file+.vscode-resource.vscode-cdn.net/) 150W |   4849MiB [/](https://file+.vscode-resource.vscode-cdn.net/)  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   [/usr/lib/xorg/Xorg](https://file+.vscode-resource.vscode-cdn.net/usr/lib/xorg/Xorg)                 65MiB |
|    0   N/A  N/A      3181      G   [/usr/lib/xorg/Xorg](https://file+.vscode-resource.vscode-cdn.net/usr/lib/xorg/Xorg)                530MiB |
|    0   N/A  N/A      3381      G   [/usr/bin/gnome-shell](https://file+.vscode-resource.vscode-cdn.net/usr/bin/gnome-shell)               64MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       21MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess       10MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess      167MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144       30MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072      126MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144       25MiB |
|    0   N/A  N/A   3798582      G   ...626843.log --shared-files      120MiB |
|    0   N/A  N/A   3823746      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823847      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823947      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3824047      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   [/usr/lib/xorg/Xorg](https://file+.vscode-resource.vscode-cdn.net/usr/lib/xorg/Xorg)                  3MiB |
|    1   N/A  N/A      3181      G   [/usr/lib/xorg/Xorg](https://file+.vscode-resource.vscode-cdn.net/usr/lib/xorg/Xorg)                  3MiB |
|    1   N/A  N/A   3823797      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823895      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823997      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3824097      C   ...onda3/envs/cow/bin/python     1207MiB |
+-----------------------------------------------------------------------------+

Bailey-24 commented 1 year ago

I found it very slow to run these two lines

sagadre commented 1 year ago

Are the processes running at all or are the threads locking?

Bailey-24 commented 1 year ago

Yes, the processes running at all, I use -n 1 to debug.

About threads locking. After ask GPT,

In your code, you have separate processes that are interacting with the send_queue and receive_queue. Each process accesses these queues independently, and the Queue implementation handles the necessary synchronization to ensure safe access. Therefore, you don't need to manually handle locks or synchronization between the processes in this particular code snippet. The Queue object takes care of these aspects for you, allowing concurrent access from multiple processes without causing conflicts.

so I didn't lock the thread manually.

Bailey-24 commented 1 year ago

After running for whole night.