**Closed** · Bailey-24 closed this issue 1 year ago
Hi! Thanks for the question and the interest in the work. When developing this code, I was using a machine with 8 GPUs. I just pushed a change to make the code compatible with more machines. See here: https://github.com/columbia-ai-robotics/cow/commit/833f421978c378c1f7b1196ace39e6d226c71090
Note: on a 2 GPU machine, you may also want to try running with `-n 2` or `-n 4` if you find `-n 8` is running into CPU or memory bottlenecks.
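As a rough sketch of matching `-n` to the hardware (not part of the repo; `suggested_worker_count` is a hypothetical helper, and the fallback to 1 is an assumption), you could derive the worker count from the number of visible GPUs:

```python
import subprocess

def suggested_worker_count(max_workers=8):
    """Pick a -n value no larger than the number of GPUs nvidia-smi reports."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
        n_gpus = len([line for line in out.stdout.splitlines() if line.strip()])
    except (OSError, subprocess.CalledProcessError):
        n_gpus = 1  # no nvidia-smi available; fall back to a single worker
    return max(1, min(max_workers, n_gpus))

print(suggested_worker_count())
```

You could then pass the result as `-n` when launching `pasture_runner.py`.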
Let me know if you are still running into problems and thanks for the issue!
I have the same problem as issue #4. I ran the command `python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center`, and after hitting the same hang, I changed the timeout from `1000` to `10000`, but the result is the same.
Here is the log after I press Ctrl+C:
```
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
Process Process-1:
Process Process-3:
Process Process-7:
Process Process-5:
Process Process-2:
Process Process-6:
Process Process-4:
Process Process-8:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
KeyboardInterrupt
KeyboardInterrupt
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
```
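For what it's worth, all eight worker stacks end in `open(self.server_pipe_path, "rb")` inside `fifo_server.py`. Opening a FIFO for reading blocks until some process opens the other end for writing, so if the AI2-THOR Unity process never comes up, every worker hangs there indefinitely. A minimal sketch of that blocking behavior (the pipe name and 0.5s delay are made up for illustration):

```python
import os
import tempfile
import threading
import time

# Create a named pipe like the one fifo_server.py waits on.
fifo_path = os.path.join(tempfile.mkdtemp(), "demo.fifo")
os.mkfifo(fifo_path)

def late_writer():
    """Simulate a slow (or absent) Unity process attaching to the pipe."""
    time.sleep(0.5)
    with open(fifo_path, "wb") as w:
        w.write(b"ready")

threading.Thread(target=late_writer, daemon=True).start()

start = time.time()
with open(fifo_path, "rb") as r:  # blocks until late_writer opens the write end
    payload = r.read()
elapsed = time.time() - start

print(f"open() blocked for ~{elapsed:.1f}s, got {payload!r}")
```

If the writer never shows up, the `open` call never returns, which matches the hang seen here.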
My computer isn't out of memory:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04    Driver Version: 525.116.04    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 42%   47C    P5    27W / 150W |   6109MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 39%   43C    P8    12W / 150W |   4849MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                530MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       21MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess       10MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess      167MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144       30MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072      126MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144       25MiB |
|    0   N/A  N/A   3798582      G   ...626843.log --shared-files      120MiB |
|    0   N/A  N/A   3823746      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823847      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823947      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3824047      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A   3823797      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823895      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823997      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3824097      C   ...onda3/envs/cow/bin/python     1207MiB |
+-----------------------------------------------------------------------------+
```
I found running these two lines very slow.
Are the processes running at all or are the threads locking?
Yes, the processes are running; I used `-n 1` to debug.
As for thread locking: after asking GPT, I got:

> In your code, you have separate processes that are interacting with the send_queue and receive_queue. Each process accesses these queues independently, and the Queue implementation handles the necessary synchronization to ensure safe access. Therefore, you don't need to manually handle locks or synchronization between the processes in this particular code snippet. The Queue object takes care of these aspects for you, allowing concurrent access from multiple processes without causing conflicts.

so I didn't add any manual locking.
After running for a whole night, it still hadn't finished.
Would you please provide a Docker image?
I have the same issue as @Bailey-24.
I think I managed to run the experiment, maybe because I use an 8-GPU machine. But I have another question: how do I visualize the runs? Does it have a GUI?
Can it only run with 8 GPUs? I'd also like to know about the GUI.
@Bailey-24 was your only change switching to an 8 GPU machine? Re: a GUI script, I will work on one and push it later today.
Yes, my only change was switching to an 8 GPU machine.
Interesting, will close this issue, but will open a new issue for <8 GPU testing
I ran the command `python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center`.
Why did this happen, and how can I solve it? I have followed the solutions on Stack Overflow and GitHub, but the problem remains. Could it be that the CUDA version is not correct? I'm eager to run with both GPUs.