Closed xiezhipeng-git closed 9 months ago
Does this work without Ray? how does PyTorch find GPUs in your WSL setup when ray isn't involved?
Yeah it is unrelated to ray I think. Your driver probably has no visibility to GPU (you may need to set env var CUDA_VISIBLE_DEVICES yourself)
@rkooo567
I have already set the env var; see my code: os.environ['CUDA_VISIBLE_DEVICES']=localGpu_str followed by print("CUDA_VISIBLE_DEVICES",os.getenv("CUDA_VISIBLE_DEVICES")). The output is CUDA_VISIBLE_DEVICES 0, which shows that the env var is set and that cuda:0 is selected.
As for how PyTorch finds GPUs when Ray isn't involved: do you mean something like os.getenv("CUDA_VISIBLE_DEVICES")?
To be clear, ray.get_gpu_ids() is not an API for the driver. It is expected to return nothing if you run it outside a Ray task/actor (that is, outside a remote function or class).
Also, the task/actor runs in a different process, so it is not ideal to declare the device variable outside the task definition. The driver (the Python script) runs in a different process from a task (f in your script).
# Move this inside the task f
@ray.remote
def f():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Lastly, if you specify num_gpus for a Ray task or actor, Ray automatically sets the CUDA_VISIBLE_DEVICES env var for you.
@ray.remote(num_gpus=1)
def f(i):
    # This task should have the `CUDA_VISIBLE_DEVICES` env var set
    ...
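The point about separate processes can be illustrated without Ray at all. The following is a minimal sketch using only the standard library: it is an analogy for "a worker sees the environment it was launched with", not a description of how Ray starts its workers. The helper name gpu_mask_seen_by_child is made up for this example.

```python
import os
import subprocess
import sys

# Code run by the child process: report the GPU mask it was started with.
CHILD_CODE = "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))"


def gpu_mask_seen_by_child(mask):
    """Start a fresh Python process with CUDA_VISIBLE_DEVICES set to `mask`
    (or unset when `mask` is None) and return what that process reports."""
    env = dict(os.environ)
    env.pop("CUDA_VISIBLE_DEVICES", None)
    if mask is not None:
        env["CUDA_VISIBLE_DEVICES"] = mask
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child sees the mask chosen by whoever launched it.
print(gpu_mask_seen_by_child("0"))
# Without a mask from the launcher, the child sees none.
print(gpu_mask_seen_by_child(None))
```

The same reasoning is why a device object built in the driver says little about what a task will see: the task evaluates torch.cuda.is_available() in its own process, with its own environment.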
Thanks, I see: the device needs to be created inside the Ray function.
If I try @ray.remote(num_cpus = cpu_count(),num_gpus=1), I get:
(raylet) /home/xzpwsl2/.local/lib/python3.10/site-packages/ray/dashboard/agent.py:51: DeprecationWarning: There is no current event loop
(raylet) aiogrpc.init_grpc_aio()
(autoscaler +28s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +28s) Warning: The following resource request cannot be scheduled right now: {'CPU': 32.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(autoscaler +1m38s) Warning: The following resource request cannot be scheduled right now: {'CPU': 32.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
[2023-03-29 14:10:49,629] [INFO] [pid=3124][
When the program does run, it takes about 52 times as long as with plain @ray.remote, which uses only the CPU and reports ('用时5.508474111557007',) (用时 means "elapsed time", in seconds). This is the code:
@ray.remote(num_cpus = cpu_count(),num_gpus=1)
def f(i):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    torch.rand(1, 1).to(device)
    time.sleep(1)
    # print("@ray.remote","f",2,i)
    return i*i
With @ray.remote(num_cpus = cpu_count(),num_gpus=1/cpu_count()) it reports ('用时275.5370864868164',). Also, when the device is created inside the Ray function but the decorator is plain @ray.remote, only the CPU is used. Is that normal?
If I use @ray.remote(num_cpus = 1,num_gpus=1/cpu_count()), it raises an error:
ray::f() (pid=31112, ip=172.28.246.219)
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 39, in f
    torch.rand(1, 1).to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 45, in
It also prints a warning:
/home/xzpwsl2/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
(f pid=14629) warnings.warn("Can't initialize NVML")
But I have already installed CUDA Toolkit 11.8 and 12.0; PyTorch only uses 11.8.
@rkooo567 @cadedaniel How can I use the GPU with Ray? Is this a Ray bug, or is my usage incorrect?
Yes, but the ray.get_gpu_ids() function doesn't work from drivers (only in Ray tasks/actors). You need to find some other function or move the call into a task/actor.
@cadedaniel Now I see why ray.get_gpu_ids() returns []. Then how can I use the GPU in Ray? I tried four different Ray configurations. Whenever I set a GPU count, the program either errors out or runs very slowly.
import ray

@ray.remote(num_gpus=1)
def f():
    import os
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

ray.get(f.remote())
What does this print?
It prints 0. If I set num_gpus, the task can use the GPU, but it is very slow: the run takes 315.9938106536865 s, whereas with the CPU only it takes 5 s.
@rkooo567 - please assign a priority and remove the triage label.
@xiezhipeng-git what's the time you are measuring?
@ray.remote
def f(i):
    torch.rand(1, 1).to(device)
    time.sleep(1)
    # print("@ray.remote","f",2,i)
    return i*i

bt = time.time()
futures = [f.remote(i) for i in range(100)]
logger.info(ray.get(futures))
et = time.time()
print(f"用时{et-bt}")  # "用时" means "elapsed time"
I am measuring the wall-clock time of this loop. Adding a wait to the computation is an effective way to simulate the parallel capabilities of GPUs and CPUs.
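Because each task mostly sleeps for one second, the measured time is dominated by how many tasks can hold their resources at once. The sketch below is a rough model of that, assuming Ray's documented behaviour that a task runs only when its requested logical resources are free and that a task requests 1 CPU by default. The function names and the 32-CPU, 1-GPU machine are illustrative, and the model ignores worker start-up and CUDA initialisation, so real runs will be slower.

```python
import math


def max_concurrent_tasks(total_cpus, total_gpus, num_cpus=1.0, num_gpus=0.0):
    """Upper bound on tasks that can hold their logical resources at once."""
    limits = []
    if num_cpus > 0:
        limits.append(math.floor(total_cpus / num_cpus + 1e-9))
    if num_gpus > 0:
        limits.append(math.floor(total_gpus / num_gpus + 1e-9))
    if not limits:
        raise ValueError("the task must request at least one resource")
    return min(limits)


def ideal_wall_time(n_tasks, seconds_per_task, concurrency):
    """Lower bound on elapsed time when tasks run in full 'waves'."""
    return math.ceil(n_tasks / concurrency) * seconds_per_task


# Illustrative machine: 32 logical CPUs and 1 GPU.
configs = {
    "@ray.remote": dict(),
    "num_gpus=1": dict(num_gpus=1),
    "num_cpus=32, num_gpus=1/32": dict(num_cpus=32, num_gpus=1 / 32),
    "num_cpus=1, num_gpus=1/32": dict(num_cpus=1, num_gpus=1 / 32),
}
for label, kwargs in configs.items():
    concurrency = max_concurrent_tasks(32, 1, **kwargs)
    print(label, concurrency, ideal_wall_time(100, 1.0, concurrency))
```

Under these assumptions, any configuration that reserves the whole GPU or all CPUs per task runs the 100 tasks one at a time, which is at least 100 s before any overhead, while the CPU-only decorator finishes in about four waves.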
Can we prioritize solving this problem? Because of this issue, training on my single machine is faster than training on a large Ray cluster. On a single 24 GB GPU I can solve Atari Pong in 60 seconds. With Ray, multiple processes either report errors or slow down. This is a very serious problem: until GPU support works, Ray is no better than a single machine.
I don't think this is a bug. Can you read https://docs.ray.io/en/master/ray-core/tasks/using-ray-with-gpus.html. There are 2 things you should know here.
import ray
import time
from logger import logger
from multiprocessing import cpu_count
import os
import torch

# GPUs and CPUs are auto-detected by ray when you call ray.init()
ray.init()
# -----> this doesn't work because get_gpu_ids cannot be used in a driver.
# print('ray.init()',ray.get_gpu_ids())

@ray.remote(num_gpus=1)
def f(i):
    print(ray.get_gpu_ids())
    # Here, CUDA_VISIBLE_DEVICES should be correctly set
    print(os.getenv("CUDA_VISIBLE_DEVICES"))
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    torch.rand(1, 1).to(device)
    time.sleep(1)
    # print("@ray.remote","f",2,i)
    return i*i

bt = time.time()
futures = [f.remote(i) for i in range(100)]
logger.info(ray.get(futures))
et = time.time()
logger.info(f"用时{et-bt}")
The two things to know:
- You should specify num_gpus to ray.remote. Then, inside the actor/task, Ray automatically sets CUDA_VISIBLE_DEVICES.
- get_gpu_ids doesn't work inside a driver (the Python script). It is only available from workers.
Please look at the history above. I have tried multiple GPU configurations, including num_gpus=1. Every one of them either slows the program down or reports an error. Isn't that a bug?
If I use @ray.remote(num_gpus=1/cpus), must futures = [f.remote(i) for i in range(100)] be changed to futures = [f.remote(i) for i in range(cpus)]? The documentation says:
It is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage.
If I try
@ray.remote(num_cpus = 1,num_gpus=1/cpus)
futures = [f.remote(i) for i in range(cpus)]
it takes 4.756626605987549 s.
This indicates that the actual problem is that Ray does not automatically release GPU memory and keep running, whereas on the CPU memory is either handled automatically or used so sparingly that the problem does not appear. I have finally found the specific cause and a workaround. Thanks. Note that this speed was achieved by reducing the computational load by 68% (range(cpus) instead of range(100)). But this basically meets expectations, because on my machine the GPU clock frequency is 2.6 and the CPU's is 5.8 (this may not be the real reason, because of the time.sleep(1)). There is another bug if I try
@ray.remote(num_cpus = 1,num_gpus=1/100)
futures = [f.remote(i) for i in range(100)]
File "python\ray\_raylet.pyx", line 807, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 841, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: Unable to allocate internal buffer.
traceback: Traceback (most recent call last):
File "D:\my\env\python3.10.10\lib\site-packages\ray\_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "D:\my\env\python3.10.10\lib\site-packages\ray\_private\serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "D:\my\env\python3.10.10\lib\site-packages\ray\_private\serialization.py", line 204, in _deserialize_msgpack_data
msgpack_data, pickle5_data = split_buffer(data)
File "python\ray\includes/serialization.pxi", line 206, in ray._raylet.split_buffer
File "msgpack\_unpacker.pyx", line 372, in msgpack._cmsgpack.Unpacker.__init__
MemoryError: Unable to allocate internal buffer.
Why does 1/32 work but not 1/100? This program uses almost no GPU memory. Is it because the GPU share is limited by the number of CPUs when num_cpus=1? After testing, the smallest GPU fraction that works is num_gpus=1/cpus.
None of these problems occur if only the CPU is used.
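One thing worth checking against the "almost no GPU memory" observation: every process that initialises CUDA holds its own context on the device, even if it only allocates a 1x1 tensor, so many concurrent workers can add up. The sketch below is a back-of-the-envelope budget, not a diagnosis of this issue. The per-process figure is a hypothetical placeholder that would have to be measured (for example with nvidia-smi while a single task is running); the 24564 MiB total and 2040 MiB in use are the values from the nvidia-smi output posted in this thread.

```python
def max_processes_by_gpu_memory(total_mib, already_used_mib, per_process_mib):
    """How many processes fit if each one holds `per_process_mib` on the GPU."""
    if per_process_mib <= 0:
        raise ValueError("per_process_mib must be positive")
    free_mib = max(total_mib - already_used_mib, 0)
    return free_mib // per_process_mib


# Hypothetical per-process footprint (CUDA context plus framework overhead).
# This number is an assumption for illustration; measure it on your machine.
ASSUMED_PER_PROCESS_MIB = 500

print(max_processes_by_gpu_memory(24564, 2040, ASSUMED_PER_PROCESS_MIB))
```

If the number of tasks Ray is allowed to run at once exceeds this bound, an out-of-memory error would be expected regardless of how small each tensor is.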
Hmm probably windows specific issue. I couldn't reproduce it in linux. cc @mattip to follow up
Just to be clear: this code raises a MemoryError using ray 2.3.1 with python3.10 on windows10, but if you replace the 1/100 with 1/32 it works?
import ray
import time
from logger import logger
from multiprocessing import cpu_count
import os
import torch

# GPUs and CPUs are auto-detected by ray when you call ray.init()
ray.init()
# -----> this doesn't work because get_gpu_ids cannot be used in a driver.
# print('ray.init()',ray.get_gpu_ids())

@ray.remote(num_cpus = 1,num_gpus=1/100)
def f(i):
    print(ray.get_gpu_ids())
    # Here, CUDA_VISIBLE_DEVICES should be correctly set
    print(os.getenv("CUDA_VISIBLE_DEVICES"))
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    torch.rand(1, 1).to(device)
    time.sleep(1)
    # print("@ray.remote","f",2,i)
    return i*i

bt = time.time()
futures = [f.remote(i) for i in range(100)]
logger.info(ray.get(futures))
et = time.time()
logger.info(f"time {et-bt}")
May I ask for more information: What GPU and cuda version you are using? What version of pytorch? How much system memory do you have?
With num_gpus=1/100 and range(100), the program and VS Code crash; VS Code then restarts and stops at a RayTaskError. With num_gpus=1/cpus and range(cpus), the speed was achieved by reducing the computational load by 68%: it takes 4.756 s, which is slow compared with the CPU-only run of range(100) in 5 s.
- Machine 1: NVIDIA GeForce RTX 4090 (24 GB), CUDA Version 12.1, torch 2.0.0+cu118, 64 GB system memory, Windows 11, Python 3.10.10. WSL (Ubuntu 22.04, Python 3.10.6) has the same problem.
- Machine 2: NVIDIA GeForce RTX 1070, Windows 10, Python 3.9.13, cu118. Same problem.
Exception occurred: RayTaskError(RuntimeError)
ray::f() (pid=9443, ip=172.28.246.219)
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 47, in f
    torch.rand(1, 1).to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 53, in
After upgrading Ray to 2.5.0, this issue has become even more serious: 1/32 no longer works either. Running the above program with the GPU crashes the computer with a blue screen of death.
This is just the simplest example of Ray using a GPU. Can we prioritize solving it? On Windows, the Ray GPU functionality cannot be used at all. The priority for this should be very high, right?
@mattip are you able to reproduce this?
@xiezhipeng-git can you post a reproducible script for us to try?
Please look at the history above; there is a reproduction script in it. The blue screen of death happens only some of the time. The RuntimeError: CUDA error: out of memory happens 100% of the time.
import ray
import time
# from logger import logger
from multiprocessing import cpu_count
import os
import torch

# device = torch.device('cuda' if torch.cuda.
# is_available() else 'cpu')
localGpu_num = torch.cuda.device_count()
localGpu_str = str(list(range(localGpu_num))).strip('[]')
os.environ['CUDA_VISIBLE_DEVICES']=localGpu_str
print("CUDA_VISIBLE_DEVICES",os.getenv("CUDA_VISIBLE_DEVICES"))
ray.init(num_cpus=cpu_count(), num_gpus=1)

# need=1.0/cpu_count()
# @ray.remote(num_gpus=need)
# @ray.remote
@ray.remote(num_cpus = 1,num_gpus=1/24)
# @ray.remote(num_gpus=1)
# @ray.remote(num_gpus=1/cpu_count())
def f(i):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    if i==1 or i==2:
        print("@ray.remote里:",device,i,torch.rand(1, 1).to(device))
    torch.rand(1, 1).to(device)
    time.sleep(1)
    return i*i

bt = time.time()
futures = [f.remote(i) for i in range(100)]
print(ray.get(futures))
et = time.time()
print(f"用时{et-bt}")
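A side note on how the script builds localGpu_str: stripping the brackets from a list's repr works for one GPU, but with two or more it leaves a space after each comma. Whether the CUDA runtime tolerates that is not something this thread establishes, so a comma-joined string is the more conservative choice. A minimal comparison, with function names made up for this example:

```python
def device_mask_from_list_repr(n_devices):
    """The approach used in the script: strip brackets from the list repr."""
    return str(list(range(n_devices))).strip('[]')


def device_mask_joined(n_devices):
    """Comma-joined indices with no whitespace."""
    return ",".join(str(i) for i in range(n_devices))


for n in (1, 2, 3):
    print(n, repr(device_mask_from_list_repr(n)), repr(device_mask_joined(n)))
```

On the single-GPU machines discussed here both approaches give "0", so this does not explain the reported error. It only matters once a second GPU is present.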
The priority for this should be very high, right?
Yes, the priority should be high, but only if we can reproduce it. Many other people are using ray on windows and not seeing this error. Perhaps you have a hardware or OS problem: ray + torch should not cause a BSOD error even if the task does not complete properly.
I cannot reproduce this. Here is what I did (where test_ray.py is the script just above, with the syntax errors fixed and changed to use only ASCII). Note, as stated above, that CUDA_VISIBLE_DEVICES 0 is expected outside of a ray.remote call. The loop runs for the expected 100 iterations from 0 to 99:
>python310.exe -mvenv \temp\ray_throwaway
>\temp\ray_throwaway\Scripts\activate
(ray_throwaway) python -m pip install torch --index-url https://download.pytorch.org/whl/cu117
(ray_throwaway) python -m pip install ray==2.5.1
(ray_throwaway) python test.py
(ray_throwaway) d:\pypy_stuff>python \temp\test_ray.py
File "d:\temp\test_ray.py", line 25
if i==1 or i==2:
^
IndentationError: unindent does not match any outer indentation level
(ray_throwaway) d:\pypy_stuff>python \temp\test_ray.py
CUDA_VISIBLE_DEVICES 0
2023-06-29 05:38:01,149 INFO worker.py:1636 -- Started a local Ray instance.
(f pid=15708) @ray.remote: cuda 2
(f pid=15708) tensor([[0.8174]], device='cuda:0')
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
time 27.490281343460083
(f pid=15672) @ray.remote: cuda 1
(f pid=15672) tensor([[0.9247]], device='cuda:0')
@xiezhipeng-git you need to give us more information in order to help you. Please carefully answer all of the following:
- what is the complete output of nvidia-smi?
- what operating system are you using?
- what version of python are you using?
- what does `pip list` output?
I tried this code on several machines and Python versions on Windows, and the error is the same. I tried these decorators:
@ray.remote(num_gpus=need)
@ray.remote
@ray.remote(num_cpus = 1,num_gpus=1/24)
@ray.remote(num_gpus=1)
@ray.remote(num_gpus=1/cpu_count())
Every variant that requests a GPU fails. Only the CPU-only one runs successfully, and the task takes 5 s.
- what is the complete output of nvidia-smi?
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.47                 Driver Version: 531.68       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0  On |                  Off |
|  0%   35C    P5               27W / 490W|   2040MiB / 24564MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        22      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
- what operating system are you using? Windows 11 and Windows 11 WSL2 (Ubuntu 22.04). The same error occurs on Windows 10 and Windows 10 WSL2 (Ubuntu 20.04).
- what version of python are you using? Windows 10: 3.9.13. Windows 10 WSL2: 3.9.13. Windows 11 WSL2: 3.10.6. Windows 11: 3.10.10.
- what does `pip list` output?
Package Version
absl-py 1.4.0 aiosignal 1.3.1 albumentations 1.3.0 ale-py 0.8.1 anyio 3.6.2 appdirs 1.4.4 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 arrow 1.2.3 astroid 2.15.0 asttokens 2.2.1 asyncio 3.4.3 attrs 22.2.0 backcall 0.2.0 beautifulsoup4 4.12.2 bitmath 1.3.3.1 bleach 6.0.0 blinker 1.6.2 box2d-py 2.3.5 brax 0.9.1 cached-property 1.5.2 cachetools 5.3.0 certifi 2022.12.7 cffi 1.15.1 chardet 4.0.0 charset-normalizer 2.1.1 chex 0.1.7 click 8.1.3 cloudpickle 2.2.1 cmake 3.25.0 colorlog 6.7.0 comm 0.1.3 command-not-found 0.3 concurrent-log-handler 0.9.20 contourpy 1.0.7 cryptography 3.4.8 cupy-cuda12x 12.0.0 cycler 0.11.0 Cython 0.29.33 dbus-python 1.2.18 debugpy 1.6.6 decorator 4.4.2 defusedxml 0.7.1 dill 0.3.6 distlib 0.3.6 distrax 0.1.3 distro 1.7.0 distro-info 1.1build1 dm-env 1.6 dm-tree 0.1.8 docker-pycreds 0.4.0 efficientnet-pytorch 0.7.1 enum-tools 0.9.0.post1 envpool 0.8.2 etils 1.3.0 evdev 1.6.1 evosax 0.1.4 executing 1.2.0 fasteners 0.15 fastjsonschema 2.16.3 fastrlock 0.8.1 filelock 3.10.4 Flask 2.3.2 Flask-Cors 3.0.10 flax 0.6.10 fonttools 4.39.2 fqdn 1.5.1 frozenlist 1.3.3 fsspec 2023.4.0 gast 0.5.4 gitdb 4.0.10 GitPython 3.1.31 glcontext 2.3.7 glfw 1.12.0 google-auth 2.16.3 google-auth-oauthlib 0.4.6 graphviz 0.20.1 grpcio 1.51.3 gym 0.26.2 gym-notices 0.0.8 gym-super-mario-bros 7.4.0 gym3 0.3.3 gymnasium 0.27.1 gymnasium-notices 0.0.1 gymnax 0.0.6 hbutils 0.8.2 httplib2 0.20.2 huggingface-hub 0.13.2 idna 3.4 imageio 2.26.1 imageio-ffmpeg 0.3.0 importlib-metadata 4.6.4 importlib-resources 5.12.0 iniconfig 2.0.0 ipykernel 6.22.0 ipython 8.11.0 ipython-genutils 0.2.0 ipywidgets 8.0.4 isoduration 20.11.0 isort 5.12.0 itsdangerous 2.1.2 jax 0.4.11 jax-jumpy 1.0.0 jaxlib 0.4.11+cuda12.cudnn88 jaxopt 0.7 jedi 0.18.2 jeepney 0.7.1 Jinja2 3.1.2 joblib 1.2.0 jsonpointer 2.3 jsonschema 4.17.3 jupyter 1.0.0 jupyter_client 8.1.0 jupyter-console 6.6.3 jupyter_core 5.3.0 jupyter-events 0.6.3 jupyter_server 2.5.0 jupyter_server_terminals 0.4.4 jupyterlab-pygments 0.2.2 
jupyterlab-widgets 3.0.5 kaggle 1.5.13 keyring 23.5.0 kiwisolver 1.4.4 launchpadlib 1.10.16 lazr.restfulclient 0.14.4 lazr.uri 1.0.6 lazy_loader 0.1 lazy-object-proxy 1.9.0 lit 15.0.7 lz4 4.3.2 Markdown 3.4.3 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.7.1 matplotlib-inline 0.1.6 mccabe 0.7.0 mdurl 0.1.2 mistune 2.0.5 ml-dtypes 0.2.0 mlagents-envs 0.30.0 moderngl 5.8.1 monotonic 1.6 more-itertools 8.10.0 moviepy 1.0.3 mpmath 1.2.1 msgpack 1.0.5 mujoco 2.3.5 mujoco-py 2.1.2.14 munch 2.5.0 nbclassic 0.5.6 nbclient 0.7.4 nbconvert 7.3.1 nbformat 5.8.0 nes-py 8.2.1 nest-asyncio 1.5.6 netifaces 0.11.0 networkx 3.0 notebook 6.5.4 notebook_shim 0.2.3 numpy 1.22.4 nvidia-cublas-cu11 11.11.3.6 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvcc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.8.1.3 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nvjitlink-cu12 12.1.105 oauthlib 3.2.0 opencv-contrib-python 4.7.0.72 opencv-python 4.7.0.72 opencv-python-headless 4.7.0.72 opt-einsum 3.3.0 optax 0.1.5 optree 0.9.1 orbax-checkpoint 0.2.6 packaging 23.0 pandas 2.0.0 pandocfilters 1.5.0 parallel-execute 0.1.1 parso 0.8.3 pathtools 0.1.2 PettingZoo 1.22.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.3.0 pip 23.1.2 platformdirs 3.1.1 pluggy 1.0.0 portalocker 2.7.0 pretrainedmodels 0.7.4 procgen 0.10.7 proglog 0.1.10 prometheus-client 0.16.0 prompt-toolkit 3.0.38 protobuf 3.20.3 psutil 5.9.4 ptyprocess 0.7.0 pure-eval 0.2.2 py 1.11.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycparser 2.21 pygame 2.3.0 pygifsicle 1.0.7 pyglet 1.5.21 Pygments 2.14.0 PyGObject 3.42.1 PyJWT 2.3.0 pynput 1.7.6 PyOpenGL 3.1.6 pyparsing 2.4.7 pyrsistent 0.19.3 pytest 7.0.1 python-apt 2.4.0+ubuntu1 python-dateutil 2.8.2 python-json-logger 2.0.7 python-slugify 8.0.1 python-xlib 0.33 pytimeparse 1.1.8 pytinyrenderer 0.0.14 pytorch-triton 2.1.0+440fd1bf20 pytz 2022.7.1 PyWavelets 1.4.1 PyYAML 
5.4.1 pyzmq 25.0.2 qtconsole 5.4.2 QtPy 2.3.1 qudida 0.0.4 ray 2.5.0 requests 2.28.1 requests-oauthlib 1.3.1 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.3.2 rsa 4.9 scikit-image 0.20.0 scikit-learn 1.2.2 scipy 1.10.1 SecretStorage 3.3.1 segmentation-models-pytorch 0.3.2 Send2Trash 1.8.2 sentry-sdk 1.17.0 setproctitle 1.3.2 setuptools 59.6.0 six 1.16.0 sklearn 0.0.post1 smmap 5.0.0 sniffio 1.3.0 soupsieve 2.4.1 stack-data 0.6.2 support-developer 1.0.5 swig 4.1.1 sympy 1.11.1 systemd-python 234 tensorboard 2.12.0 tensorboard-data-server 0.7.0 tensorboard-plugin-wit 1.8.1 tensorboardX 2.6 tensorflow-probability 0.20.1 tensorstore 0.1.38 termcolor 2.2.0 terminado 0.17.1 text-unidecode 1.3 threadpoolctl 3.1.0 tifffile 2023.3.15 timm 0.6.12 tinycss2 1.2.1 tomli 2.0.1 tomlkit 0.11.6 toolz 0.12.0 torch 2.1.0.dev20230619+cu121 torch-tb-profiler 0.4.1 torchaudio 2.1.0.dev20230619+cu121 torchvision 0.16.0.dev20230619+cu121 tornado 6.2 tqdm 4.65.0 traitlets 5.9.0 treevalue 1.4.10 trimesh 3.9.35 triton 2.0.0 types-protobuf 4.22.0.0 typing_extensions 4.4.0 tzdata 2023.3 ubuntu-advantage-tools 8001 ufw 0.36.1 unattended-upgrades 0.1 uri-template 1.2.0 urllib3 1.26.13 vec-noise 1.1.4 virtualenv 20.21.0 wadllib 1.3.6 wandb 0.14.0 wcwidth 0.2.6 webcolors 1.13 webencodings 0.5.1 websocket-client 1.5.1 Werkzeug 2.3.6 wheel 0.37.1 widgetsnbextension 4.0.5 wrapt 1.15.0 zipp 1.0.0
Edit: (mattip) formatting
You are using a NVIDIA GeForce RTX 4090 which demands a lot of power. Perhaps your power supply or a connection cable is not up to the task and fails when the GPU is fully loaded. Can you run other GPU intensive programs successfully?
I tried it on the RTX 1070 machine and got the same error. I can also run this code: https://github.com/sail-sg/envpool/blob/aacf06f694ead2eb75331f085f00dad71eec1a08/examples/cleanrl_examples/ppo_atari_envpool.py#L211. After changing some of it, I can solve Pong in 59 s (20 consecutive average scores greater than 17). Does that count as a GPU-intensive program?
I am not sure I understand. You ran the script above, without any changes (using @ray.remote(num_cpus = 1,num_gpus=1/24)), on two different machines, using windows and ubuntu22.04 and windows-wsl, and it crashed on all of the different machines? It did not run successfully on any machine/os you tried?
Exception occurred: RayTaskError(RuntimeError)
ray::f() (pid=3686, ip=172.28.246.219)
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 48, in f
    torch.rand(1, 1).to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
  File "/home/xzpwsl2/my/work/rlFrame/rl_frame/jorldy/raytest.py", line 54, in
I have two machines in total. One runs Windows 10 with WSL2 and an RTX 1070; the other runs Windows 11 with WSL2 and a 4090. On each I ran this program with both the Windows Python and the WSL2 Python. So far, only one attempt to use the GPU has succeeded: 1/24 on ray 2.3.1. All other attempts have failed. I am now trying ray 2.5.1 again, and it has failed again.
I also mentioned the successful run earlier in this issue.
My successful run has a much smaller list of packages installed. Could you try building a new virtualenv as I did at the start of the report above, installing only what is needed to run your script (ray and torch)? Perhaps one of the additional packages is messing up the environment. Here is my pip list:
(ray_throwaway) d:\pypy_stuff>pip list
Package Version
------------------ -----------
aiosignal 1.3.1
attrs 23.1.0
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.3
colorama 0.4.6
filelock 3.9.0
frozenlist 1.3.3
grpcio 1.51.3
idna 3.4
Jinja2 3.1.2
jsonschema 4.17.3
MarkupSafe 2.1.2
mpmath 1.2.1
msgpack 1.0.5
networkx 3.0
numpy 1.25.0
packaging 23.1
pip 22.2.2
protobuf 4.23.3
pyrsistent 0.19.3
PyYAML 6.0
ray 2.5.1
requests 2.31.0
setuptools 63.2.0
sympy 1.11.1
torch 2.0.1+cu117
typing_extensions 4.4.0
urllib3 2.0.3
I don't think the packages are interfering with each other, because my computer is brand new and this Ray program was the first project on it. I only wrote this test program by hand after discovering the out-of-memory error. I think you should focus on testing the compatibility between the newer CUDA and Python versions and Ray.
I think you should focus on trying out the compatibility between the new Cuda Python version and ray
I need your help, since only on your machines does the crash happen. Can you tell me what happens if you do these steps:
>python -m venv \temp\ray_throwaway
>\temp\ray_throwaway\Scripts\activate
(ray_throwaway) python -m pip install torch --index-url https://download.pytorch.org/whl/cu117
(ray_throwaway) python -m pip install ray==2.5.1
(ray_throwaway) python test_ray.py
CUDA_VISIBLE_DEVICES 0
2023-06-29 20:07:35,238 INFO worker.py:1636 -- Started a local Ray instance.
(f pid=29984) @ray.remote里: cuda 2
(f pid=29984) tensor([[0.9769]], device='cuda:0')
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
用时16.433708429336548
On Windows 11 with the 4090, the cu117 build succeeds, and on Windows 10 with the 1070, cu117 also succeeds. The error seems to come from the torch cu118 and cu121 wheels.
It seems has error with torch cuda 118 and cuda121 WHL
Thanks, that is progress!
Edit: I think I can try the newer wheels from https://pytorch.org/get-started/locally/
On Windows, Python succeeds with the stable cu118 and cu117 builds. But on WSL2, cu117, cu118 and cu121 all still give the CUDA out-of-memory error. What can I do to solve it on WSL2?
I made a new Python .venv on WSL2, and it also hits the CUDA memory error. In addition, the running speed of https://github.com/sail-sg/envpool/blob/aacf06f694ead2eb75331f085f00dad71eec1a08/examples/cleanrl_examples/ppo_atari_envpool.py#L211 has dropped significantly, because until today I was using the nightly build, which is much faster than the stable release. In the vast majority of cases I work in WSL2, because envpool currently only has a Linux version.
Some additional information: my computer has 2 x 32 GB of memory, but only 32 GB is recognized?
> ray status
Resources
---------------------------------------------------------------
Usage:
32.0/32.0 CPU
0.9984/1.0 GPU
0B/32.10GiB memory
2.70MiB/16.05GiB object_store_memory
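One possible reading of the 0.9984/1.0 GPU figure, assuming Ray tracks resource amounts to four decimal places (an assumption here, not something this thread confirms): 1/32 is 0.03125, which would be stored as 0.0312, and 32 running tasks would then account for 0.9984 of the GPU. The constant and helper names below are made up for the illustration.

```python
# Assumption for illustration: resources are tracked in units of 1/10000.
UNITS_PER_RESOURCE = 10_000


def to_units(amount):
    """Convert a fractional resource request to whole units, truncating."""
    return int(amount * UNITS_PER_RESOURCE)


def from_units(units):
    """Convert whole units back to a fractional resource amount."""
    return units / UNITS_PER_RESOURCE


per_task_units = to_units(1 / 32)     # 0.03125 becomes 312 units
in_use_units = 32 * per_task_units    # 32 concurrent tasks
print(from_units(per_task_units), from_units(in_use_units))
```

If that reading is right, the status output says that 32 tasks, one per CPU, were holding GPU shares at the same moment.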
I updated Ray to 2.6.0. The problem still exists on WSL2.
I also found that with Windows Python, ray.init(num_gpus=1) works but ray.init(num_gpus=-1) does not. The relevant values are:
@ray.remote(num_cpus=1,num_gpus=oneGpuNeed*1)
torch.cuda.device_count()=1
str(list(range(localGpu_num))).strip('[]') = 0
os.environ['CUDA_VISIBLE_DEVICES'] = 0
Can this problem be solved urgently? It is a reproducible crash bug that affects all Windows computers using WSL.
This is happening in WSL2? If you use Windows (not WSL2) it does not happen?
Hi, I'm not sure if this is fixed or was just closed due to inactivity. I'm also running in WSL2.
My code:
>>> import ray
>>> ray.init()
2023-12-20 03:25:04,767 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.12', ray_version='2.8.1', ray_commit='82a8df138fe7fcc5c42536ebf26e8c3665704fee', protocol_version=None)
>>> ray.get_gpu_ids()
[]
Your test script:
import ray

@ray.remote(num_gpus=1)
def f():
    import os
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

print(ray.get(f.remote()))
Result of test script:
2023-12-20 03:30:24,028 INFO worker.py:1673 -- Started a local Ray instance.
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +6s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +41s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m16s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
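A message like this means the request asks for something the cluster does not advertise at all. A quick way to see which resource is missing is to compare the request against the cluster totals. The helper below is a standard-library sketch with a hypothetical name. In a live session the first argument could be the dictionary returned by ray.cluster_resources(); the sample totals here are made up to represent a node on which no GPU was detected.

```python
def unsatisfiable(cluster_resources, request):
    """Return {name: (requested, available)} for every requested resource
    that exceeds the total the cluster advertises."""
    short = {}
    for name, amount in request.items():
        available = cluster_resources.get(name, 0.0)
        if amount > available:
            short[name] = (amount, available)
    return short


# Hypothetical totals for a node where Ray did not detect any GPU.
cluster = {"CPU": 24.0, "memory": 16e9}
print(unsatisfiable(cluster, {"CPU": 1.0, "GPU": 1.0}))
```

If GPU is absent from the totals even though nvidia-smi lists the cards, the problem is in GPU detection at ray.init() time rather than in the task definition.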
Nvidia-smi
Wed Dec 20 03:37:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.92       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:0A:00.0  On |                  N/A |
|  0%   43C    P8               35W / 350W|   1202MiB / 24576MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:0B:00.0 Off |                  N/A |
|  0%   30C    P8                8W / 350W|     47MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        34      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        34      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
What happened + What you expected to happen
I set the number of GPUs for Ray, but Ray cannot use the GPU.
ray.get_gpu_ids() will always return the empty list when called from the driver. This is because Ray does not manage GPU allocations to the driver process.
If I use tensor.to("cuda"), it raises RuntimeError: No CUDA GPUs are available:
File "/home/xzpwsl2/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
Versions / Dependencies
ray 2.3.1, torch 2.0, Windows and Windows WSL
Reproduction script
Issue Severity
High: It blocks me from completing my task.