addisonklinke opened this issue 2 years ago · Status: Open
I see the same message at one point, along with 0% GPU utilization, when using APEX_DDPG-torch on a 4 GPU / 128 CPU node. After a few sampling iterations showing RUNNING (with work visible on the CPUs via htop), the run crashes with a Dequeue timeout.
@ericl @rkooo567 Considering this a Core issue with a Tune-based reproduction.
Btw, the default timeout is 30 seconds, so you should experiment with values like 60.
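For anyone trying this, here is a minimal sketch of one way to set it, assuming you control the script that calls `ray.init` (the value 60 is just a starting point, not a recommendation):

```python
import os
import ray

# Sketch only: raise the worker register timeout before Ray starts.
# The default is 30 seconds; 60 is a first value to experiment with.
# Set it in the environment before ray.init() so the Ray processes it
# launches inherit the setting.
os.environ["RAY_worker_register_timeout_seconds"] = "60"

ray.init()
```

You can also export the variable in the shell before launching the script, which avoids touching the code.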
@iycheng maybe you can take a look at this? I think it could be related to our recent GCS changes, or there is a failure during worker initialization (which could also be related to recent changes).
Any progress on this? I'm having the same issue.
I'm experiencing a similar error in issue #25834
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
I am also having this issue. ray.init hangs forever. It happens about 4 out of 5 times; sometimes it works.
Looks like a P1. I'm putting this into the Core team backlog; let's discuss how to fix it.
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
I am trying to run the official tutorial for PyTorch Lightning. It works fine on a single GPU, but fails when more than one GPU is requested per trial.
This is on a single node/machine with 4 GPUs attached. Based on PyTorch Lightning's trainer, I would expect Ray to be able to distribute trials across all the available GPUs when they are requested as resources.
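For context, the relevant part of the script is just the per-trial resource request. A trimmed-down sketch (the trainable body and config values here are placeholders, not the tutorial's exact code):

```python
from ray import tune

def train_fn(config):
    # Placeholder for the PyTorch Lightning training loop from the tutorial.
    # In the real script this builds a LightningModule and calls trainer.fit().
    tune.report(loss=0.0)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    # Requesting more than one GPU per trial is what triggers the failure;
    # on a 4-GPU node this should allow up to two concurrent trials.
    resources_per_trial={"cpu": 8, "gpu": 2},
)
```

With `"gpu": 1` the same script runs fine; anything larger reproduces the failure.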
Versions / Dependencies
System
requirements.txt
Reproduction script
tutorial.py
Anything else
Based on this discussion post, I tried setting the environment variable RAY_worker_register_timeout_seconds, but it does not fix the issue.
cc @ericl @rkooo567 @iycheng (from the request on #8890)
Are you willing to submit a PR?