yutaizhou opened this issue 4 years ago
@micafan it looks like the check is failing here and you were the last to edit this file. Can you take a look?
I'd like to emphasize that not all worker nodes fail. The most I have gotten spun up is 7 (out of 16 total nodes).
From what I understand, this is due to the GCS server and Redis dying at different times. (Their lifetimes were expected to be the same.) I'm not too familiar with the problem though. CC @wumuzi520 perhaps?
I didn't add this line, but I think the line is reasonable: you shouldn't use RedisAsyncContext after you have called the method ResetRawRedisAsyncContext. RedisAsyncContext is only a wrapper around redisAsyncContext, and ResetRawRedisAsyncContext resets the member pointer redis_async_context_ to nullptr.
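To make that concrete, here is a minimal C++ sketch of the wrapper pattern being described. It is illustrative only, not the actual Ray source; the GetRawRedisAsyncContext accessor and the assert are stand-ins for whatever check fires in the real code.

```cpp
#include <cassert>

#include <hiredis/async.h>  // defines redisAsyncContext

// Illustrative sketch of the wrapper described above (not Ray's actual code).
class RedisAsyncContext {
 public:
  explicit RedisAsyncContext(redisAsyncContext *context)
      : redis_async_context_(context) {}

  // After this call the wrapper no longer holds a raw context.
  void ResetRawRedisAsyncContext() { redis_async_context_ = nullptr; }

  // Any use of the wrapper after the reset hits this check (or would
  // dereference a null pointer), which is the kind of failure in the report.
  redisAsyncContext *GetRawRedisAsyncContext() {
    assert(redis_async_context_ != nullptr &&
           "RedisAsyncContext used after ResetRawRedisAsyncContext");
    return redis_async_context_;
  }

 private:
  redisAsyncContext *redis_async_context_;  // wrapped hiredis context
};
```

In other words, any code path that still holds the wrapper and calls into it after the reset will trip the check, which matches the failed check reported above.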
@micafan is there a bug in the code?
Also, what should the user do in this case?
I am not familiar with this test case, but it seems GCSClient has already disconnected from Redis, so the crash here isn't unexpected. GCSClient will support reconnecting to Redis in the future; by then the problem will be solved.
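Until that reconnect support lands, the idea is conceptually just a retry loop around establishing a fresh connection. The sketch below uses the plain hiredis synchronous API purely to illustrate the shape of such a loop; it is not how Ray's GCSClient is (or will be) implemented, and the host, port, and retry count are made-up values.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

#include <hiredis/hiredis.h>

// Illustrative only: keep trying to open a fresh connection to Redis,
// sleeping between attempts, until it succeeds or we give up.
redisContext *ConnectWithRetry(const char *host, int port, int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    redisContext *context = redisConnect(host, port);
    if (context != nullptr && context->err == 0) {
      return context;  // connected
    }
    if (context != nullptr) {
      std::fprintf(stderr, "attempt %d failed: %s\n", attempt, context->errstr);
      redisFree(context);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return nullptr;  // caller must handle a permanently unreachable Redis
}

int main() {
  // Hypothetical address; in the reported setup this would be the head
  // node's Redis address passed to `ray start`.
  redisContext *context = ConnectWithRetry("127.0.0.1", 6379, 5);
  if (context == nullptr) {
    std::fprintf(stderr, "could not reconnect to Redis\n");
    return 1;
  }
  redisFree(context);
  return 0;
}
```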
Is there anything that a user can do aside from waiting for that support? I'd like to run some experiments for an HPC conference paper with 64 nodes (the current limit is 16, but I plan to ask for an extension from our HPC team), but if Ray can only spin up 7 out of the 16 I have, it's hard to be optimistic and think "well, that's almost half of the allocated nodes, so with 64 as the limit, I should be able to get around 30".
@yutaizhou do you have any logs of redis? (i.e., all of the logs in /tmp/ray/session_latest/logs would be helpful here)
@micafan, I don't think that is the full story here. The fact that some nodes are able to connect means that Redis/GCS is still alive, right? So why are only some of the nodes failing with that error?
@richardliaw I actually get nothing. The $tmpdir that I pass here is empty, for both the nodes that spun up and those that didn't.
srun --nodes=1 --ntasks=1 -w $node1 unset.sh && ray start --temp-dir=$tmpdir --block --head --redis-port=$port &
/tmp/ray/session_latest/logs does not get created, presumably because I already pass in $tmpdir.
@yutaizhou This isn't officially documented (so it might change or stop working in the future), but at the moment it should work: could you try running Ray after setting the environment variable GLOG_logtostderr=1? It should print everything to the terminal.
What is the problem?
When calling ray start on worker nodes in a SLURM system, not all worker nodes start properly. (In my experience, at most 7 nodes have started properly, while the most I am allowed to request is 16, as set by my account limit.)
Here is the output from a raylet ERROR/FATAL file (both kinds of files contain the same thing):
Ray version and other system information (Python version, TensorFlow version, OS): Python 3.6.10, Ray 0.8.4
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
submit.sh
test.py
If we cannot run your script, we cannot fix your issue.