Closed mlguruz closed 2 years ago
that's a bit odd. looks raylet is crash looping. can you share what's the content of raylet.*.out ?
also you can add RAY_BACKEND_LOG_LEVEL=debug
before ray start to show debug logs.
- RAY_BACKEND_LOG_LEVEL=debug ray start --address=$RAY_HEAD_IP:6379 #--metrics-export-port=6666 --dashboard-agent-grpc-port=7777
Thanks! Let me incorporate all these and come back w/ what I see on my end.
also you can add
RAY_BACKEND_LOG_LEVEL=debug
before ray start to show debug logs.- RAY_BACKEND_LOG_LEVEL=debug ray start --address=$RAY_HEAD_IP:6379 #--metrics-export-port=6666 --dashboard-agent-grpc-port=7777
Hi Chen,
Sorry for the late reply.
After quite a bit of local debugging, I found out it was related to my firewall port issues.
I was seeing those endless raylet restarting logs b/c the firewall on the head node was blocking some of the worker ports.
Once I fixed it, I no longer saw the endless failure logs.
I still occasionally saw raylet failure sometime, but it was kinda flaky and I couldn't replicate at the moment.
The raylet logs, whether it be *out
or *err
, are not helpful for debugging this.
Maybe you guys could improve that a little bit?
I'm happy to close this btw.
Search before asking
Ray Component
Ray Core
Issue Severity
High: It blocks me to complete my task.
What happened + What you expected to happen
I'm trying to run ray on local on-prem cluster with docker. After fighting through lots of other issues, I'm now at a stage where I'm able to see my head and worker nodes. However, I could not run jobs on the worker node, and one issue I keep getting is that I'm seeing constant failures of raylets on worker node b/c of segmentation faults. I put the step-by-step details below. I tried with and without docker, and this doesn't seem to be b/c of docker.
Because the log from c++ side isn't very helpful, it's very hard for me to figure out where these segfaults are coming from... Really appreciate your help!
Versions / Dependencies
Ray == 1.11.0 Python == 3.7.7 OS = ubuntu20.04
Reproduction script
Run
ray up cluster.yaml
, and then init ray withThe
cluster.yaml
is the following:And then, I start to see seg fault logs coming from my worker node on ip 192.168.1.153.
Examples are like below:
This is what I see from the log dir on worker node (i.e. ip = 192.158.1.153)
The err logs are not very helpful. Examples:
I tried this w/ and w/o docker, and I got the same errors.
Can someone please help? Thank you so much!
Anything else
No response
Are you willing to submit a PR?