Hmm this seems like it is coming from gRPC. Do you have any special setup in your cluster?
I followed the "Local On Premise Cluster (List of nodes)" setup procedure; nothing special, and most of the time it works just fine.
ray 1.10.0, in image xxxx.com/xxx/ray:1.10.0
Here is the ray.yaml:
cluster_name: default
docker:
    image: "xxxx.com/xxx/ray:1.10.0" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --shm-size=5gb
        # Ray's remote SSH command execution cannot tolerate warnings. The locale in Ray's docker
        # image differs from the one on the nodes, so remotely executed SSH commands report a locale
        # mismatch. Here we set the head node container's locale to en_US.UTF-8, the locale the
        # ordinary nodes commonly use.
        - --env=LC_ALL=en_US.UTF-8
        - --env=LANG=en_US.UTF-8
        - --env=TZ=Asia/Shanghai
        - -v /home/ray/work:/tmp
        - -v /home/ray/logs:/logs
        - -v /mnt/ray_train:/data
        - -v /mnt/ray_train/train/envs:/home/ray/anaconda3/envs
provider:
    type: local
    head_ip: 192.168.10.128
    worker_ips: [192.168.10.191, 192.168.10.202, 192.168.11.20, 192.168.11.21, 192.168.11.22, 192.168.11.23, 192.168.11.24, 192.168.11.25, 192.168.11.26, 192.168.11.27]
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: root
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    # (1) The machine on which `ray up` is executed.
    # (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to the head node.
    ssh_private_key: ~/.ssh/id_rsa
min_workers: 10
max_workers: 10
upscaling_speed: 1.0
idle_timeout_minutes: 0.5
file_mounts: {
}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
# List of shell commands to run to set up each node.
setup_commands: []
# If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
# work environment on each worker by:
# 1. making sure each worker has access to this file i.e. see the `file_mounts` section
# 2. adding a command here that creates a new conda environment on each node or if the environment already exists,
# it updates it:
# conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
#
# Ray developers:
# you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" tag (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - conda activate my_venv && ray stop
    # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - conda activate ray && ray stop
    # - conda activate ray && ulimit -c unlimited && RAY_BACKEND_LOG_LEVEL=debug ray start --verbose --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --num-cpus 0 --include-dashboard true --object-store-memory=4294967296
    - conda activate ray && ulimit -c unlimited && ray start --verbose --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --num-cpus 0 --include-dashboard true --object-store-memory=4294967296
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - conda activate my_venv && ray stop
    # - ray start --address=$RAY_HEAD_IP:6379
    - conda activate ray && ray stop
    # - conda activate ray && RAY_BACKEND_LOG_LEVEL=debug ray start --address=$RAY_HEAD_IP:6379 --object-store-memory=4294967296
    - conda activate ray && ray start --address=$RAY_HEAD_IP:6379 --object-store-memory=4294967296
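For completeness, a config like this is driven with the standard Ray cluster launcher CLI from the machine holding the file (a minimal sketch, not part of the original report; assumes the file is saved as ray.yaml and the `ray` conda env activates inside the container):

ray up ray.yaml                  # SSH to the head node, start it, then bring up the 10 workers
ray exec ray.yaml 'ray status'   # run inside the head container to confirm all nodes joined
ray down ray.yaml                # tear the cluster down when finished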
Python packages installed in the conda env `ray` (see `worker_start_ray_commands`; Ray is started inside a conda env named `ray`). The worker nodes and the head node share this same conda env over NFS:
aiohttp 3.7.4.post0
aiohttp-cors 0.7.0
aioredis 1.3.1
aiosignal 1.2.0
ansi2html 1.6.0
argcomplete 1.12.3
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
async-timeout 3.0.1
asynctest 0.13.0
attrs 21.2.0
backcall 0.2.0
bleach 4.1.0
blessed 1.19.0
Brotli 1.0.9
cached-property 1.5.2
cachetools 4.2.4
certifi 2021.10.8
cffi 1.15.0
chardet 4.0.0
charset-normalizer 2.0.7
click 7.1.2
colorful 0.5.4
conda-pack 0.6.0
cryptography 3.2.1
cvxopt 1.2.7
cvxpy 1.0.25
cycler 0.11.0
dash 2.0.0
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
debugpy 1.5.1
decorator 5.1.0
defusedxml 0.7.1
Deprecated 1.2.13
dill 0.3.4
ecos 2.0.7.post1
entrypoints 0.3
filelock 3.3.2
Flask 2.0.2
Flask-Compress 1.10.1
fonttools 4.29.0
frozenlist 1.2.0
google-api-core 2.2.2
google-auth 2.3.3
googleapis-common-protos 1.53.0
gpustat 1.0.0b1
GPUtil 1.4.0
graphviz 0.19.1
grpcio 1.41.1
h5py 3.6.0
hdf5plugin 3.2.0
hiredis 2.0.0
idna 3.3
importlib-metadata 4.8.2
importlib-resources 5.4.0
ipykernel 6.4.1
ipympl 0.8.7
ipython 7.29.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
itsdangerous 2.0.1
jedi 0.18.0
Jinja2 3.0.3
jsonpickle 2.0.0
jsonschema 4.2.1
jupyter-client 7.0.6
jupyter-core 4.9.1
jupyter-dash 0.4.0
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.2
kiwisolver 1.3.2
Logbook 1.5.3
lru-dict 1.1.6
MarkupSafe 2.0.1
matplotlib 3.5.1
matplotlib-inline 0.1.2
mistune 0.8.4
msgpack 1.0.2
multidict 5.2.0
multiprocess 0.70.12.2
nbclient 0.5.10
nbconvert 6.4.0
nbformat 5.1.3
nest-asyncio 1.5.1
notebook 6.4.7
numexpr 2.7.3
numpy 1.21.4
nvidia-ml-py3 7.352.0
opencensus 0.8.0
opencensus-context 0.1.2
orjson 3.6.4
osqp 0.6.2.post0
packaging 21.3
pandas 1.3.4
pandocfilters 1.5.0
parso 0.8.2
patsy 0.5.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.4.0
pip 21.2.2
plotly 5.4.0
prometheus-client 0.12.0
prompt-toolkit 3.0.20
protobuf 3.19.1
psutil 5.8.0
ptyprocess 0.7.0
py-spy 0.3.10
pyarrow 6.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
Pygments 2.10.0
PyJWT 1.7.1
pymongo 4.0.1
PyMySQL 1.0.2
pyparsing 3.0.7
pyrsistent 0.18.0
python-dateutil 2.8.2
python-rapidjson 1.5
pytz 2021.3
PyYAML 6.0
pyzmq 22.2.1
qdldl 0.1.5.post0
ray 1.10.0
redis 3.5.3
requests 2.26.0
retrying 1.3.3
rqalpha 4.7.0
rqalpha-mod-convertible 1.2.11
rqalpha-mod-fund 0.0.6
rqalpha-mod-incremental 0.0.5a1
rqalpha-mod-optimizer2 1.0.6
rqalpha-mod-option 1.1.14
rqalpha-mod-ricequant-data 2.3.4
rqalpha-mod-rqfactor 1.0.10
rqalpha-mod-spot 1.0.8
rqalpha-plus 4.1.23
rqamsc 0.0.2.post8
rqdatac 2.9.42
rqdatac-fund 1.0.24
rqfactor 1.2.0
rqoptimizer 1.2.14
rqoptimizer2 1.2.14
rqrisk 0.0.14
rqsdk 1.3.11
rsa 4.7.2
scipy 1.7.3
scs 2.1.1.post2
Send2Trash 1.8.0
setuptools 58.0.4
simplejson 3.17.6
six 1.16.0
SQLAlchemy 1.3.24
statsmodels 0.12.1
TA-Lib 0.4.17
tables 3.6.1
tabulate 0.8.9
tenacity 8.0.1
terminado 0.12.1
testpath 0.5.0
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
typing-extensions 3.10.0.2
urllib3 1.26.7
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.0.2
wheel 0.37.0
widgetsnbextension 3.5.2
wrapt 1.13.3
xgboost 1.5.0
yarl 1.7.2
zipp 3.6.0
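Given the gRPC suspicion above, a minimal sanity check (a hedged suggestion, not from the original thread) is to confirm that every node loads the same grpcio build from the shared NFS env:

conda activate ray && python -c "import grpc; print(grpc.__version__)"   # expect 1.41.1, matching the list above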
How often do you see this? Also cc @scv119
Only once
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Search before asking
Ray Component
Ray Core
What happened + What you expected to happen
The raylet on a worker node crashed with a SIGSEGV signal.
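A hedged note on where to look, based on the config above: `ulimit -c unlimited` is set only in head_start_ray_commands, not in worker_start_ray_commands, so the crashing worker may not have written a core dump; the raylet's own logs are the first stop. Ray writes them under /tmp/ray/session_latest/logs by default (which, with the -v /home/ray/work:/tmp mount in this config, should land under /home/ray/work on the host):

tail -n 100 /tmp/ray/session_latest/logs/raylet.err   # raylet stderr; segfault backtraces usually land here
ls /tmp/ray/session_latest/logs/                      # raylet.out, gcs_server.err, and other per-process logs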
Versions / Dependencies
ray 1.10.0
Reproduction script
I have no idea how to reproduce this.
Anything else
No response
Are you willing to submit a PR?