ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] raylet SIGSEGV #22377

Closed newmanwang closed 2 years ago

newmanwang commented 2 years ago

Ray Component

Ray Core

What happened + What you expected to happen

The raylet on a worker node crashed with a SIGSEGV signal:

[2022-02-13 11:46:40,549 E 149 189] logging.cc:317: *** SIGSEGV received at time=1644724000 on cpu 5 *** 
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317: PC: @     0x5581e30bd2e6  (unknown)  google::protobuf::MessageLite::ParseFromZeroCopyStream()
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x7fcb2713d980       1504  (unknown)
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x5581e29943e1        256  grpc::GenericDeserialize<>()
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x5581e299f383         96  grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x5581e2ce716a         64  grpc::CompletionQueue::AsyncNextInternal()
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x5581e2bc03c4        160  ray::rpc::GrpcServer::PollEventsFromCompletionQueue()
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @     0x5581e311d060  (unknown)  execute_native_thread_routine 
[2022-02-13 11:46:40,549 E 149 189] logging.cc:317:     @ ... and at least 3 more frames
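For triage, the frames in a trace like the one above can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the regular expression is an assumption inferred only from the lines shown here, and `parse_frames` is a hypothetical helper name.

```python
import re

# Log lines copied from the trace above.
PREFIX = "[2022-02-13 11:46:40,549 E 149 189] logging.cc:317: "
TRACE = "\n".join(PREFIX + frame for frame in [
    "*** SIGSEGV received at time=1644724000 on cpu 5 ***",
    "PC: @     0x5581e30bd2e6  (unknown)  google::protobuf::MessageLite::ParseFromZeroCopyStream()",
    "    @     0x7fcb2713d980       1504  (unknown)",
    "    @     0x5581e29943e1        256  grpc::GenericDeserialize<>()",
    "    @     0x5581e299f383         96  grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()",
    "    @     0x5581e2ce716a         64  grpc::CompletionQueue::AsyncNextInternal()",
    "    @     0x5581e2bc03c4        160  ray::rpc::GrpcServer::PollEventsFromCompletionQueue()",
    "    @     0x5581e311d060  (unknown)  execute_native_thread_routine",
    "    @ ... and at least 3 more frames",
])

# Assumed frame layout: "@ <address> <stack size or (unknown)> <symbol>".
FRAME_RE = re.compile(
    r"@\s+(?P<addr>0x[0-9a-f]+)\s+(?P<size>\d+|\(unknown\))\s+(?P<symbol>.+?)\s*$"
)


def parse_frames(text):
    """Return (address, symbol) pairs for every line that looks like a stack frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("addr"), match.group("symbol")))
    return frames


frames = parse_frames(TRACE)
# Frames whose symbols belong to the gRPC or protobuf namespaces.
rpc_frames = [sym for _, sym in frames if sym.startswith(("grpc::", "google::protobuf::"))]

for addr, sym in frames:
    print(addr, sym)
```

Of the frames that parse, the faulting one and the three resolved frames directly above it in the call chain are in `google::protobuf` or `grpc`. The trace is only the visible part of the stack, so this narrows the search rather than identifying a root cause.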

Versions / Dependencies

ray 1.10.0

Reproduction script

I have no idea how to reproduce this.

Anything else

No response

rkooo567 commented 2 years ago

Hmm this seems like it is coming from gRPC. Do you have any special setup in your cluster?

newmanwang commented 2 years ago

I followed the "Local On Premise Cluster (List of nodes)" setup procedure; nothing special. Most of the time it works just fine.

Ray 1.10.0 runs from the image xxxx.com/xxx/ray:1.10.0.

Here is the ray.yaml:

cluster_name: default

docker:
    image: "xxxx.com/xxx/ray:1.10.0" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --shm-size=5gb
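        # Ray's remote command execution over SSH cannot tolerate warnings. The locale of the Ray Docker image differs
        # from the locale on the nodes, so remote SSH commands report a locale mismatch. Here the head node's Docker
        # container is set to en_US.UTF-8, the locale commonly used on the regular nodes.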
        - --env=LC_ALL=en_US.UTF-8
        - --env=LANG=en_US.UTF-8
        - --env=TZ=Asia/Shanghai
        - -v /home/ray/work:/tmp
        - -v /home/ray/logs:/logs
        - -v /mnt/ray_train:/data
        - -v /mnt/ray_train/train/envs:/home/ray/anaconda3/envs

provider:
    type: local
    head_ip: 192.168.10.128

    worker_ips: [192.168.10.191, 192.168.10.202, 192.168.11.20, 192.168.11.21, 192.168.11.22, 192.168.11.23, 192.168.11.24, 192.168.11.25, 192.168.11.26, 192.168.11.27]

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: root
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 10

max_workers: 10

upscaling_speed: 1.0

idle_timeout_minutes: 0.5

file_mounts: {
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - conda activate ray && ray stop
  #  - conda activate ray && ulimit -c unlimited && RAY_BACKEND_LOG_LEVEL=debug ray start --verbose --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml  --dashboard-host 0.0.0.0 --num-cpus 0 --include-dashboard true --object-store-memory=4294967296
    - conda activate ray && ulimit -c unlimited && ray start --verbose --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml  --dashboard-host 0.0.0.0 --num-cpus 0 --include-dashboard true --object-store-memory=4294967296

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - conda activate ray && ray stop
  #  - conda activate ray && RAY_BACKEND_LOG_LEVEL=debug ray start --address=$RAY_HEAD_IP:6379 --object-store-memory=4294967296
    - conda activate ray && ray start --address=$RAY_HEAD_IP:6379 --object-store-memory=4294967296
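A few numbers in this config can be cross-checked by hand. The sketch below is illustrative only: the values are copied from the ray.yaml above, but the flat dict layout and the `summarize` helper are assumptions for this example, not how Ray loads or validates its config. It assumes that, for a `local` provider with a fixed node list, `min_workers` and `max_workers` would normally equal the number of `worker_ips`.

```python
# Values copied from the ray.yaml above. The flat dict is a hypothetical
# stand-in for the parsed file, chosen to keep the example dependency-free.
config = {
    "min_workers": 10,
    "max_workers": 10,
    "worker_ips": [
        "192.168.10.191", "192.168.10.202",
        "192.168.11.20", "192.168.11.21", "192.168.11.22", "192.168.11.23",
        "192.168.11.24", "192.168.11.25", "192.168.11.26", "192.168.11.27",
    ],
    # From --object-store-memory=4294967296 in the start commands.
    "object_store_memory": 4294967296,
}


def summarize(cfg):
    """Report worker-count consistency and the object store size in GiB."""
    worker_count = len(cfg["worker_ips"])
    return {
        "worker_count": worker_count,
        "workers_match": cfg["min_workers"] == cfg["max_workers"] == worker_count,
        "object_store_gib": cfg["object_store_memory"] / 2**30,
    }


summary = summarize(config)
print(summary)
```

On these values the worker counts agree with the ten listed IPs, and the object store setting works out to 4 GiB per node.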

Python packages installed in the conda env "ray" (see worker_start_ray_commands: Ray is started inside the conda env named ray). The worker nodes and the head node share this same conda env over NFS:

aiohttp                    3.7.4.post0
aiohttp-cors               0.7.0
aioredis                   1.3.1
aiosignal                  1.2.0
ansi2html                  1.6.0
argcomplete                1.12.3
argon2-cffi                21.3.0
argon2-cffi-bindings       21.2.0
async-timeout              3.0.1
asynctest                  0.13.0
attrs                      21.2.0
backcall                   0.2.0
bleach                     4.1.0
blessed                    1.19.0
Brotli                     1.0.9
cached-property            1.5.2
cachetools                 4.2.4
certifi                    2021.10.8
cffi                       1.15.0
chardet                    4.0.0
charset-normalizer         2.0.7
click                      7.1.2
colorful                   0.5.4
conda-pack                 0.6.0
cryptography               3.2.1
cvxopt                     1.2.7
cvxpy                      1.0.25
cycler                     0.11.0
dash                       2.0.0
dash-core-components       2.0.0
dash-html-components       2.0.0
dash-table                 5.0.0
debugpy                    1.5.1
decorator                  5.1.0
defusedxml                 0.7.1
Deprecated                 1.2.13
dill                       0.3.4
ecos                       2.0.7.post1
entrypoints                0.3
filelock                   3.3.2
Flask                      2.0.2
Flask-Compress             1.10.1
fonttools                  4.29.0
frozenlist                 1.2.0
google-api-core            2.2.2
google-auth                2.3.3
googleapis-common-protos   1.53.0
gpustat                    1.0.0b1
GPUtil                     1.4.0
graphviz                   0.19.1
grpcio                     1.41.1
h5py                       3.6.0
hdf5plugin                 3.2.0
hiredis                    2.0.0
idna                       3.3
importlib-metadata         4.8.2
importlib-resources        5.4.0
ipykernel                  6.4.1
ipympl                     0.8.7
ipython                    7.29.0
ipython-genutils           0.2.0
ipywidgets                 7.6.5
itsdangerous               2.0.1
jedi                       0.18.0
Jinja2                     3.0.3
jsonpickle                 2.0.0
jsonschema                 4.2.1
jupyter-client             7.0.6
jupyter-core               4.9.1
jupyter-dash               0.4.0
jupyterlab-pygments        0.1.2
jupyterlab-widgets         1.0.2
kiwisolver                 1.3.2
Logbook                    1.5.3
lru-dict                   1.1.6
MarkupSafe                 2.0.1
matplotlib                 3.5.1
matplotlib-inline          0.1.2
mistune                    0.8.4
msgpack                    1.0.2
multidict                  5.2.0
multiprocess               0.70.12.2
nbclient                   0.5.10
nbconvert                  6.4.0
nbformat                   5.1.3
nest-asyncio               1.5.1
notebook                   6.4.7
numexpr                    2.7.3
numpy                      1.21.4
nvidia-ml-py3              7.352.0
opencensus                 0.8.0
opencensus-context         0.1.2
orjson                     3.6.4
osqp                       0.6.2.post0
packaging                  21.3
pandas                     1.3.4
pandocfilters              1.5.0
parso                      0.8.2
patsy                      0.5.1
pexpect                    4.8.0
pickleshare                0.7.5
Pillow                     8.4.0
pip                        21.2.2
plotly                     5.4.0
prometheus-client          0.12.0
prompt-toolkit             3.0.20
protobuf                   3.19.1
psutil                     5.8.0
ptyprocess                 0.7.0
py-spy                     0.3.10
pyarrow                    6.0.1
pyasn1                     0.4.8
pyasn1-modules             0.2.8
pycparser                  2.21
Pygments                   2.10.0
PyJWT                      1.7.1
pymongo                    4.0.1
PyMySQL                    1.0.2
pyparsing                  3.0.7
pyrsistent                 0.18.0
python-dateutil            2.8.2
python-rapidjson           1.5
pytz                       2021.3
PyYAML                     6.0
pyzmq                      22.2.1
qdldl                      0.1.5.post0
ray                        1.10.0
redis                      3.5.3
requests                   2.26.0
retrying                   1.3.3
rqalpha                    4.7.0
rqalpha-mod-convertible    1.2.11
rqalpha-mod-fund           0.0.6
rqalpha-mod-incremental    0.0.5a1
rqalpha-mod-optimizer2     1.0.6
rqalpha-mod-option         1.1.14
rqalpha-mod-ricequant-data 2.3.4
rqalpha-mod-rqfactor       1.0.10
rqalpha-mod-spot           1.0.8
rqalpha-plus               4.1.23
rqamsc                     0.0.2.post8
rqdatac                    2.9.42
rqdatac-fund               1.0.24
rqfactor                   1.2.0
rqoptimizer                1.2.14
rqoptimizer2               1.2.14
rqrisk                     0.0.14
rqsdk                      1.3.11
rsa                        4.7.2
scipy                      1.7.3
scs                        2.1.1.post2
Send2Trash                 1.8.0
setuptools                 58.0.4
simplejson                 3.17.6
six                        1.16.0
SQLAlchemy                 1.3.24
statsmodels                0.12.1
TA-Lib                     0.4.17
tables                     3.6.1
tabulate                   0.8.9
tenacity                   8.0.1
terminado                  0.12.1
testpath                   0.5.0
tornado                    6.1
tqdm                       4.62.3
traitlets                  5.1.1
typing-extensions          3.10.0.2
urllib3                    1.26.7
wcwidth                    0.2.5
webencodings               0.5.1
Werkzeug                   2.0.2
wheel                      0.37.0
widgetsnbextension         3.5.2
wrapt                      1.13.3
xgboost                    1.5.0
yarl                       1.7.2
zipp                       3.6.0

rkooo567 commented 2 years ago

How often do you see this? Also cc @scv119

newmanwang commented 2 years ago

Only once

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!