ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray is failing with: Fatal Python error: Cannot recover from stack overflow. #36375

Closed: XavierGeerinck closed this issue 1 year ago

XavierGeerinck commented 1 year ago

What happened + What you expected to happen

When I use Ray and start a training task on a local Ray environment, it starts up and then stops shortly afterwards with:

Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized

I'd love to dig deeper into what is going on, but I can't find any log entries that clarify why this is happening. The only output I get is the long stack trace included below.

Looking forward to some help on how I could debug this (and which logging I should perhaps enable). I'm happy to provide any further details.

Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized

Thread 0x00007f553c3f7640 (most recent call first):
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 1258 in channel_spin
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f5539bf6640 (most recent call first):
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 1258 in channel_spin
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f553ebf8640 (most recent call first):
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 1258 in channel_spin
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f55413f9640 (most recent call first):
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 306 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 106 in _wait_once
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 148 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 733 in result
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 258 in _poll_locked
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 362 in poll
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 868 in print_logs
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f5543bfa640 (most recent call first):
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 306 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 106 in _wait_once
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 148 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 733 in result
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 258 in _poll_locked
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 327 in poll
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 1971 in listen_error_messages
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f55463fb640 (most recent call first):
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 306 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 106 in _wait_once
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_common.py", line 148 in wait
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/grpc/_channel.py", line 733 in result
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 258 in _poll_locked
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 399 in poll
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/_private/import_thread.py", line 77 in _run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 870 in run
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/xanrin/.pyenv/versions/3.8.16/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f562c72cb80 (most recent call first):
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/tune/registry.py", line 270 in get
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 2282 in _get_env_id_and_creator
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 398 in __init__
  File "/home/xanrin/.cache/pypoetry/virtualenvs/python-train-agent-j9Vruttm-py3.8/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm_config.py", line 1071 in build
  File "/home/xanrin/masked/masked/demo/python-train-agent/train_agent/main.py", line 138 in start
  File "<string>", line 1 in <module>
[1]    22480 IOT instruction  RAY_LOG_LEVEL=debug poetry run main

When running `ray.init(configure_logging=True, logging_level=logging.DEBUG, log_to_driver=True)` to try to get more logs, I get:

my_app|my_app_ray.env.env_manager|Trying to register the environment in Ray...
my_app|utils.actor|actor|__getattr__ for SimManager and method __getstate__
2023-06-13 19:17:47,800 DEBUG gcs_utils.py:342 -- internal_kv_put b'TuneRegistry:01000000:env_creator/my_app' b'\x80\x05\x95\xc1\x01\x00\x00\x00\x00\x00\x00\x8c\x1bray.cloudpickle.cloudpickle\x94\x8c\r_builtin_type\x94\x93\x94\x8c\nMethodType\x94\x85\x94R\x94\x8c\x1dmy_app_ray.env.env_manager\x94\x8c\x15EnvManager.create_env\x94\x93\x94h\x06\x8c\nEnvManager\x94\x93\x94)\x81\x94}\x94\x8c\x07sim_mgr\x94\x8c\x15my_app.utils.actor\x94\x8c\x05Actor\x94\x93\x94)\x81\x94}\x94(\x8c\x08strategy\x94\x8c\x05local\x94\x8c\x06config\x94}\x94(\x8c\x07use_gpu\x94\x89\x8c\x05image\x94\x8c\my_app/test123r\x94u\x8c\x06logger\x94\x8c\x16my_app.utils.logger\x94\x8c\x06Logger\x94\x93\x94)\x81\x94}\x94(\x8c\x06module\x94\x8c\tmy_app\x94\x8c\x04name\x94\x8c\x1dmy_app_ray.sim.sim_manager\x94ub\x8c\x07sim_cls\x94\x8c\x1bmy_app_ray.sim.sim_local\x94\x8c\x08SimLocal\x94\x93\x94ubsb\x86\x94R\x94.' True None
my_app|my_app_ray.env.env_manager|EnvManager registered as 'my_app'
2023-06-13 19:17:47,801 DEBUG gcs_utils.py:300 -- internal_kv_get b'TuneRegistry:01000000:env_creator/my_app' None
2023-06-13 19:17:47,801 DEBUG gcs_utils.py:300 -- internal_kv_get b'TuneRegistry:01000000:env_creator/my_app' None
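
The `__getattr__ for SimManager and method __getstate__` line above hints at where the overflow likely comes from: a proxy-style wrapper that forwards attribute lookups via `__getattr__` can re-enter `__getattr__` while being pickled or unpickled, before the forwarded attribute exists, and the lookup then recurses without bound. A minimal, self-contained sketch of that failure mode (ActorProxy is illustrative, not the project's actual wrapper):

import pickle

class ActorProxy:
    """Illustrative proxy that forwards unknown attribute lookups to a wrapped object."""

    def __init__(self, target):
        self.target = target

    def __getattr__(self, name):
        # While unpickling, the instance __dict__ is still empty, so looking up
        # self.target re-enters __getattr__ and the recursion never terminates.
        return getattr(self.target, name)

proxy = ActorProxy(target=object())
data = pickle.dumps(proxy)  # works: the state is just {"target": ...}
pickle.loads(data)          # RecursionError: maximum recursion depth exceeded

Inside a Ray worker, the same unbounded recursion, triggered while the Tune registry deserializes the env creator, can surface as the "Fatal Python error: Cannot recover from stack overflow" shown above.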

Versions / Dependencies

Ray: 2.5.0
Python: 3.8.16
OS: Ubuntu 22.04.1 (also happening on the latest macOS)

Reproduction script

# Assumes Ray 2.5's RLlib PPO; "my-env" must be registered with Ray Tune first (see the sketch below).
from ray.rllib.algorithms.ppo import PPOConfig

# MyCallbacks, self.curriculum_fn_callback, and training_iters are defined elsewhere in the project.
algo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .environment(
        env="my-env",
        env_task_fn=self.curriculum_fn_callback,
    )
    .evaluation(
        evaluation_num_workers=1,
        evaluation_interval=10,
        evaluation_duration=10,
        evaluation_parallel_to_training=False,
        evaluation_config={"explore": False, "render": "human"},
    )
    .callbacks(MyCallbacks)
    .build(use_copy=False)
)

for i in range(training_iters):
    algo.train()
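
For the snippet to run, the environment referenced as env="my-env" has to be registered with Ray Tune beforehand. A minimal sketch of that registration with a placeholder creator (the real project returns its own environment here):

import gymnasium as gym
from ray.tune.registry import register_env

# Placeholder env creator; the actual project constructs its custom environment instead.
register_env("my-env", lambda env_config: gym.make("CartPole-v1"))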

Issue Severity

High: It blocks me from completing my task.

XavierGeerinck commented 1 year ago

this can be closed, main issue is that I was nesting Actors, which ray does not like! Solving this requires serializing the actors correctly by using reference IDs to the actors themselves in the __getstate__,__setstate__ and __deepcopy__ methods of python. Simply using named actors, a wrapper class (hell yeah abstraction!) and some correct magic method implementation to fetch the actor on restore fixed it.