ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core_worker] fiber thread stack overflow #46086

Open jjyao opened 2 months ago

jjyao commented 2 months ago

What happened + What you expected to happen

tests/anyscale/json_constrained_decoding/test_e2e.py::test_json_mode[False-v1]
INFO 06-17 03:59:58 llm_engine.py:162] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mistral-7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.1)
INFO 06-17 03:59:59 selector.py:138] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-17 03:59:59 selector.py:50] Using XFormers backend.
INFO 06-17 04:00:01 selector.py:138] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-17 04:00:01 selector.py:50] Using XFormers backend.
INFO 06-17 04:00:02 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-17 04:00:05 model_runner.py:160] Loading model weights took 13.4966 GB
INFO 06-17 04:00:05 json_mode_manager.py:138] Use json mode v2: False
2024-06-17 04:00:05,240 INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 10.0.8.107:6379...
2024-06-17 04:00:05,247 INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at https://session-796bgh5cg3axxvt8zd4veclfsy.i.anyscaleuserdata.com/ 
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa28fe23cef530f3eb571e2d404000000 Worker ID: 4376d2b0a8ca2e1b1be881c7db31d65891f164cad02bd87f9aa1411b Node ID: 09b15f6514ebb7fba19d9abf4453806b0c4b83005414e44cae05c6a6 Worker IP address: 10.0.8.107 Worker port: 10058 Worker PID: 94430 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) *** SIGSEGV received at time=1718622005 on cpu 36 ***
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) PC: @     0x7459b2b7c5c4  (unknown)  boost::fibers::algo::round_robin::pick_next()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b4445420       1472  (unknown)
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b7c478         48  boost::fibers::wait_queue::suspend_and_wait()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b7b865         64  boost::fibers::mutex::lock()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b084e0         96  std::_Function_handler<>::_M_invoke()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b00b35         96  boost::fibers::worker_context<>::run_()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b008b0         80  boost::context::detail::fiber_entry<>()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)     @     0x7459b2b7c99f  (unknown)  make_fcontext
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440: *** SIGSEGV received at time=1718622005 on cpu 36 ***
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440: PC: @     0x7459b2b7c5c4  (unknown)  boost::fibers::algo::round_robin::pick_next()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b4445420       1472  (unknown)
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b7c478         48  boost::fibers::wait_queue::suspend_and_wait()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b7b865         64  boost::fibers::mutex::lock()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b084e0         96  std::_Function_handler<>::_M_invoke()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b00b35         96  boost::fibers::worker_context<>::run_()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b008b0         80  boost::context::detail::fiber_entry<>()
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) [2024-06-17 04:00:05,829 E 94430 94485] logging.cc:440:     @     0x7459b2b7c99f  (unknown)  make_fcontext
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430) Fatal Python error: Segmentation fault
(<asyncio.locks.Event object at 0x7459afe48fa0 [unset]> pid=94430)

Versions / Dependencies

a11312b8a9b7a95be5e01cf8dee5cc50022acc6d

Reproduction script

N/A

Issue Severity

None

edoakes commented 2 months ago

@jjyao @hongchaodeng Is there any more context on this issue, such as when it was introduced and under what conditions it occurs?

hongchaodeng commented 2 months ago

Let me give a more detailed report on the issue:

Root causes:

How to fix it: