ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[rllib] [flaky release test] long_running_impala fails with ` _dl_allocate_tls_init: Assertion `listp != NULL' failed!` #32008

Open cadedaniel opened 1 year ago

cadedaniel commented 1 year ago
Inconsistency detected by ld.so: ../elf/dl-tls.c: 517: _dl_allocate_tls_init: Assertion `listp != NULL' failed!
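
The assertion comes from glibc's dynamic loader (note the ../elf/dl-tls.c path in the ld.so message), so the glibc version of the image is a useful data point when triaging. A minimal sketch for capturing it from inside the worker environment, assuming a glibc-based Linux image:

import ctypes
import platform

# Query the C library directly for its version string; gnu_get_libc_version
# is a glibc-specific symbol, so this only works on glibc-based images.
libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p

print("python:", platform.python_version())
print("glibc :", libc.gnu_get_libc_version().decode())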
avnishn commented 1 year ago

I don't think this is still happening.

krfricke commented 1 year ago

@cadedaniel this comes up for me at the moment when trying to move CI to Python 3.8.

It consistently comes up in FixedResourceTrialRunnerTest3.testTrialNoCheckpointSave.
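
For reference, a rough sketch of how to run just that test case locally; the test file path is an assumption based on where the Tune trial runner tests usually live, so adjust it to your checkout:

import sys

import pytest

# Run only the failing test case. The path below is assumed, not taken from
# this issue; point it at wherever FixedResourceTrialRunnerTest3 lives in
# your checkout.
if __name__ == "__main__":
    sys.exit(
        pytest.main(
            [
                "-v",
                "python/ray/tune/tests/test_trial_runner_3.py"
                "::FixedResourceTrialRunnerTest3::testTrialNoCheckpointSave",
            ]
        )
    )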

Error message:

(_MockTrainer pid=4902) 2023-05-17 08:14:48,215 INFO algorithm_config.py:3399 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(_MockTrainer pid=4902) 2023-05-17 08:14:48,216 WARNING util.py:68 -- Install gputil for GPU system monitoring.
2023-05-17 08:14:56,248 WARNING worker.py:2007 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff41dc368fa7931b15d6ddbb9a01000000 Worker ID: 0e77b230651b0f688d426c42648aef416aa9c8fafa4627d31ccfc5d4 Node ID: 15666962ba36f882ecde7fb3feca0a1cdbcc52ec060b79526f53fa3e Worker IP address: 172.18.0.3 Worker port: 45825 Worker PID: 4984 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-05-17 08:14:56,249 ERROR trial_runner.py:1499 -- Trial __fake_pending: Error happened when processing _ExecutorEventType.TRAINING_RESULT.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/ray/python/ray/tune/execution/ray_trial_executor.py", line 1227, in get_next_executor_event
    future_result = ray.get(ready_future)                                                                                                             
  File "/ray/python/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper                                
    return fn(*args, **kwargs)                                             
  File "/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper 
    return func(*args, **kwargs)                                           
  File "/ray/python/ray/_private/worker.py", line 2525, in get                                                                                            raise value                                                                                                                                       
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: _MockTrainer
        actor_id: 41dc368fa7931b15d6ddbb9a01000000                                                                                                    
        pid: 4984                                                                                                                                     
        namespace: f27b9a0b-dc71-40d8-836c-85afd17db1d1                                                                                                       ip: 172.18.0.3                                                                                                                                
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection 
error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray 
stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.                                                
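
The worker log below was located by matching the Worker ID from the error against the files under the session's logs directory; a rough sketch of that lookup, assuming Ray's default /tmp/ray layout and the session_latest symlink:

import glob
import os

# Worker ID taken from the error message above; the logs directory uses
# Ray's default layout, where session_latest points at the newest session.
worker_id = "0e77b230651b0f688d426c42648aef416aa9c8fafa4627d31ccfc5d4"
logs_dir = "/tmp/ray/session_latest/logs"

# Worker log files are named worker-<worker_id>-<job_id>-<pid>.out / .err.
for path in sorted(glob.glob(os.path.join(logs_dir, f"worker-{worker_id}-*"))):
    print("====", path)
    with open(path) as f:
        print(f.read())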

Worker log

(base) root@ae0e6ad10f8e:/tmp/ray/session_2023-05-17_08-14-39_404699_23990/logs# cat worker-0e77b230651b0f688d426c42648aef416aa9c8fafa4627d31ccfc5d4-01000000-4984.err
:job_id:01000000
Inconsistency detected by ld.so: ../elf/dl-tls.c: 517: _dl_allocate_tls_init: Assertion `listp != NULL' failed!

Core worker log

[2023-05-17 08:14:54,020 I 4984 5000] accessor.cc:611: Received notification for node id = 15666962ba36f882ecde7fb3feca0a1cdbcc52ec060b79526f53fa3e, IsAlive = 1
[2023-05-17 08:14:54,020 I 4984 5000] core_worker.cc:3996: Number of alive nodes:1
[2023-05-17 08:14:54,020 I 4984 4984] event.cc:234: Set ray event level to warning
[2023-05-17 08:14:54,020 I 4984 4984] event.cc:342: Ray Event initialized for CORE_WORKER
[2023-05-17 08:14:54,024 I 4984 4984] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 41dc368fa7931b15d6ddbb9a01000000
[2023-05-17 08:14:54,024 I 4984 4984] direct_actor_task_submitter.cc:237: Connecting to actor 41dc368fa7931b15d6ddbb9a01000000 at worker 0e77b230651b0f688d426c42648aef416aa9c8fafa4627d31ccfc5d4
[2023-05-17 08:14:54,024 I 4984 4984] core_worker.cc:2631: Creating actor: 41dc368fa7931b15d6ddbb9a01000000

Here is the session/logs directory: debug.tgz

SeaOfOcean commented 1 year ago

I'm hitting the same error.