ray-project / ray


Node exit causes entire job to crash (`Check failed: actor_entry != actor_registry_.end()`) #7788

richardliaw closed this issue 4 years ago

richardliaw commented 4 years ago

What is the problem?

Python 3.7, PyTorch 1.4, Ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Launch the script with: ray submit [cluster] cifar_pytorch_example.py --args="--use-gpu --tune --num-epochs 20 --num-workers 4 --address='auto'"

The script itself: https://github.com/richardliaw/ray/blob/improve_tune_trainable/python/ray/util/sgd/torch/examples/cifar_pytorch_example.py
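
The failure mode in the title roughly boils down to the pattern below (a minimal, hypothetical sketch, not the linked example; the Worker actor and the resource counts are illustrative): an actor runs on a worker node, the node exits (for example a spot preemption), and the caller should see a RayActorError. The bug is that the raylet dies instead.

# Minimal sketch of the failure mode (illustrative, not the linked example).
import ray
from ray.exceptions import RayActorError

ray.init(address="auto")  # attach to the cluster started by the YAML below

@ray.remote(num_gpus=1)
class Worker:
    def step(self):
        return "ok"

workers = [Worker.remote() for _ in range(4)]

try:
    print(ray.get([w.step.remote() for w in workers]))
except RayActorError:
    # Expected path if a node hosting one of the actors exits mid-task;
    # here the raylet's `actor_registry_` check failure kills the job instead.
    print("an actor died because its node exited")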

Cluster YAML:

# A unique identifier for the head node and workers of this cluster.
cluster_name: sgd-pytorch-richard

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers (which defaults to 0).
min_workers: 2
initial_workers: 2
max_workers: 2

target_utilization_fraction: 0.9
# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1c

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

head_node:
    InstanceType: p3.8xlarge
    ImageId: ami-0698bcaf8bd9ef56d
    KeyName: anyscale
    InstanceMarketOptions:
        MarketType: spot
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 300

worker_nodes:
    InstanceType: p3.8xlarge
    ImageId: ami-0698bcaf8bd9ef56d
    InstanceMarketOptions:
        MarketType: spot
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 300
        # SpotOptions:
        #     MaxPrice: "9.0"
    #     # Run workers on spot by default. Comment this out to use on-demand.
    #     InstanceMarketOptions:
    #         MarketType: spot

setup_commands:
    # This replaces the standard anaconda Ray installation
    - ray || pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
    # Uncomment this and the filemount to update the Ray installation with your local Ray code
    - rm -rf ./anaconda3/lib/python3.6/site-packages/ray/util/sgd/
    - cp -rf ~/sgd ./anaconda3/lib/python3.6/site-packages/ray/util/
    - pip install -q tqdm

    # Installing this without -U to make sure we don't replace the existing Ray installation
    - pip install ray[rllib]
    - pip install -U ipdb torch torchvision
    # Install Apex
    # - rm -rf apex || true
    # - git clone https://github.com/NVIDIA/apex && cd apex && pip install -v --no-cache-dir  ./ || true

file_mounts: {
    # This should point to ray/python/ray/util/sgd.
    ~/sgd: ../../../sgd,
}

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000
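
For reference, the --address='auto' in the ray submit command above is just the driver side of the worker_start_ray_commands: the script attaches to the already-running cluster. A minimal sketch (the printout is only a sanity check, not part of the example):

# Driver-side attach to the cluster launched by this YAML.
# address="auto" resolves to the Redis address written by `ray start --head`.
import ray

ray.init(address="auto")
print(ray.cluster_resources())  # should report the CPUs/GPUs of the p3.8xlarge nodes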

If we cannot run your script, we cannot fix your issue.

== Status ==
Memory usage on this node: 16.4/240.1 GiB
PopulationBasedTraining: 11 checkpoints, 8 perturbs
Resources requested: 8/96 CPUs, 8/12 GPUs, 0.0/705.57 GiB heap, 0.0/1.9 GiB objects
Result logdir: /home/ubuntu/ray_results/TorchTrainable
Number of trials: 4 (2 PAUSED, 2 RUNNING)
+----------------------+----------+-------+----------+--------+------------------+----------+--------+
| Trial name           | status   | loc   |       lr |   iter |   total time (s) |     loss |    acc |
|----------------------+----------+-------+----------+--------+------------------+----------+--------|
| TorchTrainable_00001 | RUNNING  |       | 1.02483  |     10 |          114.486 | 0.551072 | 0.8252 |
| TorchTrainable_00002 | RUNNING  |       | 0.12     |     10 |          113.373 | 0.662958 | 0.7832 |
| TorchTrainable_00003 | PAUSED   |       | 0.144    |     11 |          125.843 | 0.576303 | 0.8112 |
| TorchTrainable_00000 | PAUSED   |       | 0.854022 |     10 |          114.712 | 0.716908 | 0.77   |
+----------------------+----------+-------+----------+--------+------------------+----------+--------+
Number of errored trials: 2
+----------------------+--------------+---------------------------------------------------------------------------------------------------------+
| Trial name           |   # failures | error file                                                                                              |
|----------------------+--------------+---------------------------------------------------------------------------------------------------------|
| TorchTrainable_00001 |            1 | /home/ubuntu/ray_results/TorchTrainable/TorchTrainable_1_lr=0.1_2020-03-28_18-13-226lj8i1r9/error.txt   |
| TorchTrainable_00000 |            1 | /home/ubuntu/ray_results/TorchTrainable/TorchTrainable_0_lr=0.001_2020-03-28_18-13-22yut1rzp0/error.txt |
+----------------------+--------------+---------------------------------------------------------------------------------------------------------+

(pid=81357) 2020-03-28 18:22:56,968 INFO trainable.py:390 -- Checkpoint size is 89463091 bytes
(pid=81326) 2020-03-28 18:22:59,501 ERROR worker.py:426 -- SystemExit was raised from the worker
(pid=81326) Traceback (most recent call last):
(pid=81326)   File "python/ray/_raylet.pyx", line 506, in ray._raylet.task_execution_handler
(pid=81326)   File "python/ray/_raylet.pyx", line 343, in ray._raylet.execute_task
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/function_manager.py", line 398, in load_actor_class
(pid=81326)     job_id, actor_creation_function_descriptor)
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/function_manager.py", line 495, in _load_actor_class_from_gcs
(pid=81326)     actor_class = pickle.loads(pickled_class)
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
(pid=81326)     from ray.util.sgd.torch import TorchTrainer
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/__init__.py", line 10, in <module>
(pid=81326)     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/torch_trainer.py", line 12, in <module>
(pid=81326)     from ray.tune import Trainable
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/__init__.py", line 2, in <module>
(pid=81326)     from ray.tune.tune import run_experiments, run
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/tune.py", line 12, in <module>
(pid=81326)     from ray.tune.trial_runner import TrialRunner
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 13, in <module>
(pid=81326)     from ray.tune.progress_reporter import trial_progress_str
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/progress_reporter.py", line 11, in <module>
(pid=81326)     from tabulate import tabulate
(pid=81326)   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
(pid=81326)   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
(pid=81326)   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
(pid=81326)   File "<frozen importlib._bootstrap_external>", line 674, in exec_module
(pid=81326)   File "<frozen importlib._bootstrap_external>", line 764, in get_code
(pid=81326)   File "<frozen importlib._bootstrap_external>", line 832, in get_data
(pid=81326)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 423, in sigterm_handler
(pid=81326)     sys.exit(1)
(pid=81326) SystemExit: 1
(pid=raylet) F0328 18:22:59.502856 76220 node_manager.cc:1089]  Check failed: actor_entry != actor_registry_.end()
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @     0x558c972be8dd  google::LogMessage::Fail()
(pid=raylet)     @     0x558c972bfd4c  google::LogMessage::SendToLog()
(pid=raylet)     @     0x558c972be5b9  google::LogMessage::Flush()
(pid=raylet)     @     0x558c972be7d1  google::LogMessage::~LogMessage()
(pid=raylet)     @     0x558c96f830c9  ray::RayLog::~RayLog()
(pid=raylet)     @     0x558c96d8efb0  ray::raylet::NodeManager::HandleDisconnectedActor()
(pid=raylet)     @     0x558c96d93541  ray::raylet::NodeManager::ProcessDisconnectClientMessage()
(pid=raylet)     @     0x558c96d95b75  ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet)     @     0x558c96d2447d  _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionIN5boost4asio5local15stream_protocolEEEElPKhEZNS1_6raylet6Raylet12HandleAcceptERKNS3_6system10error_codeEEUlS8_lSA_E0_E9_M_invokeERKSt9_Any_dataS8_lSA_
(pid=raylet)     @     0x558c96f55e0e  ray::ClientConnection<>::ProcessMessage()
(pid=raylet)     @     0x558c96f5628e  ray::ClientConnection<>::ProcessMessageHeader()
(pid=raylet)     @     0x558c96f5427d  boost::asio::detail::read_op<>::operator()()
(pid=raylet)     @     0x558c96f5537a  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet)     @     0x558c9725186f  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @     0x558c972528f1  boost::asio::detail::scheduler::run()
(pid=raylet)     @     0x558c97253792  boost::asio::io_context::run()
(pid=raylet)     @     0x558c96d0773e  main
(pid=raylet)     @     0x7fb88b844b97  __libc_start_main
(pid=raylet)     @     0x558c96d17a01  (unknown)
(pid=81332) 2020-03-28 18:22:59,606 ERROR worker.py:426 -- SystemExit was raised from the worker
(pid=81332) Traceback (most recent call last):
(pid=81332)   File "python/ray/_raylet.pyx", line 506, in ray._raylet.task_execution_handler
(pid=81332)   File "python/ray/_raylet.pyx", line 425, in ray._raylet.execute_task
(pid=81332)   File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
(pid=81332)   File "python/ray/_raylet.pyx", line 443, in ray._raylet.execute_task
(pid=81332)   File "python/ray/_raylet.pyx", line 445, in ray._raylet.execute_task
(pid=81332)   File "python/ray/_raylet.pyx", line 423, in ray._raylet.execute_task.function_executor
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/function_manager.py", line 560, in actor_method_executor
(pid=81332)     method_returns = method(actor, *args, **kwargs)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/torch_runner.py", line 214, in validate
(pid=81332)     iterator, info=info)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 272, in validate
(pid=81332)     for batch_idx, batch in enumerate(val_iterator):
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
(pid=81332)     return _MultiProcessingDataLoaderIter(self)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
(pid=81332)     w.start()
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 105, in start
(pid=81332)     self._popen = self._Popen(self)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
(pid=81332)     return _default_context.get_context().Process._Popen(process_obj)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
(pid=81332)     return Popen(process_obj)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
(pid=81332)     self._launch(process_obj)
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
(pid=81332)     self.pid = os.fork()
(pid=81332)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 423, in sigterm_handler
(pid=81332)     sys.exit(1)
(pid=81332) SystemExit: 1
2020-03-28 18:22:59,901 WARNING worker.py:1072 -- A worker died or was killed while executing task ffffffffffffffffeb124b350200.
(pid=81331) E0328 18:23:00.165781 81395 core_worker.cc:382] Raylet failed. Shutting down.
(pid=81331) Traceback (most recent call last):
(pid=81331)   File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
(pid=81331)   File "python/ray/_raylet.pyx", line 443, in ray._raylet.execute_task
(pid=81331)   File "python/ray/_raylet.pyx", line 445, in ray._raylet.execute_task
(pid=81331)   File "python/ray/_raylet.pyx", line 423, in ray._raylet.execute_task.function_executor
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/function_manager.py", line 568, in actor_method_executor
(pid=81331)     raise e
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/function_manager.py", line 560, in actor_method_executor
(pid=81331)     method_returns = method(actor, *args, **kwargs)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 138, in train_epoch
(pid=81331)     return super(DistributedTorchRunner, self).train_epoch(**kwargs)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/torch_runner.py", line 192, in train_epoch
(pid=81331)     train_stats = self.training_operator.train_epoch(iterator, info)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 167, in train_epoch
(pid=81331)     metrics = self.train_batch(batch, batch_info=batch_info)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 226, in train_batch
(pid=81331)     output = self.model(features)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
(pid=81331)     result = self.forward(*input, **kwargs)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
(pid=81331)     self._sync_params()
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 515, in _sync_params
(pid=81331)     self.broadcast_bucket_size)
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
(pid=81331)     dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
(pid=81331) RuntimeError: NCCL error: unhandled system error, NCCL version 2.4.8
(pid=81331)
(pid=81331) During handling of the above exception, another exception occurred:
(pid=81331)
(pid=81331) Traceback (most recent call last):
(pid=81331)   File "python/ray/_raylet.pyx", line 506, in ray._raylet.task_execution_handler
(pid=81331)   File "python/ray/_raylet.pyx", line 425, in ray._raylet.execute_task
(pid=81331)   File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/utils.py", line 90, in push_error_to_driver
(pid=81331)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=81331)   File "python/ray/_raylet.pyx", line 1146, in ray._raylet.CoreWorker.push_error
(pid=81331)   File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
(pid=81331) ray.exceptions.RayletError: The Raylet died with this message: [RayletClient] Connection closed unexpectedly.
(pid=81331)
(pid=81331) During handling of the above exception, another exception occurred:
(pid=81331)
(pid=81331) Traceback (most recent call last):
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/utils.py", line 90, in push_error_to_driver
(pid=81331)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=81331)   File "python/ray/_raylet.pyx", line 1146, in ray._raylet.CoreWorker.push_error
(pid=81331)   File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
(pid=81331) ray.exceptions.RayletError: The Raylet died with this message: [RayletClient] Connection closed unexpectedly.
(pid=81331) Exception ignored in: 'ray._raylet.task_execution_handler'
(pid=81331) Traceback (most recent call last):
(pid=81331)   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/utils.py", line 90, in push_error_to_driver
(pid=81331)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=81331)   File "python/ray/_raylet.pyx", line 1146, in ray._raylet.CoreWorker.push_error
(pid=81331)   File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
(pid=81331) ray.exceptions.RayletError: The Raylet died with this message: [RayletClient] Connection closed unexpectedly.
(pid=81329) E0328 18:23:00.161756 81384 core_worker.cc:382] Raylet failed. Shutting down.
(pid=81330) E0328 18:23:00.162896 81392 core_worker.cc:382] Raylet failed. Shutting down.
(pid=81322) E0328 18:23:00.160723 81382 core_worker.cc:382] Raylet failed. Shutting down.
E0328 18:23:00.229985 77964 task_manager.cc:254] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.torch_trainer, class_name=TorchTrainable, function_name=train, function_hash=}, task_id=52b52e15c1551ba3a7b8e8450200, job_id=0200, num_args=0, num_returns=2, actor_task_spec={actor_id=a7b8e8450200, actor_caller_id=ffffffffffffffffffffffff0200, actor_counter=1}
2020-03-28 18:23:00,230 ERROR trial_runner.py:521 -- Trial TorchTrainable_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 381, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 1507, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0328 18:23:00.231878 77928 task_manager.cc:254] Task failed: IOError: sent task to dead actor: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.torch_trainer, class_name=TorchTrainable, function_name=stop, function_hash=}, task_id=63333513780ca4d4a7b8e8450200, job_id=0200, num_args=0, num_returns=2, actor_task_spec={actor_id=a7b8e8450200, actor_caller_id=ffffffffffffffffffffffff0200, actor_counter=2}
(pid=78269, ip=172.31.32.221) E0328 18:23:00.309044 78320 task_manager.cc:254] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.distributed_torch_runner, class_name=DistributedTorchRunner, function_name=train_epoch, function_hash=}, task_id=0b1b7e38e7dc8ab1d27765b90200, job_id=0200, num_args=6, num_returns=2, actor_task_spec={actor_id=d27765b90200, actor_caller_id=ffffffffffffffff05fe13330200, actor_counter=3}
E0328 18:23:00.510133 77964 task_manager.cc:236] 3 retries left for task ffffffffffffffffb4ce6d050200, attempting to resubmit.
E0328 18:23:00.510215 77964 core_worker.cc:212] Will resubmit task after a 5000ms delay: Type=ACTOR_CREATION_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.torch_trainer, class_name=TorchTrainable, function_name=__init__, function_hash=94628090179928}, task_id=ffffffffffffffffb4ce6d050200, job_id=0200, num_args=4, num_returns=1, actor_creation_task_spec={actor_id=b4ce6d050200, max_reconstructions=0, max_concurrency=1, is_asyncio_actor=0, is_detached=0}
(pid=78269, ip=172.31.32.221) 2020-03-28 18:23:02,182   WARNING torch_trainer.py:416 -- NCCL error: unhandled system error, NCCL version 2.4.8
2020-03-28 18:23:04,859 WARNING worker.py:1072 -- A worker died or was killed while executing task ffffffffffffffff4f3c7b690200.
F0328 18:23:05.639518 77964 direct_task_transport.cc:167] IOError: 14: failed to connect to all addresses
*** Check failure stack trace: ***
    @     0x7f533815a53d  google::LogMessage::Fail()
    @     0x7f533815b9ac  google::LogMessage::SendToLog()
    @     0x7f533815a219  google::LogMessage::Flush()
    @     0x7f533815a431  google::LogMessage::~LogMessage()
    @     0x7f5337ee5ef9  ray::RayLog::~RayLog()
    @     0x7f5337cdbded  ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
    @     0x7f5337cde577  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc23RequestWorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt5tupleIIiSt6vectorINS0_8ObjectIDESaISC_EENS0_7ActorIDEEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7f5337d1ccf5  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7f5337ca4c30  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f533810ad8f  boost::asio::detail::scheduler::do_run_one()
    @     0x7f533810b641  boost::asio::detail::scheduler::run()
    @     0x7f533810c402  boost::asio::io_context::run()
    @     0x7f5337c90550  ray::CoreWorker::RunIOService()
    @     0x7f53880c2f70  execute_native_thread_routine_compat
    @     0x7f5398fd96db  start_thread
    @     0x7f5398d0288f  clone
Shared connection to 34.228.215.12 closed.
Error: Command failed:

  ssh -tt -i /Users/rliaw/Research/ec2/clustercfgs/anyscale_cfgs/anyscale.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_97b4747bb8/eddd1d9a4c/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@34.228.215.12 bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && python ~/cifar_pytorch_example.py --use-gpu --tune  --num-epochs 20 --num-workers 4 --address='"'"'"'"'"'"'"'"'auto'"'"'"'"'"'"'"'"''"'"''
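
To spell out the failure sequence in the log above: the actor's death is supposed to surface as a RayActorError from ray.get inside Tune's trial runner (as in the TorchTrainable_00001 traceback), which Tune can then record as a trial failure and retry; the actual bug is the raylet FATAL (Check failed: actor_entry != actor_registry_.end()) that precedes it and takes the node down. A rough sketch of that intended handling path (fetch_result and the timeout value here are illustrative, not Tune's exact code):

# Illustrative sketch: an actor death should surface as RayActorError and be
# handled, rather than crashing the raylet.
import ray
from ray.exceptions import RayActorError

def fetch_result(trial_future, timeout_s=60):  # hypothetical helper, not Tune's code
    try:
        return ray.get(trial_future, timeout=timeout_s)
    except RayActorError:
        # The actor (and possibly its node) is gone; mark the trial as errored
        # and let the scheduler decide whether to restart it.
        return None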
richardliaw commented 4 years ago

Hey look, the entire cluster died:

2020-03-28 11:49:19,414 INFO updater.py:264 -- NodeUpdater: i-0bff9d862ad11f9be: Running python ~/cifar_pytorch_example.py --use-gpu --num-epochs 20 --num-workers 4 --address='auto' on 34.228.215.12...
2020-03-28 18:49:21,144 WARNING worker.py:792 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
E0328 18:49:21.674247 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 1, num_retries = 5)
E0328 18:49:22.174716 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 2, num_retries = 5)
E0328 18:49:22.674959 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 3, num_retries = 5)
E0328 18:49:23.175170 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 4, num_retries = 5)
F0328 18:49:23.675355 81818 raylet_client.cc:83] Could not connect to socket /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet
*** Check failure stack trace: ***
    @     0x7fbc5da7753d  google::LogMessage::Fail()
    @     0x7fbc5da789ac  google::LogMessage::SendToLog()
    @     0x7fbc5da77219  google::LogMessage::Flush()
    @     0x7fbc5da77431  google::LogMessage::~LogMessage()
    @     0x7fbc5d802ef9  ray::RayLog::~RayLog()
    @     0x7fbc5d6408f3  ray::raylet::RayletConnection::RayletConnection()
    @     0x7fbc5d641239  ray::raylet::RayletClient::RayletClient()
    @     0x7fbc5d5ed2ab  ray::CoreWorker::CoreWorker()
    @     0x7fbc5d55e94e  __pyx_tp_new_3ray_7_raylet_CoreWorker()
    @     0x560d1aac30e5  type_call
    @     0x560d1aa35bcb  _PyObject_FastCallDict
    @     0x560d1aac2f4e  call_function
    @     0x560d1aae794a  _PyEval_EvalFrameDefault
    @     0x560d1aabc62e  _PyEval_EvalCodeWithName
    @     0x560d1aabd1cf  fast_function
    @     0x560d1aac2ed5  call_function
    @     0x560d1aae8715  _PyEval_EvalFrameDefault
    @     0x560d1aabc206  _PyEval_EvalCodeWithName
    @     0x560d1aabd1cf  fast_function
    @     0x560d1aac2ed5  call_function
    @     0x560d1aae8715  _PyEval_EvalFrameDefault
    @     0x560d1aabdcb9  PyEval_EvalCodeEx
    @     0x560d1aabea4c  PyEval_EvalCode
    @     0x560d1ab3ac44  run_mod
    @     0x560d1ab3b041  PyRun_FileExFlags
    @     0x560d1ab3b244  PyRun_SimpleFileExFlags
    @     0x560d1ab3ed24  Py_Main
    @     0x560d1aa0675e  main
    @     0x7fbcbe51fb97  __libc_start_main
    @     0x560d1aaee47b  (unknown)
Shared connection to 34.228.215.12 closed.
Error: Command failed:
robertnishihara commented 4 years ago

Should this be a release blocker?

verbose-void commented 4 years ago

Same here. Did you find a way around this?

richardliaw commented 4 years ago

I think this should be resolved in ray==1.0.