ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] OwnerDiedError in cluster. #31265

Open kong13661 opened 1 year ago

kong13661 commented 1 year ago

What happened + What you expected to happen

I tried the code from the documentation page Using PyTorch Lightning with Tune.

I started the cluster by running `ray start --head --port $port --dashboard-port 24132` on machine A, and then ran `ray start --address=$address:$port` on machine B to add that machine to the cluster. I run the code on machine A.
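
For reference, a quick way to confirm that both machines actually joined the same cluster is to attach a driver on machine A and list the nodes (a minimal sketch; `address="auto"` assumes the driver runs on a machine that is already part of the cluster):

import ray

# Attach to the running cluster from machine A (which ran `ray start --head`).
ray.init(address="auto")

# One alive entry should show up per machine (A and B).
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])

ray.shutdown()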

All trials running on machine B raise the error below:

Failure # 1 (occurred at 2022-12-21_12-16-16)
ray::ImplicitFunc.train() (pid=97256, ip=121.48.163.20, repr=train_mnist_tune)
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
    return self._trainable_func(
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 394, in _inner
    return inner(config, checkpoint_dir=None)
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 385, in inner
    fn_kwargs[k] = parameter_registry.get(prefix + k)
  File "/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/tune/registry.py", line 234, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (`/tmp/ray/session_latest/logs/*02000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 121.48.161.92) for more information about the Python worker failure.
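
Following the hint in that message, ref creation sites can be recorded with something like the sketch below on the driver side (how far the `runtime_env` propagation reaches, and whether the variable also has to be exported before `ray start` on each node, is my assumption):

import os
import ray

# Record where ObjectRefs are created so OwnerDiedError reports include the creation site.
os.environ["RAY_record_ref_creation_sites"] = "1"

ray.init(
    address="auto",
    # Forward the flag to worker processes as well (assumption: env_vars in runtime_env reach the trial workers).
    runtime_env={"env_vars": {"RAY_record_ref_creation_sites": "1"}},
)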

Versions / Dependencies

The Ray version is 2.2.0.

Reproduction script

The code is below:

import math

import torch
import pytorch_lightning as pl
from filelock import FileLock
from torch.utils.data import DataLoader, random_split
from torch.nn import functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
import os

from pytorch_lightning.loggers import TensorBoardLogger
from ray import air, tune
from ray.air import session
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from ray.tune.integration.pytorch_lightning import TuneReportCallback, \
    TuneReportCheckpointCallback
import ray

class LightningMNISTClassifier(pl.LightningModule):
    """
    This has been adapted from
    https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09
    """

    def __init__(self, config, data_dir=None):
        super(LightningMNISTClassifier, self).__init__()

        self.data_dir = data_dir or os.getcwd()

        self.layer_1_size = config["layer_1_size"]
        self.layer_2_size = config["layer_2_size"]
        self.lr = config["lr"]
        self.batch_size = config["batch_size"]

        # mnist images are (1, 28, 28) (channels, width, height)
        self.layer_1 = torch.nn.Linear(28 * 28, self.layer_1_size)
        self.layer_2 = torch.nn.Linear(self.layer_1_size, self.layer_2_size)
        self.layer_3 = torch.nn.Linear(self.layer_2_size, 10)

    def forward(self, x):
        batch_size, channels, width, height = x.size()
        x = x.view(batch_size, -1)

        x = self.layer_1(x)
        x = torch.relu(x)

        x = self.layer_2(x)
        x = torch.relu(x)

        x = self.layer_3(x)
        x = torch.log_softmax(x, dim=1)

        return x

    def cross_entropy_loss(self, logits, labels):
        return F.nll_loss(logits, labels)

    def accuracy(self, logits, labels):
        _, predicted = torch.max(logits.data, 1)
        correct = (predicted == labels).sum().item()
        accuracy = correct / len(labels)
        return torch.tensor(accuracy)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)

        self.log("ptl/train_loss", loss)
        self.log("ptl/train_accuracy", accuracy)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)
        return {"val_loss": loss, "val_accuracy": accuracy}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        avg_acc = torch.stack([x["val_accuracy"] for x in outputs]).mean()
        self.log("ptl/val_loss", avg_loss)
        self.log("ptl/val_accuracy", avg_acc)

    @staticmethod
    def download_data(data_dir):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307, ), (0.3081, ))
        ])
        with FileLock(os.path.expanduser("~/.data.lock")):
            return MNIST(data_dir, train=True, download=True, transform=transform)

    def prepare_data(self):
        mnist_train = self.download_data(self.data_dir)

        self.mnist_train, self.mnist_val = random_split(
            mnist_train, [55000, 5000])

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=int(self.batch_size))

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=int(self.batch_size))

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

def train_mnist_tune(config, num_epochs=10, num_gpus=0, data_dir="/home/kongfei/data/datasets/pytorch"):
    data_dir = os.path.expanduser(data_dir)
    model = LightningMNISTClassifier(config, data_dir)
    trainer = pl.Trainer(
        max_epochs=num_epochs,
        # If fractional GPUs passed in, convert to int.
        gpus=math.ceil(num_gpus),
        logger=TensorBoardLogger(
            save_dir=os.getcwd(), name="", version="."),
        enable_progress_bar=False,
        callbacks=[
            TuneReportCallback(
                {
                    "loss": "ptl/val_loss",
                    "mean_accuracy": "ptl/val_accuracy"
                },
                on="validation_end")
        ])
    trainer.fit(model)

def tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=1, data_dir="/home/kongfei/data/datasets/pytorch"):
    ray.init(address='ray://kongfei.kongfei.life:10001', runtime_env={"env_vars": {"PL_DISABLE_FORK": "1", "RAY_record_ref_creation_sites": "1",
                                                                                   "Utils_path": "/home/kongfei/workspace/utils"}},)

    config = {
        "layer_1_size": tune.choice([32, 64, 128]),
        "layer_2_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
        # "data_dir": data_dir,
    }

    scheduler = ASHAScheduler(
        max_t=num_epochs,
        grace_period=1,
        reduction_factor=2)

    reporter = CLIReporter(
        parameter_columns=["layer_1_size", "layer_2_size", "lr", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    train_fn_with_parameters = tune.with_parameters(train_mnist_tune,
                                                    num_epochs=num_epochs,
                                                    num_gpus=gpus_per_trial,
                                                    data_dir=data_dir)
    resources_per_trial = {"gpu": gpus_per_trial}

    tuner = tune.Tuner(
        tune.with_resources(
            train_fn_with_parameters,
            resources=resources_per_trial
        ),
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=scheduler,
            num_samples=num_samples,
        ),
        run_config=air.RunConfig(
            name="tune_mnist_asha",
            progress_reporter=reporter,
        ),
        param_space=config,
    )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)

tune_mnist_asha()

In fact, if I don't wrap the train_mnist_tune function with the `tune.with_parameters` API, this error disappears. But in another script, even after I removed `tune.with_parameters`, all trials raised the same error after running for a long time. I guess it is caused by the same bug.
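
For completeness, the variant without `tune.with_parameters` (the one where the error disappears) looks roughly like this inside `tune_mnist_asha` (a sketch using `functools.partial`; my understanding is that the bound arguments are then pickled together with the function instead of being stored as driver-owned objects in the parameter registry, which is exactly the `ray.get(self.references[k])` call that fails above):

import functools

# scheduler, reporter, config, num_samples, gpus_per_trial, num_epochs, data_dir
# are the same objects defined in the script above.
train_fn_with_parameters = functools.partial(
    train_mnist_tune,
    num_epochs=num_epochs,
    num_gpus=gpus_per_trial,
    data_dir=data_dir,
)

tuner = tune.Tuner(
    tune.with_resources(train_fn_with_parameters, resources={"gpu": gpus_per_trial}),
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
    ),
    run_config=air.RunConfig(name="tune_mnist_asha", progress_reporter=reporter),
    param_space=config,
)
results = tuner.fit()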

Issue Severity

High: It blocks me from completing my task.

amogkam commented 1 year ago

Hey @kong13661, can you check the file that it mentions in the logs and share what the failure is?

Check cluster logs (`/tmp/ray/session_latest/logs/*02000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 121.48.161.92) for more information about the Python worker failure.
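
(The matching files can be listed from Python on that node with a quick glob; just a convenience sketch, not part of any Ray API:)

import glob

# On the node at 121.48.161.92: list the core worker / driver logs for the owner that died.
pattern = "/tmp/ray/session_latest/logs/*02000000ffffffffffffffffffffffffffffffffffffffffffffffff*"
for path in sorted(glob.glob(pattern)):
    print(path)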
kong13661 commented 1 year ago

Hey @kong13661, can you check the file that it mentions in the logs and share what the failure is?

Check cluster logs (`/tmp/ray/session_latest/logs/*02000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 121.48.161.92) for more information about the Python worker failure.

I opened the file python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1625648.log. The content is below:

[2022-12-24 16:41:30,975 I 1625648 1625671] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 1625648
[2022-12-24 16:41:30,977 I 1625648 1625671] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-12-24 16:41:30,981 I 1625648 1625671] grpc_server.cc:120: driver server started, listening on port 10005.
[2022-12-24 16:41:30,988 I 1625648 1625671] core_worker.cc:215: Initializing worker at address: 121.48.161.92:10005, worker ID 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 5216cced2c8f808f07ce197961be6c17ba98575a675371e7774e6b0e
[2022-12-24 16:41:30,989 I 1625648 1625690] core_worker.cc:510: Event stats:

Global stats: 12 total (8 active)
Queueing time: mean = 31.500 us, max = 202.670 us, min = 22.083 us, total = 378.002 us
Execution time:  mean = 19.715 us, total = 236.579 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 3.365 us, total = 20.187 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 194.379 us, total = 194.379 us
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 22.013 us, total = 22.013 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2022-12-24 16:41:30,990 I 1625648 1625690] accessor.cc:612: Received notification for node id = f5ea429824248b6276129d8c163cc2677085f13e201496bc3d2b0c5c, IsAlive = 1
[2022-12-24 16:41:30,990 I 1625648 1625690] accessor.cc:612: Received notification for node id = 5216cced2c8f808f07ce197961be6c17ba98575a675371e7774e6b0e, IsAlive = 1
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:635: Disconnecting to the raylet.
[2022-12-24 16:42:19,130 I 1625648 1625671] raylet_client.cc:165: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:578: Shutting down a core worker.
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:602: Disconnecting a GCS client.
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:606: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-12-24 16:42:19,130 I 1625648 1625690] core_worker.cc:734: Core worker main io service stopped.
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:615: Core worker ready to be deallocated.
[2022-12-24 16:42:19,130 I 1625648 1625671] core_worker.cc:569: Core worker is destructed
[2022-12-24 16:42:19,262 I 1625648 1625671] core_worker_process.cc:150: Destructing CoreWorkerProcessImpl. pid: 1625648
[2022-12-24 16:42:19,262 I 1625648 1625671] io_service_pool.cc:47: IOServicePool is stopped.
[2022-12-24 16:42:19,993 C 1625648 1625703] counter_map.h:74:  Check failed: it != counters_.end() 
*** StackTrace Information ***
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(+0xcf00ea) [0x7ff6035bb0ea] ray::operator<<()
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(+0xcf1bd2) [0x7ff6035bcbd2] ray::SpdLogMessage::Flush()
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7ff6035bcee7] ray::RayLog::~RayLog()
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(+0x6050ea) [0x7ff602ed00ea] CounterMap<>::Decrement()
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker3GetERKSt6vectorINS_8ObjectIDESaIS3_EElPS2_ISt10shared_ptrINS_9RayObjectEESaISA_EE+0xad3) [0x7ff602f04a53] ray::core::CoreWorker::Get()
/home/kongfei/miniconda3/envs/pytorch-1.13/lib/python3.8/site-packages/ray/_raylet.so(+0x518647) [0x7ff602de3647] __pyx_pw_3ray_7_raylet_10CoreWorker_31get_objects()
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4dbfe5] method_vectorcall_VARARGS_KEYWORDS
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c806e] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c64d5] _PyEval_EvalCodeWithName
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4e8c84] method_vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x17bd) [0x4c8ded] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c64d5] _PyEval_EvalCodeWithName
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x19c) [0x4da73c] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(PyObject_Call+0x5e) [0x4ecc3e] PyObject_Call
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x2040) [0x4c9670] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c64d5] _PyEval_EvalCodeWithName
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x19c) [0x4da73c] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x17bd) [0x4c8ded] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4f4b1b] gen_send_ex
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x1d5c) [0x4c938c] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4f4b1b] gen_send_ex
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x547cec] builtin_next
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4dafa7] cfunction_vectorcall_FASTCALL
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x907) [0x4c7f37] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x907) [0x4c7f37] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x907) [0x4c7f37] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c64d5] _PyEval_EvalCodeWithName
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x19c) [0x4da73c] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(PyObject_Call+0x5e) [0x4ecc3e] PyObject_Call
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x2040) [0x4c9670] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c806e] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(PyObject_Call+0x5e) [0x4ecc3e] PyObject_Call
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0x2040) [0x4c9670] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c806e] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c806e] _PyEval_EvalFrameDefault
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(_PyFunction_Vectorcall+0x106) [0x4da6a6] _PyFunction_Vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x4e8cd5] method_vectorcall
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python(PyObject_Call+0x5e) [0x4ecc3e] PyObject_Call
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x59c037] t_bootstrap
/home/kongfei/miniconda3/envs/pytorch-1.13/bin/python() [0x59bf94] pythread_wrapper
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff60453e609] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff604309133] __clone

@amogkam

smiraldr commented 1 year ago

Hey @amogkam, I also hit the `object's owner has exited` error. Is this related? Can you check once?

I ran this example from the Ray docs for Hugging Face + Ray Train, with some modifications: https://gist.github.com/smiraldr/be875e09281e4a55fa600da9c143b8bf
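
The driver-side setup is essentially what shows up in the traceback below; reconstructed roughly (a sketch, with `train_func` standing in for the Hugging Face training loop from the gist):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Hugging Face training loop from the gist (omitted here).
    ...

ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
ray_trainer.fit()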

Runtime logs:

(base) root@gc-unruffled-jemison:~# python3 hf.py 
2023-10-30 08:48:45.472739: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-30 08:48:45.646029: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-30 08:48:46.736852: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-30 08:48:46.736997: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-30 08:48:46.737021: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
2023-10-30 08:48:51,627 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 192.168.16.197:6379...
2023-10-30 08:48:51,639 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 192.168.16.197:8265 
2023-10-30 08:48:51,733 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/TorchTrainer_2023-10-30_08-48-51
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/TorchTrainer_2023-10-30_08-48-51`
(TrainTrainable pid=581, ip=192.168.8.173) Local object store memory usage:
(TrainTrainable pid=581, ip=192.168.8.173) 
(TrainTrainable pid=581, ip=192.168.8.173) (global lru) capacity: 12456522547
(TrainTrainable pid=581, ip=192.168.8.173) (global lru) used: 0%
(TrainTrainable pid=581, ip=192.168.8.173) (global lru) num objects: 0
(TrainTrainable pid=581, ip=192.168.8.173) (global lru) num evictions: 0
(TrainTrainable pid=581, ip=192.168.8.173) (global lru) bytes evicted: 0
(TrainTrainable pid=581, ip=192.168.8.173) 
(TrainTrainable pid=581, ip=192.168.8.173) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=581, ip=100.121.244.176, actor_id=5a3c8a5d9e25e895f3fa826a06000000, repr=TorchTrainer)
(TrainTrainable pid=581, ip=192.168.8.173)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
(TrainTrainable pid=581, ip=192.168.8.173)     self.setup(copy.deepcopy(self.config))
(TrainTrainable pid=581, ip=192.168.8.173)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
(TrainTrainable pid=581, ip=192.168.8.173)     setup_kwargs[k] = parameter_registry.get(prefix + k)
(TrainTrainable pid=581, ip=192.168.8.173)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
(TrainTrainable pid=581, ip=192.168.8.173)     return ray.get(self.references[k])
(TrainTrainable pid=581, ip=192.168.8.173) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0600000002e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(TrainTrainable pid=581, ip=192.168.8.173) 
(TrainTrainable pid=581, ip=192.168.8.173) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*06000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 192.168.16.197) for more information about the Python worker failure.
2023-10-30 08:49:17,465 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_d2bc9_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2549, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=581, ip=100.121.244.176, actor_id=5a3c8a5d9e25e895f3fa826a06000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0600000002e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*06000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 192.168.16.197) for more information about the Python worker failure.

Training errored after 0 iterations at 2023-10-30 08:49:17. Total running time: 25s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-30_08-48-51/TorchTrainer_d2bc9_00000_0_2023-10-30_08-48-52/error.txt

2023-10-30 08:49:17,470 ERROR tune.py:1139 -- Trials did not complete: [TorchTrainer_d2bc9_00000]
2023-10-30 08:49:17,474 WARNING experiment_analysis.py:205 -- Failed to fetch metrics for 1 trial(s):
- TorchTrainer_d2bc9_00000: FileNotFoundError('Could not fetch metrics for TorchTrainer_d2bc9_00000: both result.json and progress.csv were not found at /home/ray/ray_results/TorchTrainer_2023-10-30_08-48-51/TorchTrainer_d2bc9_00000_0_2023-10-30_08-48-52')
TuneError: Failure # 1 (occurred at 2023-10-30_08-49-17)
The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=581, ip=100.121.244.176, actor_id=5a3c8a5d9e25e895f3fa826a06000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0600000002e1f505. To see information about where this ObjectRef was created in Python, set the 
environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs 
(`/tmp/ray/session_latest/logs/*06000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 192.168.16.197) for more information about the Python worker failure.

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ray/hf.py:83 in <module>                                                                   │
│                                                                                                  │
│   80 ray_trainer = TorchTrainer(                                                                 │
│   81 │   train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)                   │
│   82 )                                                                                           │
│ ❱ 83 ray_trainer.fit()                                                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:668 in fit             │
│                                                                                                  │
│   665 │   │   if result.error:                                                                   │
│   666 │   │   │   # Raise trainable errors to the user with a message to restore                 │
│   667 │   │   │   # or configure `FailureConfig` in a new run.                                   │
│ ❱ 668 │   │   │   raise TrainingFailedError(                                                     │
│   669 │   │   │   │   "\n".join([restore_msg, TrainingFailedError._FAILURE_CONFIG_MSG])          │
│   670 │   │   │   ) from result.error                                                            │
│   671 │   │   return result                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but 
rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2023-10-30_08-48-51")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures
= -1` for unlimited retries.

Output of `/tmp/ray/session_latest/logs/06000000ffffffffffffffffffffffffffffffffffffffffffffffff`:

[2023-10-30 08:48:51,644 I 7500 7500] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 7500
[2023-10-30 08:48:51,646 I 7500 7500] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-10-30 08:48:51,652 I 7500 7500] grpc_server.cc:129: driver server started, listening on port 10006.
[2023-10-30 08:48:51,659 I 7500 7500] core_worker.cc:227: Initializing worker at address: 192.168.16.197:10006, worker ID 06000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 0e6da528aae9f90cc09fb22babb40360918121dddd614df018b77792
[2023-10-30 08:48:51,660 I 7500 7500] task_event_buffer.cc:190: Reporting task events to GCS every 1000ms.
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:570: Event stats:

Global stats: 7 total (4 active)
Queueing time: mean = 21.646 us, max = 100.401 us, min = 21.257 us, total = 151.520 us
Execution time:  mean = 31.595 us, total = 221.165 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 2 total (1 active, 1 running), CPU time: mean = 9.509 us, total = 19.018 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 167.147 us, total = 167.147 us
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 35.000 us, total = 35.000 us

-----------------
Task Event stats:

IO Service Stats:

Global stats: 2 total (1 active)
Queueing time: mean = 4.523 us, max = 9.046 us, min = 9.046 us, total = 9.046 us
Execution time:  mean = 6.109 us, total = 12.219 us
Event stats:
    CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    PeriodicalRunner.RunFnPeriodically - 1 total (0 active), CPU time: mean = 12.219 us, total = 12.219 us
Other Stats:
    grpc_in_progress:0
    current number of task events in buffer: 1
    total task events sent: 0 MiB
    total number of task events sent: 0
    num status task events dropped: 0
    num profile task events dropped: 0

[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:4262: Number of alive nodes:1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 7fb2a236e768fc2ea3f19adf867b211e10ca0af53406a58d024f6312, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 0e6da528aae9f90cc09fb22babb40360918121dddd614df018b77792, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 209e8c99aa57aca3e213ac1a924d87c369511a4bf7193d9bc03a6afd, IsAlive = 0
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:306: Node failure from 209e8c99aa57aca3e213ac1a924d87c369511a4bf7193d9bc03a6afd. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2023-10-30 08:48:51,662 I 7500 7500] event.cc:234: Set ray event level to warning
[2023-10-30 08:48:51,662 I 7500 7500] event.cc:342: Ray Event initialized for CORE_WORKER
[2023-10-30 08:48:52,365 I 7500 7500] core_worker.cc:2131: Submitting Placement Group creation to GCS: 6b1e37d945742c16a5b63039cb9906000000
[2023-10-30 08:48:52,680 I 7500 7597] direct_task_transport.cc:289: Connecting to raylet 7fb2a236e768fc2ea3f19adf867b211e10ca0af53406a58d024f6312
[2023-10-30 08:48:54,405 I 7500 7500] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 5a3c8a5d9e25e895f3fa826a06000000
[2023-10-30 08:48:54,409 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:48:54,410 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:48:54,410 I 7500 7597] actor_manager.cc:214: received notification on actor, state: PENDING_CREATION, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:49:17,327 I 7500 7597] direct_task_transport.cc:55: Actor creation failed and we will not be retrying the creation task, actor id = 5a3c8a5d9e25e895f3fa826a06000000, task id = ffffffffffffffff5a3c8a5d9e25e895f3fa826a06000000
[2023-10-30 08:49:17,327 I 7500 7597] actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: 100.121.244.176, port: 10006, worker_id: 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c, raylet_id: 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:49:17,328 I 7500 7597] direct_actor_task_submitter.cc:237: Connecting to actor 5a3c8a5d9e25e895f3fa826a06000000 at worker 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c
[2023-10-30 08:49:17,462 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: 100.121.244.176, port: 10006, worker_id: 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c, raylet_id: 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, num_restarts: 0, death context type=CreationTaskFailureContext
[2023-10-30 08:49:17,462 I 7500 7597] direct_actor_task_submitter.cc:287: Failing pending tasks for actor 5a3c8a5d9e25e895f3fa826a06000000 because the actor is already dead.
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:829: task 4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:845: No retries left for task 4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000, not going to resubmit.
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:903: Task failed: IOError: Fail all inflight tasks due to actor state change.: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.tune.trainable.util, class_name=with_parameters.<locals>._Inner, function_name=__ray_ready__, function_hash=}, task_id=4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000, task_name=_Inner.__ray_ready__, job_id=06000000, num_args=0, num_returns=1, depth=1, attempt_number=0, actor_task_spec={actor_id=5a3c8a5d9e25e895f3fa826a06000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff06000000, actor_counter=0}, runtime_env_hash=-898807589, eager_install=1, setup_timeout_seconds=600
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:718: Disconnecting to the raylet.
[2023-10-30 08:49:18,111 I 7500 7500] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:641: Shutting down a core worker.
[2023-10-30 08:49:18,111 I 7500 7500] task_event_buffer.cc:201: Shutting down TaskEventBuffer.
[2023-10-30 08:49:18,111 I 7500 7604] task_event_buffer.cc:183: Task event buffer io service stopped.
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:667: Disconnecting a GCS client.
[2023-10-30 08:49:18,111 I 7500 7597] core_worker.cc:884: Core worker main io service stopped.
[2023-10-30 08:49:18,112 I 7500 7500] core_worker.cc:671: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2023-10-30 08:49:18,118 I 7500 7500] core_worker.cc:684: Core worker ready to be deallocated.
[2023-10-30 08:49:18,118 I 7500 7500] core_worker.cc:632: Core worker is destructed
[2023-10-30 08:49:18,118 I 7500 7500] task_event_buffer.cc:201: Shutting down TaskEventBuffer.
[2023-10-30 08:49:18,119 W 7500 7607] server_call.h:324: [1] Not sending reply because executor stopped.
[2023-10-30 08:49:18,120 I 7500 7500] core_worker_process.cc:148: Destructing CoreWorkerProcessImpl. pid: 7500
[2023-10-30 08:49:18,120 I 7500 7500] io_service_pool.cc:47: IOServicePool is stopped.
[2023-10-30 08:49:18,165 I 7500 7500] stats.h:128: Stats module has shutdown.