ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] `ray status` not consistent when using placement groups #25881

Open JiahaoYao opened 2 years ago

JiahaoYao commented 2 years ago

What happened + What you expected to happen

The placement group bundles created by Ray Tune do not disappear after the program shuts down. There might be something wrong with the recalculation of resources. @matthewdeng @rkooo567 @scv119

The `ray status` output still shows (0 used of 0.0 reserved in placement groups):

======== Autoscaler status: 2022-06-17 03:12:41.288131 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 1 ray.worker.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU (0 used of 0.0 reserved in placement groups)
 0.0/2.0 GPU (0 used of 0.0 reserved in placement groups)
 0.0/2.0 accelerator_type:T4
 0.00/19.729 GiB memory
 0.00/8.753 GiB object_store_memory
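
For anyone hitting the same state, one way to check whether the stale reservation maps to a placement group that was never removed is to inspect the placement group table directly. This is a sketch of my own, not part of the original report:

import ray
from ray.util import placement_group_table

ray.init(address="auto")

# Every placement group the GCS still tracks. Bundles of any group whose
# state is not "REMOVED" stay reserved, which is what `ray status` reports
# as "reserved in placement groups".
for pg_id, info in placement_group_table().items():
    print(pg_id, info.get("state"), info.get("bundles"))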

Sometimes the reported values can even be negative:

[screenshot: `ray status` output showing negative values for the placement group reservations]

This is the output from https://github.com/ray-project/ray/blob/master/python/ray/scripts/scripts.py; the placement group bundles have not been eliminated.

ic| status: (b'{"load_metrics_report": {"usage": {"CPU_group_1_785075fe890d3fda0da8888caaba'
             b'01000000": [0.0, 1.0], "node:10.0.2.234_group_1_785075fe890d3fda0da8888caaba'
             b'01000000": [0.0, 0.001], "memory": [0.0, 21183402803.0], "node:10.0.2.234_gr'
             b'oup_1_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 0.001], "bundle_group_1_2'
             b'61bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 1000.0], "GPU_group_1_785075fe89'
             b'0d3fda0da8888caaba01000000": [0.0, 1.0], "accelerator_type:T4": [0.0, 2.0], '
             b'"GPU": [0.0, 2.0], "GPU_group_1_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0,'
             b' 1.0], "object_store_memory": [0.0, 9398036889.0], "bundle_group_785075fe890'
             b'd3fda0da8888caaba01000000": [1000.0, 3000.0], "CPU_group_1_261bb8020f7da0eba'
             b'f1e85a6ea7802000000": [0.0, 1.0], "CPU": [0.0, 8.0], "node:10.0.2.234_group_'
             b'0_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 0.001], "node:10.0.2.234": [0'
             b'.0, 1.0], "bundle_group_1_785075fe890d3fda0da8888caaba01000000": [0.0, 1000.'
             b'0], "CPU_group_0_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 1.0], "bundle_'
             b'group_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 3000.0], "bundle_group_0_'
             b'261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 1000.0], "node:10.0.2.195_group'
             b'_2_785075fe890d3fda0da8888caaba01000000": [0.0, 0.001], "CPU_group_2_785075f'
             b'e890d3fda0da8888caaba01000000": [0.0, 1.0], "GPU_group_2_785075fe890d3fda0da'
             b'8888caaba01000000": [0.0, 1.0], "GPU_group_2_261bb8020f7da0ebaf1e85a6ea78020'
             b'00000": [0.0, 1.0], "bundle_group_2_261bb8020f7da0ebaf1e85a6ea7802000000": ['
             b'0.0, 1000.0], "CPU_group_2_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 1.0]'
             b', "node:10.0.2.195_group_2_261bb8020f7da0ebaf1e85a6ea7802000000": [0.0, 0.00'
             b'1], "node:10.0.2.195": [0.0, 1.0], "bundle_group_2_785075fe890d3fda0da8888ca'
             b'aba01000000": [0.0, 1000.0]}, "resource_demand": [], "pg_demand": [], "reque'
             b'st_demand": [], "node_types": [[{"bundle_group_1_261bb8020f7da0ebaf1e85a6ea7'
             b'802000000": 1000.0, "accelerator_type:T4": 1.0, "memory": 9157494375.0, "nod'
             b'e:10.0.2.234_group_1_785075fe890d3fda0da8888caaba01000000": 0.001, "GPU_grou'
             b'p_1_261bb8020f7da0ebaf1e85a6ea7802000000": 1.0, "node:10.0.2.234": 1.0, "nod'
             b'e:10.0.2.234_group_1_261bb8020f7da0ebaf1e85a6ea7802000000": 0.001, "CPU": 4.'
             b'0, "CPU_group_1_785075fe890d3fda0da8888caaba01000000": 1.0, "bundle_group_1_'
             b'785075fe890d3fda0da8888caaba01000000": 1000.0, "CPU_group_1_261bb8020f7da0eb'
             b'af1e85a6ea7802000000": 1.0, "node:10.0.2.234_group_0_261bb8020f7da0ebaf1e85a'
             b'6ea7802000000": 0.001, "GPU": 1.0, "object_store_memory": 4578747187.0, "GPU'
             b'_group_1_785075fe890d3fda0da8888caaba01000000": 1.0, "bundle_group_261bb8020'
             b'f7da0ebaf1e85a6ea7802000000": 2000.0, "bundle_group_785075fe890d3fda0da8888c'
             b'aaba01000000": 2000.0, "CPU_group_0_261bb8020f7da0ebaf1e85a6ea7802000000": 1'
             b'.0, "bundle_group_0_261bb8020f7da0ebaf1e85a6ea7802000000": 1000.0}, 1], [{"b'
             b'undle_group_2_261bb8020f7da0ebaf1e85a6ea7802000000": 1000.0, "accelerator_ty'
             b'pe:T4": 1.0, "bundle_group_261bb8020f7da0ebaf1e85a6ea7802000000": 1000.0, "o'
             b'bject_store_memory": 4819289702.0, "bundle_group_2_785075fe890d3fda0da8888ca'
             b'aba01000000": 1000.0, "GPU": 1.0, "node:10.0.2.195_group_2_261bb8020f7da0eba'
             b'f1e85a6ea7802000000": 0.001, "node:10.0.2.195_group_2_785075fe890d3fda0da888'
             b'8caaba01000000": 0.001, "CPU": 4.0, "GPU_group_2_785075fe890d3fda0da8888caab'
             b'a01000000": 1.0, "bundle_group_785075fe890d3fda0da8888caaba01000000": 1000.0'
             b', "memory": 12025908428.0, "GPU_group_2_261bb8020f7da0ebaf1e85a6ea7802000000'
             b'": 1.0, "node:10.0.2.195": 1.0, "CPU_group_2_261bb8020f7da0ebaf1e85a6ea78020'
             b'00000": 1.0, "CPU_group_2_785075fe890d3fda0da8888caaba01000000": 1.0}, 1]], '
             b'"head_ip": null}, "time": 1655435561.2881305, "monitor_pid": 5394, "autoscal'
             b'er_report": {"active_nodes": {"ray.head.default": 1, "ray.worker.default": 1'
             b'}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}')
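
To see at a glance which group-scoped resources are left behind in a dump like the one above, a small helper can filter the usage dictionary. This is my own sketch and assumes the decoded load_metrics_report JSON has been saved to a (hypothetical) status.json file:

import json

# status.json is a hypothetical file containing the decoded JSON payload above.
with open("status.json") as f:
    report = json.load(f)["load_metrics_report"]

# Placement group bundles register resources with a "_group_" infix, e.g.
# "CPU_group_1_<pg id>". Each value is a [used, total] pair; entries that
# survive after the job exits point at bundles that were never cleaned up.
leftover = {k: v for k, v in report["usage"].items() if "_group_" in k}
for name, (used, total) in sorted(leftover.items()):
    print(f"{name}: {used} used of {total} reserved")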

Versions / Dependencies

ray nightly

Reproduction script

I was using Alpa + Ray Tune (https://github.com/alpa-projects/alpa/issues/508) to run the code; this Alpa issue (https://github.com/alpa-projects/alpa/issues/521) launches 7 Ray workers on 2 GPU nodes.

import numpy as np
import ray
from ray import tune

import time

from absl import logging
import ml_collections
import optax

import argparse

ray.init("auto")
# ray.init()

def train_and_evaluate(config):
    """Execute the model training and evaluation loop for one Tune trial.

    Args:
      config: Hyperparameter configuration for training and evaluation.
    """
    config = argparse.Namespace(**config)

    import os

    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import alpa

    alpa.init("ray")

    from flax import linen as nn
    from flax.metrics import tensorboard
    from flax.training import train_state
    import jax.numpy as jnp
    import jax

    # alpa.util.update_jax_platform('cpu')

    print(ray._private.services.get_node_ip_address())
    print(jax.devices())

    class CNN(nn.Module):
        """A simple CNN model."""

        @nn.compact
        def __call__(self, x):
            x = x.reshape((x.shape[0], -1))  # flatten
            x = nn.Dense(features=256)(x)
            x = nn.relu(x)
            x = nn.Dense(features=10)(x)
            return x

    @jax.jit
    def apply_model(state, images, labels):
        """Computes gradients, loss and accuracy for a single batch."""

        def loss_fn(params):
            logits = CNN().apply({"params": params}, images)
            one_hot = jax.nn.one_hot(labels, 10)
            loss = jnp.mean(optax.softmax_cross_entropy(logits=logits, labels=one_hot))
            return loss, logits

        grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
        (loss, logits), grads = grad_fn(state.params)
        accuracy = jnp.mean(jnp.argmax(logits, -1) == labels)
        return grads, loss, accuracy

    @jax.jit
    def update_model(state, grads):
        return state.apply_gradients(grads=grads)

    @alpa.parallelize
    def train_step(state, images, labels):
        """Computes gradients, loss and accuracy for a single batch."""

        def loss_fn(params):
            logits = CNN().apply({"params": params}, images)
            one_hot = jax.nn.one_hot(labels, 10)
            loss = jnp.mean(optax.softmax_cross_entropy(logits=logits, labels=one_hot))
            return loss, logits

        grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
        (loss, logits), grads = grad_fn(state.params)
        accuracy = jnp.mean(jnp.argmax(logits, -1) == labels)
        state = state.apply_gradients(grads=grads)
        return state, loss, accuracy

    def train_epoch(state, train_ds, batch_size, rng):
        """Train for a single epoch."""
        train_ds_size = len(train_ds["image"])
        steps_per_epoch = train_ds_size // batch_size

        epoch_loss = []
        epoch_accuracy = []

        for i in range(steps_per_epoch):
            batch_images = train_ds["image"][i * batch_size : (i + 1) * batch_size]
            batch_labels = train_ds["label"][i * batch_size : (i + 1) * batch_size]
            state, loss, accuracy = train_step(state, batch_images, batch_labels)
            epoch_loss.append(loss)
            epoch_accuracy.append(accuracy)
        train_loss = np.mean(epoch_loss)
        train_accuracy = np.mean(epoch_accuracy)
        return state, train_loss, train_accuracy

    def get_datasets():
        """Load MNIST train and test datasets into memory."""
        import tensorflow as tf
        import tensorflow_datasets as tfds

        tf.config.experimental.set_visible_devices([], "GPU")
        ds_builder = tfds.builder("mnist")
        ds_builder.download_and_prepare()
        train_ds = tfds.as_numpy(ds_builder.as_dataset(split="train", batch_size=-1))
        test_ds = tfds.as_numpy(ds_builder.as_dataset(split="test", batch_size=-1))
        train_ds["image"] = np.float32(train_ds["image"]) / 255.0
        test_ds["image"] = np.float32(test_ds["image"]) / 255.0
        train_ds["label"] = np.int32(train_ds["label"])
        test_ds["label"] = np.int32(test_ds["label"])
        return train_ds, test_ds

    def create_train_state(rng, config):
        """Creates initial `TrainState`."""
        cnn = CNN()
        params = cnn.init(rng, jnp.ones([1, 28, 28, 1]))["params"]
        tx = optax.sgd(config.learning_rate, config.momentum)
        return train_state.TrainState.create(apply_fn=cnn.apply, params=params, tx=tx)

    train_ds, test_ds = get_datasets()
    rng = jax.random.PRNGKey(0)
    rng, init_rng = jax.random.split(rng)
    state = create_train_state(init_rng, config)

    for epoch in range(1, config.num_epochs + 1):
        rng, input_rng = jax.random.split(rng)
        tic = time.time()
        state, train_loss, train_accuracy = train_epoch(
            state, train_ds, config.batch_size, input_rng
        )
        epoch_time = time.time() - tic
        test_loss = test_accuracy = 0.0
        logging.info(
            "epoch:% 3d, train_loss: %.4f, train_accuracy: %.2f, epoch_time: %.3f"
            % (epoch, train_loss, train_accuracy * 100, epoch_time)
        )
        print(
            "epoch:% 3d, train_loss: %.4f, train_accuracy: %.2f, epoch_time: %.3f"
            % (epoch, train_loss, train_accuracy * 100, epoch_time)
        )

    tune.report(train_loss=train_loss)

search_space = {
    "learning_rate": 0.1,
    "momentum": tune.uniform(0.1, 0.9),
    "batch_size": 8192,
    "num_epochs": 100,
}

ips = list(filter(lambda x: "node:" in x, ray.available_resources().keys()))
resources_pg = [{"CPU": 1, ips[0]: 0.001}]
for ip in ips:
    # resources_pg.append({"CPU": 1, "GPU": 1, ip: 0.001})
    resources_pg.append({"CPU": 1, "GPU": 1})

analysis = tune.run(
    train_and_evaluate,
    config=search_space,
    resources_per_trial=tune.PlacementGroupFactory(resources_pg),
)
print(analysis.results)
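
As a rough post-run check (my own addition, not part of the original repro), one could look for leftover group-scoped resource keys once tune.run returns:

# After tune.run returns and the trial placement groups are removed, no
# "*_group_*" resources should remain registered in the cluster.
stale = [k for k in ray.cluster_resources() if "_group_" in k]
print("stale placement group resources:", stale)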

Issue Severity

Medium: It is a significant difficulty but I can work around it.

zhe-thoughts commented 2 years ago

@richardliaw could you help triage this one? I wasn't able to figure out whether it's a Core issue or a Tune issue.

richardliaw commented 2 years ago

Believe this is a core issue.

cadedaniel commented 1 year ago

@rkooo567 can you add context on why this is a release blocker?

cadedaniel commented 1 year ago

We should attempt reproduction once Alex's autoscaler changes are made. That will tell us whether it's in the autoscaler or in core.

rkooo567 commented 1 year ago

I think this is a real core leak issue, but no one has been able to find a repro so far...

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.