ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray

Trials hang when using a scheduler #253

Open dcfidalgo opened 1 year ago

dcfidalgo commented 1 year ago

Hi there! I first ran into this issue when trying to run PBT on a multi-node DDP setup (4 GPUs per node, each node is a population member), but I could not reproduce it consistently. I have now managed to reproduce the same behavior with an ASHA scheduler: as soon as the ASHA scheduler terminates a trial, the subsequent trials simply hang in the RUNNING status and never terminate.

== Status ==
Current time: 2023-03-17 10:12:33 (running for 00:00:41.50)
Memory usage on this node: 154.0/250.9 GiB 
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: -1.25
Resources requested: 3.0/4 CPUs, 0/0 GPUs, 0.0/64.44 GiB heap, 0.0/31.61 GiB objects
Result logdir: /dcfidalgo/ray_results/train_func_2023-03-17_10-11-51
Number of trials: 3/3 (1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+------------+--------+------------------+------------+
| Trial name             | status     | loc                 |   val_loss |   iter |   total time (s) |   val_loss |
|------------------------+------------+---------------------+------------+--------+------------------+------------|
| train_func_c1436_00002 | RUNNING    | 10.181.103.72:74356 |          3 |        |                  |            |
| train_func_c1436_00000 | TERMINATED | 10.181.103.72:74356 |          1 |      1 |          6.91809 |          1 |
| train_func_c1436_00001 | TERMINATED | 10.181.103.72:74356 |          2 |      1 |          6.20699 |          2 |
+------------------------+------------+---------------------+------------+--------+------------------+------------+

I was able to trace the issue back to a hanging ray.get call when trying to fetch self._master_addr here, but I simply cannot figure out what the underlying cause is.
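In case it helps with debugging: a generic diagnostic (not the actual ray_lightning code; the actor and method names below are made up) is to give the suspicious ray.get a timeout, so the hang surfaces as an exception instead of blocking forever:

import time

import ray
from ray.exceptions import GetTimeoutError

@ray.remote
class Worker:
    def get_master_addr(self):
        # Stand-in for the remote call that never resolves during the hang.
        time.sleep(3600)
        return "10.181.103.72"

worker = Worker.remote()
try:
    addr = ray.get(worker.get_master_addr.remote(), timeout=30)
except GetTimeoutError:
    print("get_master_addr did not return within 30s -- the worker seems stuck")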

A minimal script to reproduce the issue:

import torch
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

from ray_lightning import RayStrategy
from ray_lightning.tests.utils import BoringModel, get_trainer
from ray_lightning.tune import TuneReportCallback, get_tune_resources

# BoringModel variant that always logs the same, fixed validation loss,
# so ASHA ranks the trials purely by the configured val_loss value.
class AnotherBoringModel(BoringModel):
    def __init__(self, val_loss: float):
        super().__init__()
        self._val_loss = torch.tensor(val_loss)

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self._val_loss)
        return {"x": self._val_loss}

address_info = ray.init(num_cpus=4)

# Two Ray workers per trial, CPU only; the TuneReportCallback reports the
# logged metrics back to Tune at the end of every validation.
strategy = RayStrategy(num_workers=2, use_gpu=False)
callbacks = [TuneReportCallback(on="validation_end")]

def train_func(config):
    model = AnotherBoringModel(config["val_loss"])
    trainer = get_trainer(
        "./",
        callbacks=callbacks,
        strategy=strategy,
        checkpoint_callback=False,
        max_epochs=1)
    trainer.fit(model)

# Three trials via grid search; once ASHA stops a trial, the remaining
# trials hang in RUNNING (see the status output above).
tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
    scheduler=AsyncHyperBandScheduler(metric="val_loss", mode="min")
)

If you remove the scheduler, the above script terminates without issues.
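For reference, the control variant is literally the same call with the scheduler argument dropped; this run finishes all three trials normally:

# Same tune.run call as above, just without the ASHA scheduler.
tune.run(
    train_func,
    config={"val_loss": tune.grid_search([1., 2., 3.])},
    resources_per_trial=get_tune_resources(
        num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
    num_samples=1,
)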

The corresponding conda environment:

name: schedulerbug
channels:
  - pytorch
dependencies:
  - python=3.9
  - pytorch==1.11.0
  - cpuonly
  - pip
  - pip:
    - pytorch-lightning==1.6.4
    - ray[tune]==2.3.0
    - git+https://github.com/ray-project/ray_lightning.git@main

Is anyone else experiencing the same issue? Any kind of help would be very much appreciated! :smiley: Have a great day!