ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Train] [Tune] When using Train with Tune, a `logdir` is created that's not the one specified by the user #25474

Open · VishDev12 opened this issue 2 years ago

VishDev12 commented 2 years ago

What happened + What you expected to happen

When initializing a Ray `Trainer`, we provide a `logdir` argument, and the `Trainer`'s `__init__` method stores it as the `logdir` instance attribute.

Then, when creating a Trainable with `Trainer.to_tune_trainable()`, it in turn calls `_create_tune_trainable()`, which does not use `self.logdir`. So when `tune_function` is defined inside `_create_tune_trainable` with a `Trainer` initialization call, no `logdir` is passed to it, and this inner `Trainer` ends up creating its own `logdir` under the default path `~/ray_results`.

https://github.com/ray-project/ray/blob/7f1bacc7dc9caf6d0ec042e39499bbf1d9a7d065/python/ray/train/trainer.py#L828-L843

This could be solved by passing `self.logdir` along to `_create_tune_trainable` and using it in the `Trainer` initialization.
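
A minimal sketch of that change, as it might appear inside ray/train/trainer.py (where `Trainer` is defined); the parameter list and body here are simplified and do not match the actual `_create_tune_trainable` in the linked source:

# Illustrative sketch only; the real _create_tune_trainable has a different
# signature and a fuller body (see the permalink above).
def _create_tune_trainable(train_func, backend, num_workers, use_gpu, logdir=None):
    def tune_function(config, checkpoint_dir=None):
        trainer = Trainer(
            backend=backend,
            num_workers=num_workers,
            use_gpu=use_gpu,
            logdir=logdir,  # forward the user-specified logdir instead of letting
                            # this inner Trainer default to ~/ray_results
        )
        # ... run train_func on the trainer and report the results to Tune ...

    return tune_function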

Additionally, there's the issue of the non-customizable run directory that Ray Train creates as `run_<run_id>`. But since this directory is unused, it's not too much of an issue, and I believe it's partially being tracked here: https://github.com/ray-project/ray/issues/20807

Versions / Dependencies

Python: 3.8.13
Ray: 1.12.1
OS: Ubuntu 18.04.5

Reproduction script

This isn't a full reproduction script, but it should serve as an indicator of the variables being passed in.

from ray import tune
from ray.train import Trainer

# logdir, train_func, config, experiment_name, local_dir, and the two
# *_creator callables below are user-defined (see the Notes section).
trainer = Trainer(
    backend="torch",
    num_workers=2,
    logdir=logdir,
)

trainable = trainer.to_tune_trainable(train_func)

tune.run(
    trainable,
    name=experiment_name,
    config=config,
    trial_name_creator=tune_trial_name_creator,
    trial_dirname_creator=tune_trial_dirname_creator,
    local_dir=local_dir,
    verbose=0,
)

Notes

  1. The tune_trial_name_creator function generates a unique name and stores it as trial_name in the config of the ray.tune.trial.Trial object that's passed in.
  2. The tune_trial_dirname_creator function simply returns the trial_name read back from the Trial object, so the directory name is exactly the trial_name (see the sketch after this list).
  3. The trial directory used by Tune is thus: local_dir/experiment_name/trial_name.
  4. The logdir passed to the Trainer is local_dir/experiment_name.
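
A hedged sketch of what these two helpers could look like; the function names and the use of the Trial config come from the notes above, but the implementation details here are illustrative, not the actual project code:

import uuid

from ray.tune.trial import Trial


def tune_trial_name_creator(trial: Trial) -> str:
    # Generate a unique name and stash it in the trial's config (Note 1).
    trial_name = f"trial_{uuid.uuid4().hex[:8]}"
    trial.config["trial_name"] = trial_name
    return trial_name


def tune_trial_dirname_creator(trial: Trial) -> str:
    # Reuse the exact trial_name as the trial directory name (Note 2).
    return trial.config["trial_name"]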

Issue Severity

Low

peytondmurray commented 2 years ago

Hi @VishDev12, thanks for reporting this, and for the detailed description of the issue. I'd be happy to look into implementing the change needed to propagate the requested logging directory so that the logs appear where the user expects to see them.

JiahaoYao commented 2 years ago

I am able to reproduce the issue:

import argparse

import numpy as np
import torch
import torch.nn as nn
import ray.train as train
from ray.train import Trainer
from ray.train.callbacks import JsonLoggerCallback, TBXLoggerCallback

class LinearDataset(torch.utils.data.Dataset):
    """y = a * x + b"""

    def __init__(self, a, b, size=1000):
        x = np.arange(0, 10, 10 / size, dtype=np.float32)
        self.x = torch.from_numpy(x)
        self.y = torch.from_numpy(a * x + b)

    def __getitem__(self, index):
        return self.x[index, None], self.y[index, None]

    def __len__(self):
        return len(self.x)

def train_epoch(dataloader, model, loss_fn, optimizer):
    for X, y in dataloader:
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def validate_epoch(dataloader, model, loss_fn):
    num_batches = len(dataloader)
    model.eval()
    loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            loss += loss_fn(pred, y).item()
    loss /= num_batches
    import copy

    model_copy = copy.deepcopy(model)
    result = {"model": model_copy.cpu().state_dict(), "loss": loss}
    return result

def train_func(config):
    data_size = config.get("data_size", 1000)
    val_size = config.get("val_size", 400)
    batch_size = config.get("batch_size", 32)
    hidden_size = config.get("hidden_size", 1)
    lr = config.get("lr", 1e-2)
    epochs = config.get("epochs", 3)

    train_dataset = LinearDataset(2, 5, size=data_size)
    val_dataset = LinearDataset(2, 5, size=val_size)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
    validation_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)

    train_loader = train.torch.prepare_data_loader(train_loader)
    validation_loader = train.torch.prepare_data_loader(validation_loader)

    model = nn.Linear(1, hidden_size)
    model = train.torch.prepare_model(model)

    loss_fn = nn.MSELoss()

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    results = []

    for _ in range(epochs):
        train_epoch(train_loader, model, loss_fn, optimizer)
        result = validate_epoch(validation_loader, model, loss_fn)
        train.report(**result)
        results.append(result)

    return results

def train_linear(num_workers=2, use_gpu=False, epochs=3):
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu, logdir='abc')  # custom logdir; not propagated to the Trainable (see logs below)
    config = {"lr": 1e-2, "hidden_size": 1, "batch_size": 4, "epochs": epochs}  # note: unused; search_space is what gets passed to tune.run
    from ray import tune
    trainable = trainer.to_tune_trainable(train_func)

    search_space = {
        "lr": tune.sample_from(lambda spec: 10 ** (-10 * np.random.rand())),
        "momentum": tune.uniform(0.1, 0.9),
    }

    analysis = tune.run(
        trainable,
        name='github_issue',
        config=search_space,
        num_samples=10,
        verbose=1,
    )

    print(analysis.results_df)
    return 0

import ray
ray.init('auto', ignore_reinit_error=True)
train_linear(
    num_workers=2, use_gpu=False, epochs=10
)

The logs are:

(TrainTrainable pid=10125) 2022-06-04 11:58:26,340  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-26
(TrainTrainable pid=10193) 2022-06-04 11:58:28,557  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-28
(TrainTrainable pid=10125) 2022-06-04 11:58:29,842  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-26/run_001
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,818 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,821 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
WARNING:root:NaN or Inf found in input tensor.
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,864 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,865 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,865 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,865 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10193) 2022-06-04 11:58:32,657  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-28/run_001
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,642 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,679 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,680 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,642 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,679 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,679 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10464) 2022-06-04 11:58:35,653  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-35
(TrainTrainable pid=10475) 2022-06-04 11:58:36,656  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-36
(TrainTrainable pid=10464) 2022-06-04 11:58:39,492  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-35/run_001
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,478 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,415 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,525 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,525 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,522 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,522 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:40,462 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:40,467 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10475) 2022-06-04 11:58:41,488  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-36/run_001
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:41,509 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:41,510 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:41,509 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:41,510 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10769) 2022-06-04 11:58:43,806  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-43
(TrainTrainable pid=10813) 2022-06-04 11:58:45,894  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-45
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:47,244 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:47,252 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10769) 2022-06-04 11:58:48,266  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-43/run_001
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:48,300 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:48,301 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:48,300 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:48,301 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,109 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10813) 2022-06-04 11:58:49,146  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-45/run_001
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,132 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,166 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,167 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,166 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,166 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=11092) 2022-06-04 11:58:53,164  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53
(TrainTrainable pid=11101) 2022-06-04 11:58:53,865  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,633 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,699 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(TrainTrainable pid=11092) 2022-06-04 11:58:56,741  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53/run_001
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,783 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,784 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,775 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,775 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,128 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11101) 2022-06-04 11:58:57,226  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53/run_001
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,162 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,246 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,247 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,246 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,246 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=11432) 2022-06-04 11:59:02,095  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02
(TrainTrainable pid=11434) 2022-06-04 11:59:02,090  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,424 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,418 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11434) 2022-06-04 11:59:05,453  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02/run_001
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,473 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,474 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,473 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,473 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:05,458 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:05,460 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11432) 2022-06-04 11:59:06,473  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02/run_001
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:06,494 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:06,494 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:06,493 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:06,494 INFO torch.py:135 -- Wrapping provided model in DDP.
2022-06-04 11:59:07,584 INFO tune.py:741 -- Total run time: 43.99 seconds (43.81 seconds for the tuning loop).
                   loss  _timestamp  _time_this_iter_s  _training_iteration  \
trial_id                                                                      
4efbb_00000         NaN  1654369109           0.023204                    3   
4efbb_00001  104.306430  1654369112           0.027761                    3   
4efbb_00002    3.784865  1654369119           0.025730                    3   
4efbb_00003  541.657377  1654369121           0.043916                    3   
4efbb_00004  125.506339  1654369128           0.026837                    3   
4efbb_00005  277.940789  1654369129           0.022659                    3   
4efbb_00006    9.672665  1654369136           0.034569                    3   
4efbb_00007  137.866315  1654369137           0.022057                    3   
4efbb_00008  481.795680  1654369146           0.026306                    3   
4efbb_00009    5.880573  1654369145           0.021298                    3   

             time_this_iter_s  done timesteps_total episodes_total  \
trial_id                                                             
4efbb_00000          0.022866  True            None           None   
4efbb_00001          0.027816  True            None           None   
4efbb_00002          0.025495  True            None           None   
4efbb_00003          0.041988  True            None           None   
4efbb_00004          0.042421  True            None           None   
4efbb_00005          0.027935  True            None           None   
4efbb_00006          0.033077  True            None           None   
4efbb_00007          0.021848  True            None           None   
4efbb_00008          0.025594  True            None           None   
4efbb_00009          0.021665  True            None           None   

             training_iteration                     experiment_id  ...  \
trial_id                                                           ...   
4efbb_00000                   3  e9dcf570f30341c39d6e57b29f0584e1  ...   
4efbb_00001                   3  8d1431dbfd1143eaa99e0840c77e0fb2  ...   
4efbb_00002                   3  cea3d540ae0f4882a0c30e275cea25dd  ...   
4efbb_00003                   3  aa11d27f351b42748d9780b997e45765  ...   
4efbb_00004                   3  49d43eaefef3460bb444ac740690e5f5  ...   
4efbb_00005                   3  1134263520cc49039d4d7575352cbe84  ...   
4efbb_00006                   3  442ddfcfbd7145368535b8fa54fed340  ...   
4efbb_00007                   3  560212d41c0a429b863ced7e677ba36f  ...   
4efbb_00008                   3  0cf95a9f00f841a0925e7b2feba37238  ...   
4efbb_00009                   3  dfe055744a0f4c7fad704743d6fd4922  ...   

                  node_ip  time_since_restore  timesteps_since_restore  \
trial_id                                                                 
4efbb_00000  172.31.85.84            3.598558                        0   
4efbb_00001  172.31.85.84            4.199704                        0   
4efbb_00002  172.31.85.84            3.957810                        0   
4efbb_00003  172.31.85.84            4.955879                        0   
4efbb_00004  172.31.85.84            4.603604                        0   
4efbb_00005  172.31.85.84            3.345599                        0   
4efbb_00006  172.31.85.84            3.723681                        0   
4efbb_00007  172.31.85.84            3.460354                        0   
4efbb_00008  172.31.85.84            4.490146                        0   
4efbb_00009  172.31.85.84            3.452429                        0   

             iterations_since_restore warmup_time  \
trial_id                                            
4efbb_00000                         3    0.003227   
4efbb_00001                         3    0.002933   
4efbb_00002                         3    0.003157   
4efbb_00003                         3    0.003367   
4efbb_00004                         3    0.002966   
4efbb_00005                         3    0.003413   
4efbb_00006                         3    0.004617   
4efbb_00007                         3    0.003315   
4efbb_00008                         3    0.002941   
4efbb_00009                         3    0.003645   

                          experiment_tag  model/module.weight  \
trial_id                                                        
4efbb_00000  0_lr=0.3698,momentum=0.1422      [[tensor(nan)]]   
4efbb_00001  1_lr=0.0000,momentum=0.7394   [[tensor(0.9652)]]   
4efbb_00002  2_lr=0.0121,momentum=0.1462   [[tensor(2.3552)]]   
4efbb_00003  3_lr=0.0000,momentum=0.3054  [[tensor(-0.9554)]]   
4efbb_00004  4_lr=0.0001,momentum=0.1023   [[tensor(0.9996)]]   
4efbb_00005  5_lr=0.0000,momentum=0.3155   [[tensor(0.0530)]]   
4efbb_00006  6_lr=0.0007,momentum=0.7753   [[tensor(2.4362)]]   
4efbb_00007  7_lr=0.0002,momentum=0.6102   [[tensor(0.7600)]]   
4efbb_00008  8_lr=0.0000,momentum=0.5501  [[tensor(-0.7799)]]   
4efbb_00009  9_lr=0.0007,momentum=0.1651   [[tensor(2.2419)]]   

             model/module.bias     config/lr  config/momentum  
trial_id                                                       
4efbb_00000      [tensor(nan)]  3.698384e-01         0.142184  
4efbb_00001   [tensor(0.9858)]  5.022303e-09         0.739362  
4efbb_00002   [tensor(1.4479)]  1.212922e-02         0.146240  
4efbb_00003  [tensor(-0.1643)]  3.371647e-06         0.305439  
4efbb_00004  [tensor(-0.2717)]  6.919282e-05         0.102255  
4efbb_00005   [tensor(0.1515)]  1.201724e-08         0.315512  
4efbb_00006  [tensor(-0.1940)]  7.227677e-04         0.775255  
4efbb_00007   [tensor(0.7147)]  1.679175e-04         0.610192  
4efbb_00008   [tensor(0.0790)]  8.951328e-06         0.550088  
4efbb_00009   [tensor(1.3662)]  7.268414e-04         0.165074  

The experiment files are:

(base) ray@ip-172-31-85-84:~/workspace-project-JimFixGithubIssue$ ls /home/ray/ray_results/github_issue
 basic-variant-state-2022-06-04_11-53-42.json
 basic-variant-state-2022-06-04_11-54-56.json
 basic-variant-state-2022-06-04_11-56-35.json
 basic-variant-state-2022-06-04_11-57-11.json
 basic-variant-state-2022-06-04_11-58-23.json
 experiment_state-2022-06-04_11-53-42.json
 experiment_state-2022-06-04_11-54-56.json
 experiment_state-2022-06-04_11-56-35.json
 experiment_state-2022-06-04_11-57-11.json
 experiment_state-2022-06-04_11-58-23.json
'tune_function_0e446_00000_0_lr=0.0012,momentum=0.2139_2022-06-04_11-56-35'
'tune_function_23e53_00000_0_lr=0.0000,momentum=0.3688_2022-06-04_11-57-11'
'tune_function_4efbb_00000_0_lr=0.3698,momentum=0.1422_2022-06-04_11-58-23'
'tune_function_4efbb_00001_1_lr=0.0000,momentum=0.7394_2022-06-04_11-58-26'
'tune_function_4efbb_00002_2_lr=0.0121,momentum=0.1462_2022-06-04_11-58-33'
'tune_function_4efbb_00003_3_lr=0.0000,momentum=0.3054_2022-06-04_11-58-34'
'tune_function_4efbb_00004_4_lr=0.0001,momentum=0.1023_2022-06-04_11-58-41'
'tune_function_4efbb_00005_5_lr=0.0000,momentum=0.3155_2022-06-04_11-58-43'
'tune_function_4efbb_00006_6_lr=0.0007,momentum=0.7753_2022-06-04_11-58-50'
'tune_function_4efbb_00007_7_lr=0.0002,momentum=0.6102_2022-06-04_11-58-51'
'tune_function_4efbb_00008_8_lr=0.0000,momentum=0.5501_2022-06-04_11-58-59'
'tune_function_4efbb_00009_9_lr=0.0007,momentum=0.1651_2022-06-04_11-58-59'
'tune_function_a728b_00000_0_lr=0.0000,momentum=0.1061_2022-06-04_11-53-43'
'tune_function_d3662_00000_0_lr=0.0000,momentum=0.8638_2022-06-04_11-54-56'
(base) ray@ip-172-31-85-84:~/workspace-project-JimFixGithubIssue$ ls /home/ray/ray_results/abc/

JiahaoYao commented 2 years ago

(TrainTrainable pid=18092) 2022-06-04 12:15:39,771  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18132) 2022-06-04 12:15:42,043  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18092) /home/ray/ray_results/abc None
(TrainTrainable pid=18092) 2022-06-04 12:15:42,882  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,866 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,908 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,909 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,868 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,909 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,910 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:45,269 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:45,241 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(TrainTrainable pid=18132) /home/ray/ray_results/abc None
(TrainTrainable pid=18132) 2022-06-04 12:15:46,290  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:46,311 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:46,311 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:46,311 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:46,311 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18355) 2022-06-04 12:15:46,722  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18418) 2022-06-04 12:15:49,573  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18355) /home/ray/ray_results/abc None
(TrainTrainable pid=18355) 2022-06-04 12:15:49,796  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,769 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,824 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,825 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,722 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,828 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,828 INFO torch.py:135 -- Wrapping provided model in DDP.
WARNING:root:NaN or Inf found in input tensor.
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:53,003 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:53,065 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18595) 2022-06-04 12:15:53,956  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18418) /home/ray/ray_results/abc None
(TrainTrainable pid=18418) 2022-06-04 12:15:54,018  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:54,038 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:54,038 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:54,038 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:54,038 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:56,994 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:57,028 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18731) 2022-06-04 12:15:57,640  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18595) 2022-06-04 12:15:58,014  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=18595) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:58,053 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:58,054 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:58,053 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:58,054 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18731) /home/ray/ray_results/abc None
(TrainTrainable pid=18731) 2022-06-04 12:16:01,072  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,057 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,092 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,092 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,037 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,092 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,092 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18886) 2022-06-04 12:16:01,674  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19020) 2022-06-04 12:16:04,661  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:04,733 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:04,736 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18886) 2022-06-04 12:16:05,753  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=18886) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:05,792 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:05,793 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:05,792 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:05,793 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:08,105 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:08,110 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=19020) /home/ray/ray_results/abc None
(TrainTrainable pid=19020) 2022-06-04 12:16:09,121  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:09,141 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:09,141 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:09,141 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:09,141 INFO torch.py:135 -- Wrapping provided model in DDP.
WARNING:root:NaN or Inf found in input tensor.
(TrainTrainable pid=19224) 2022-06-04 12:16:09,572  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19310) 2022-06-04 12:16:12,730  INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19224) /home/ray/ray_results/abc None
(TrainTrainable pid=19224) 2022-06-04 12:16:12,816  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,802 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,836 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,837 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,802 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,837 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,837 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:15,886 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:15,907 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=19310) 2022-06-04 12:16:16,906  INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=19310) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:16,926 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:16,927 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:16,927 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:16,927 INFO torch.py:135 -- Wrapping provided model in DDP.
2022-06-04 12:16:18,021 INFO tune.py:741 -- Total run time: 42.32 seconds (42.07 seconds for the tuning loop).

And the run_id is also an issue: runs might overwrite each other when given the same directory name.

amogkam commented 2 years ago

Thanks for reporting this @VishDev12! @VishDev12 @peytondmurray would one of you be willing to make a PR to pass the correct variable through? I or @JiahaoYao would be happy to help shepherd it in!

peytondmurray commented 2 years ago

@amogkam I'd be happy to make a PR here, but after I posted my comment above it looked like @JiahaoYao made a PR before me: https://github.com/ray-project/ray/pull/25483. Let me know how you'd like to proceed - if you still need a PR, I'd be more than willing to provide it.

JiahaoYao commented 2 years ago

Hi @VishDev12 @peytondmurray @amogkam

It seems that, at the beginning, Ray Train creates the folder:

(TrainTrainable pid=38829) 2022-06-06 18:10:38,717      INFO trainer.py:244 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-06_18-10-38
(TrainTrainable pid=38829) 2022-06-06 18:10:43,014      INFO trainer.py:250 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-06_18-10-38/run_001

And, at the end:


== Status ==
Current time: 2022-06-06 18:10:55 (running for 00:00:22.07)
Memory usage on this node: 7.9/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/4 GPUs, 0.0/120.81 GiB heap, 0.0/55.77 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /home/ray/ray_results/tune_function_2022-06-06_18-10-32
Number of trials: 1/1 (1 TERMINATED)

And it seems that all the data is redirected to the Tune directory.

So, instead of having Ray Train create its own directory, would you be okay with disabling the Train directory creation when the Trainer is passed to Ray Tune? @VishDev12 @peytondmurray

VishDev12 commented 2 years ago

That sounds good to me! Even if the fix is made to pass the logdir to _create_tune_trainable, there would still be an unused run_001 folder created inside the logdir. So disabling the folder creation completely for the Trainer inside _create_tune_trainable => tune_function would be a perfect solution, if that's possible.
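
For illustration only, reusing the simplified _create_tune_trainable sketch from the issue description above; neither that signature nor a create_logdir flag exists in ray.train.Trainer today, so this is purely hypothetical:

# Purely hypothetical: the simplified signature and the create_logdir flag
# below do not exist in Ray; they only illustrate the idea being discussed.
def _create_tune_trainable(train_func, backend, num_workers, use_gpu):
    def tune_function(config, checkpoint_dir=None):
        trainer = Trainer(
            backend=backend,
            num_workers=num_workers,
            use_gpu=use_gpu,
            create_logdir=False,  # skip creating <logdir>/run_<run_id>; Tune already
                                  # owns the trial directory, so it would go unused
        )
        # ... run train_func on the trainer and report the results to Tune ...

    return tune_function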

JiahaoYao commented 2 years ago

Thanks @VishDev12, sgtm!

richardliaw commented 1 year ago

This isn't fixed, right?