Open VishDev12 opened 2 years ago
Hi @VishDev12, thanks for reporting this, and for the detailed description of the issue. I'd be happy to look into implementing the change needed to propagate the requested logging directory so that the logs appear where the user expects to see them.
I am able to reproduce the issue with the following script:
import argparse

import numpy as np
import torch
import torch.nn as nn

import ray.train as train
from ray.train import Trainer
from ray.train.callbacks import JsonLoggerCallback, TBXLoggerCallback


class LinearDataset(torch.utils.data.Dataset):
    """y = a * x + b"""

    def __init__(self, a, b, size=1000):
        x = np.arange(0, 10, 10 / size, dtype=np.float32)
        self.x = torch.from_numpy(x)
        self.y = torch.from_numpy(a * x + b)

    def __getitem__(self, index):
        return self.x[index, None], self.y[index, None]

    def __len__(self):
        return len(self.x)


def train_epoch(dataloader, model, loss_fn, optimizer):
    for X, y in dataloader:
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def validate_epoch(dataloader, model, loss_fn):
    num_batches = len(dataloader)
    model.eval()
    loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            loss += loss_fn(pred, y).item()
    loss /= num_batches
    import copy

    model_copy = copy.deepcopy(model)
    result = {"model": model_copy.cpu().state_dict(), "loss": loss}
    return result


def train_func(config):
    data_size = config.get("data_size", 1000)
    val_size = config.get("val_size", 400)
    batch_size = config.get("batch_size", 32)
    hidden_size = config.get("hidden_size", 1)
    lr = config.get("lr", 1e-2)
    epochs = config.get("epochs", 3)

    train_dataset = LinearDataset(2, 5, size=data_size)
    val_dataset = LinearDataset(2, 5, size=val_size)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
    validation_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)

    train_loader = train.torch.prepare_data_loader(train_loader)
    validation_loader = train.torch.prepare_data_loader(validation_loader)

    model = nn.Linear(1, hidden_size)
    model = train.torch.prepare_model(model)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    results = []
    for _ in range(epochs):
        train_epoch(train_loader, model, loss_fn, optimizer)
        result = validate_epoch(validation_loader, model, loss_fn)
        train.report(**result)
        results.append(result)
    return results


def train_linear(num_workers=2, use_gpu=False, epochs=3):
    # The Trainer is given logdir='abc'; this is the value that gets lost.
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu, logdir='abc')
    config = {"lr": 1e-2, "hidden_size": 1, "batch_size": 4, "epochs": epochs}

    from ray import tune

    trainable = trainer.to_tune_trainable(train_func)
    search_space = {
        "lr": tune.sample_from(lambda spec: 10 ** (-10 * np.random.rand())),
        "momentum": tune.uniform(0.1, 0.9),
    }
    analysis = tune.run(
        trainable,
        name='github_issue',
        config=search_space,
        num_samples=10,
        verbose=1,
    )
    print(analysis.results_df)
    return 0


import ray

ray.init('auto', ignore_reinit_error=True)
train_linear(
    num_workers=2, use_gpu=False, epochs=10
)
The logs are:
(TrainTrainable pid=10125) 2022-06-04 11:58:26,340 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-26
(TrainTrainable pid=10193) 2022-06-04 11:58:28,557 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-28
(TrainTrainable pid=10125) 2022-06-04 11:58:29,842 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-26/run_001
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,818 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,821 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
WARNING:root:NaN or Inf found in input tensor.
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,864 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10254) 2022-06-04 11:58:29,865 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,865 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10255) 2022-06-04 11:58:29,865 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10193) 2022-06-04 11:58:32,657 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-28/run_001
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,642 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,679 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10363) 2022-06-04 11:58:32,680 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,642 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,679 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10364) 2022-06-04 11:58:32,679 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10464) 2022-06-04 11:58:35,653 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-35
(TrainTrainable pid=10475) 2022-06-04 11:58:36,656 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-36
(TrainTrainable pid=10464) 2022-06-04 11:58:39,492 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-35/run_001
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,478 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,415 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,525 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10597) 2022-06-04 11:58:39,525 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,522 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10596) 2022-06-04 11:58:39,522 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:40,462 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:40,467 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10475) 2022-06-04 11:58:41,488 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-36/run_001
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:41,509 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10655) 2022-06-04 11:58:41,510 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:41,509 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10654) 2022-06-04 11:58:41,510 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=10769) 2022-06-04 11:58:43,806 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-43
(TrainTrainable pid=10813) 2022-06-04 11:58:45,894 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-45
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:47,244 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:47,252 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10769) 2022-06-04 11:58:48,266 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-43/run_001
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:48,300 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10881) 2022-06-04 11:58:48,301 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:48,300 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10880) 2022-06-04 11:58:48,301 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,109 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=10813) 2022-06-04 11:58:49,146 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-45/run_001
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,132 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,166 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10972) 2022-06-04 11:58:49,167 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,166 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=10971) 2022-06-04 11:58:49,166 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=11092) 2022-06-04 11:58:53,164 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53
(TrainTrainable pid=11101) 2022-06-04 11:58:53,865 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,633 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,699 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(TrainTrainable pid=11092) 2022-06-04 11:58:56,741 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53/run_001
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,783 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11223) 2022-06-04 11:58:56,784 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,775 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11225) 2022-06-04 11:58:56,775 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,128 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11101) 2022-06-04 11:58:57,226 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-58-53/run_001
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,162 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,246 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11277) 2022-06-04 11:58:57,247 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,246 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11276) 2022-06-04 11:58:57,246 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=11432) 2022-06-04 11:59:02,095 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02
(TrainTrainable pid=11434) 2022-06-04 11:59:02,090 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,424 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,418 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11434) 2022-06-04 11:59:05,453 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02/run_001
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,473 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11567) 2022-06-04 11:59:05,474 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,473 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11566) 2022-06-04 11:59:05,473 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:05,458 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:05,460 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=11432) 2022-06-04 11:59:06,473 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_11-59-02/run_001
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:06,494 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11573) 2022-06-04 11:59:06,494 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:06,493 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=11572) 2022-06-04 11:59:06,494 INFO torch.py:135 -- Wrapping provided model in DDP.
2022-06-04 11:59:07,584 INFO tune.py:741 -- Total run time: 43.99 seconds (43.81 seconds for the tuning loop).
loss _timestamp _time_this_iter_s _training_iteration \
trial_id
4efbb_00000 NaN 1654369109 0.023204 3
4efbb_00001 104.306430 1654369112 0.027761 3
4efbb_00002 3.784865 1654369119 0.025730 3
4efbb_00003 541.657377 1654369121 0.043916 3
4efbb_00004 125.506339 1654369128 0.026837 3
4efbb_00005 277.940789 1654369129 0.022659 3
4efbb_00006 9.672665 1654369136 0.034569 3
4efbb_00007 137.866315 1654369137 0.022057 3
4efbb_00008 481.795680 1654369146 0.026306 3
4efbb_00009 5.880573 1654369145 0.021298 3
time_this_iter_s done timesteps_total episodes_total \
trial_id
4efbb_00000 0.022866 True None None
4efbb_00001 0.027816 True None None
4efbb_00002 0.025495 True None None
4efbb_00003 0.041988 True None None
4efbb_00004 0.042421 True None None
4efbb_00005 0.027935 True None None
4efbb_00006 0.033077 True None None
4efbb_00007 0.021848 True None None
4efbb_00008 0.025594 True None None
4efbb_00009 0.021665 True None None
training_iteration experiment_id ... \
trial_id ...
4efbb_00000 3 e9dcf570f30341c39d6e57b29f0584e1 ...
4efbb_00001 3 8d1431dbfd1143eaa99e0840c77e0fb2 ...
4efbb_00002 3 cea3d540ae0f4882a0c30e275cea25dd ...
4efbb_00003 3 aa11d27f351b42748d9780b997e45765 ...
4efbb_00004 3 49d43eaefef3460bb444ac740690e5f5 ...
4efbb_00005 3 1134263520cc49039d4d7575352cbe84 ...
4efbb_00006 3 442ddfcfbd7145368535b8fa54fed340 ...
4efbb_00007 3 560212d41c0a429b863ced7e677ba36f ...
4efbb_00008 3 0cf95a9f00f841a0925e7b2feba37238 ...
4efbb_00009 3 dfe055744a0f4c7fad704743d6fd4922 ...
node_ip time_since_restore timesteps_since_restore \
trial_id
4efbb_00000 172.31.85.84 3.598558 0
4efbb_00001 172.31.85.84 4.199704 0
4efbb_00002 172.31.85.84 3.957810 0
4efbb_00003 172.31.85.84 4.955879 0
4efbb_00004 172.31.85.84 4.603604 0
4efbb_00005 172.31.85.84 3.345599 0
4efbb_00006 172.31.85.84 3.723681 0
4efbb_00007 172.31.85.84 3.460354 0
4efbb_00008 172.31.85.84 4.490146 0
4efbb_00009 172.31.85.84 3.452429 0
iterations_since_restore warmup_time \
trial_id
4efbb_00000 3 0.003227
4efbb_00001 3 0.002933
4efbb_00002 3 0.003157
4efbb_00003 3 0.003367
4efbb_00004 3 0.002966
4efbb_00005 3 0.003413
4efbb_00006 3 0.004617
4efbb_00007 3 0.003315
4efbb_00008 3 0.002941
4efbb_00009 3 0.003645
experiment_tag model/module.weight \
trial_id
4efbb_00000 0_lr=0.3698,momentum=0.1422 [[tensor(nan)]]
4efbb_00001 1_lr=0.0000,momentum=0.7394 [[tensor(0.9652)]]
4efbb_00002 2_lr=0.0121,momentum=0.1462 [[tensor(2.3552)]]
4efbb_00003 3_lr=0.0000,momentum=0.3054 [[tensor(-0.9554)]]
4efbb_00004 4_lr=0.0001,momentum=0.1023 [[tensor(0.9996)]]
4efbb_00005 5_lr=0.0000,momentum=0.3155 [[tensor(0.0530)]]
4efbb_00006 6_lr=0.0007,momentum=0.7753 [[tensor(2.4362)]]
4efbb_00007 7_lr=0.0002,momentum=0.6102 [[tensor(0.7600)]]
4efbb_00008 8_lr=0.0000,momentum=0.5501 [[tensor(-0.7799)]]
4efbb_00009 9_lr=0.0007,momentum=0.1651 [[tensor(2.2419)]]
model/module.bias config/lr config/momentum
trial_id
4efbb_00000 [tensor(nan)] 3.698384e-01 0.142184
4efbb_00001 [tensor(0.9858)] 5.022303e-09 0.739362
4efbb_00002 [tensor(1.4479)] 1.212922e-02 0.146240
4efbb_00003 [tensor(-0.1643)] 3.371647e-06 0.305439
4efbb_00004 [tensor(-0.2717)] 6.919282e-05 0.102255
4efbb_00005 [tensor(0.1515)] 1.201724e-08 0.315512
4efbb_00006 [tensor(-0.1940)] 7.227677e-04 0.775255
4efbb_00007 [tensor(0.7147)] 1.679175e-04 0.610192
4efbb_00008 [tensor(0.0790)] 8.951328e-06 0.550088
4efbb_00009 [tensor(1.3662)] 7.268414e-04 0.165074
The experiment files are:
(base) ray@ip-172-31-85-84:~/workspace-project-JimFixGithubIssue$ ls /home/ray/ray_results/github_issue
basic-variant-state-2022-06-04_11-53-42.json
basic-variant-state-2022-06-04_11-54-56.json
basic-variant-state-2022-06-04_11-56-35.json
basic-variant-state-2022-06-04_11-57-11.json
basic-variant-state-2022-06-04_11-58-23.json
experiment_state-2022-06-04_11-53-42.json
experiment_state-2022-06-04_11-54-56.json
experiment_state-2022-06-04_11-56-35.json
experiment_state-2022-06-04_11-57-11.json
experiment_state-2022-06-04_11-58-23.json
'tune_function_0e446_00000_0_lr=0.0012,momentum=0.2139_2022-06-04_11-56-35'
'tune_function_23e53_00000_0_lr=0.0000,momentum=0.3688_2022-06-04_11-57-11'
'tune_function_4efbb_00000_0_lr=0.3698,momentum=0.1422_2022-06-04_11-58-23'
'tune_function_4efbb_00001_1_lr=0.0000,momentum=0.7394_2022-06-04_11-58-26'
'tune_function_4efbb_00002_2_lr=0.0121,momentum=0.1462_2022-06-04_11-58-33'
'tune_function_4efbb_00003_3_lr=0.0000,momentum=0.3054_2022-06-04_11-58-34'
'tune_function_4efbb_00004_4_lr=0.0001,momentum=0.1023_2022-06-04_11-58-41'
'tune_function_4efbb_00005_5_lr=0.0000,momentum=0.3155_2022-06-04_11-58-43'
'tune_function_4efbb_00006_6_lr=0.0007,momentum=0.7753_2022-06-04_11-58-50'
'tune_function_4efbb_00007_7_lr=0.0002,momentum=0.6102_2022-06-04_11-58-51'
'tune_function_4efbb_00008_8_lr=0.0000,momentum=0.5501_2022-06-04_11-58-59'
'tune_function_4efbb_00009_9_lr=0.0007,momentum=0.1651_2022-06-04_11-58-59'
'tune_function_a728b_00000_0_lr=0.0000,momentum=0.1061_2022-06-04_11-53-43'
'tune_function_d3662_00000_0_lr=0.0000,momentum=0.8638_2022-06-04_11-54-56'
(base) ray@ip-172-31-85-84:~/workspace-project-JimFixGithubIssue$ ls /home/ray/ray_results/abc/
(TrainTrainable pid=18092) 2022-06-04 12:15:39,771 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18132) 2022-06-04 12:15:42,043 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18092) /home/ray/ray_results/abc None
(TrainTrainable pid=18092) 2022-06-04 12:15:42,882 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,866 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,908 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18194) 2022-06-04 12:15:42,909 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,868 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,909 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18195) 2022-06-04 12:15:42,910 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:45,269 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:45,241 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(TrainTrainable pid=18132) /home/ray/ray_results/abc None
(TrainTrainable pid=18132) 2022-06-04 12:15:46,290 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:46,311 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18298) 2022-06-04 12:15:46,311 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:46,311 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18299) 2022-06-04 12:15:46,311 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18355) 2022-06-04 12:15:46,722 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18418) 2022-06-04 12:15:49,573 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18355) /home/ray/ray_results/abc None
(TrainTrainable pid=18355) 2022-06-04 12:15:49,796 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,769 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,824 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18450) 2022-06-04 12:15:49,825 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,722 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,828 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18449) 2022-06-04 12:15:49,828 INFO torch.py:135 -- Wrapping provided model in DDP.
WARNING:root:NaN or Inf found in input tensor.
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:53,003 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:53,065 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18595) 2022-06-04 12:15:53,956 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18418) /home/ray/ray_results/abc None
(TrainTrainable pid=18418) 2022-06-04 12:15:54,018 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:54,038 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18590) 2022-06-04 12:15:54,038 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:54,038 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18591) 2022-06-04 12:15:54,038 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:56,994 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:57,028 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18731) 2022-06-04 12:15:57,640 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=18595) 2022-06-04 12:15:58,014 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=18595) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:58,053 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18736) 2022-06-04 12:15:58,054 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:58,053 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18735) 2022-06-04 12:15:58,054 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18731) /home/ray/ray_results/abc None
(TrainTrainable pid=18731) 2022-06-04 12:16:01,072 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,057 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,092 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18891) 2022-06-04 12:16:01,092 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,037 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,092 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=18890) 2022-06-04 12:16:01,092 INFO torch.py:135 -- Wrapping provided model in DDP.
(TrainTrainable pid=18886) 2022-06-04 12:16:01,674 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19020) 2022-06-04 12:16:04,661 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:04,733 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:04,736 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=18886) 2022-06-04 12:16:05,753 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=18886) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:05,792 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19032) 2022-06-04 12:16:05,793 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:05,792 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19033) 2022-06-04 12:16:05,793 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:08,105 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:08,110 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=19020) /home/ray/ray_results/abc None
(TrainTrainable pid=19020) 2022-06-04 12:16:09,121 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:09,141 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19170) 2022-06-04 12:16:09,141 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:09,141 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19169) 2022-06-04 12:16:09,141 INFO torch.py:135 -- Wrapping provided model in DDP.
WARNING:root:NaN or Inf found in input tensor.
(TrainTrainable pid=19224) 2022-06-04 12:16:09,572 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19310) 2022-06-04 12:16:12,730 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/abc
(TrainTrainable pid=19224) /home/ray/ray_results/abc None
(TrainTrainable pid=19224) 2022-06-04 12:16:12,816 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,802 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,836 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19318) 2022-06-04 12:16:12,837 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,802 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,837 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19319) 2022-06-04 12:16:12,837 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:15,886 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:15,907 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
(TrainTrainable pid=19310) 2022-06-04 12:16:16,906 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/abc/run_001
(TrainTrainable pid=19310) /home/ray/ray_results/abc None
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:16,926 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19453) 2022-06-04 12:16:16,927 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:16,927 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=19454) 2022-06-04 12:16:16,927 INFO torch.py:135 -- Wrapping provided model in DDP.
2022-06-04 12:16:18,021 INFO tune.py:741 -- Total run time: 42.32 seconds (42.07 seconds for the tuning loop).
And the `run_id` is also an issue: when a directory name is given, the runs might overwrite each other.
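The overwrite concern can be illustrated with a small sketch. It uses a hypothetical `ToyTrainer`, not Ray's code, and assumes each trainer keeps its own in-memory run counter starting at zero, which would be consistent with every trial in the logs above reporting `run_001`:

```python
import tempfile
from pathlib import Path


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not Ray's implementation."""

    def __init__(self, logdir: Path):
        self.logdir = logdir
        # Assumption: the run counter lives on the instance, so every
        # freshly constructed trainer starts counting from zero again.
        self._run_id = 0

    def run(self, payload: str) -> Path:
        self._run_id += 1
        run_dir = self.logdir / f"run_{self._run_id:03d}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # Each run writes its result to the same file name.
        (run_dir / "result.txt").write_text(payload)
        return run_dir


def demo_collision():
    """Two trainers sharing one logdir both write into run_001."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp) / "abc"
        first = ToyTrainer(shared).run("trial 0")
        second = ToyTrainer(shared).run("trial 1")
        same_dir = first == second
        surviving = (first / "result.txt").read_text()
    return same_dir, surviving
```

Under that assumption, the second trainer's output replaces the first one's, because both resolve to the same `run_001` path.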
Thanks for reporting this @VishDev12! @VishDev12 @peytondmurray would one of you be willing to make a PR to pass the correct variable through? I or @JiahaoYao would be happy to help shepherd it in!
@amogkam I'd be happy to make a PR here, but after I posted my comment above it looked like @JiahaoYao made a PR before me: https://github.com/ray-project/ray/pull/25483. Let me know how you'd like to proceed - if you still need a PR, I'd be more than willing to provide it.
Hi @VishDev12 @peytondmurray @amogkam,
It seems that, at the beginning, Ray Train creates its own folder:
(TrainTrainable pid=38829) 2022-06-06 18:10:38,717 INFO trainer.py:244 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-06_18-10-38
(TrainTrainable pid=38829) 2022-06-06 18:10:43,014 INFO trainer.py:250 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-06_18-10-38/run_001
And, at the end,
== Status ==
Current time: 2022-06-06 18:10:55 (running for 00:00:22.07)
Memory usage on this node: 7.9/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/4 GPUs, 0.0/120.81 GiB heap, 0.0/55.77 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /home/ray/ray_results/tune_function_2022-06-06_18-10-32
Number of trials: 1/1 (1 TERMINATED)
And it seems that all the data is redirected to the Tune directory.
So, instead of creating Ray Train's directory, how would you feel about disabling the creation of Train's directory when the trainer is passed to Ray Tune? @VishDev12 @peytondmurray
That sounds good to me! Because even if the fix is made to pass the `logdir` to `_create_tune_trainable`, there would just be an unused `run_001` folder created inside the `logdir`. So disabling the folder creation completely from the Trainer inside `_create_tune_trainable` => `tune_function` would be a perfect solution, if that's possible.
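One way to picture the proposal is a switch that skips directory creation when the trainer is built inside the Tune function. The sketch below is only an illustration: `ToyTrainer` and its `create_logdir` flag are hypothetical and are not part of Ray's API.

```python
import tempfile
from pathlib import Path


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not Ray's implementation."""

    def __init__(self, logdir: Path, create_logdir: bool = True):
        self.logdir = logdir
        # Hypothetical flag: skip creating the directory when something
        # else (here, the tuning layer) already owns the result directory.
        if create_logdir:
            self.logdir.mkdir(parents=True, exist_ok=True)


def demo_disable_creation():
    """Return whether each trainer's directory exists on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        standalone = ToyTrainer(root / "train_standalone")
        inside_tune = ToyTrainer(root / "train_inside_tune", create_logdir=False)
        return standalone.logdir.exists(), inside_tune.logdir.exists()
```

With the flag off, no stray trainer directory (and therefore no unused `run_001`) would be left behind next to the Tune results.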
Thanks @VishDev12, sgtm!
This isn't fixed, right?
What happened + What you expected to happen
When initializing a Ray Trainer, we provide a `logdir` argument, and the `__init__` method of the Trainer stores it as a `logdir` attribute.
Then, when creating a Trainable with `Trainer.to_tune_trainable()`, it in turn calls `_create_tune_trainable()`, which does not use `self.logdir`. So when `tune_function` is defined inside `_create_tune_trainable` with a Trainer initialization call, there's no logdir passed to it, and so this Trainer ends up creating its own logdir in the default path `~/ray_results`.
https://github.com/ray-project/ray/blob/7f1bacc7dc9caf6d0ec042e39499bbf1d9a7d065/python/ray/train/trainer.py#L828-L843
This could be solved by passing `self.logdir` along to `_create_tune_trainable` and using that in the Trainer initialization.
Additionally, we also have the issue of the non-customizable run directory created by Ray Train as `run_<run_id>`. But since this directory is unused, it's not too much of an issue. And I believe it's partially being tracked here: https://github.com/ray-project/ray/issues/20807
Versions / Dependencies
Python: 3.8.13 Ray: 1.12.1 Ubuntu 18.04.5
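The propagation gap described under "What happened" can be sketched with toy code. Everything here is hypothetical (`ToyTrainer`, `_create_trainable`, and the `DEFAULT_LOGDIR` placeholder); it mirrors only the shape of the problem, not Ray's actual implementation:

```python
from typing import Callable, Optional

# Placeholder standing in for whatever default path a trainer would pick.
DEFAULT_LOGDIR = "~/ray_results/train_<timestamp>"


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not Ray's implementation."""

    def __init__(self, logdir: Optional[str] = None):
        self.logdir = logdir if logdir is not None else DEFAULT_LOGDIR

    def to_tune_trainable_buggy(self) -> Callable[[], str]:
        # The outer trainer's logdir is never handed over.
        return _create_trainable()

    def to_tune_trainable_fixed(self) -> Callable[[], str]:
        # Pass the stored logdir along so the inner trainer can reuse it.
        return _create_trainable(logdir=self.logdir)


def _create_trainable(logdir: Optional[str] = None) -> Callable[[], str]:
    def tune_function() -> str:
        # A new trainer is built inside the function; without a logdir
        # argument it falls back to the default location.
        inner = ToyTrainer(logdir=logdir)
        return inner.logdir

    return tune_function
```

In this toy version, `ToyTrainer("abc").to_tune_trainable_buggy()()` falls back to the default path, while the `_fixed` variant returns `"abc"`.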
Reproduction script
This isn't a reproduction script but should serve as an indicator of the variables that are being passed in.
Notes
- The `tune_trial_name_creator` function generates a unique name and uses the `config` in the `ray.tune.trial.Trial` object that's passed in to store the generated trial_name.
- The `tune_dirname_creator` function simply returns the `trial_name` accessed from the `Trial` object. This way, we use the exact trial_name as the directory name.
- `local_dir`/`experiment_name`/`trial_name`.
- The `logdir` passed to the `Trainer` is `local_dir`/`experiment_name`.
Issue Severity
Low
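For readers trying to picture the two helper functions described under Notes, here is a rough sketch. It is not the author's code: `ToyTrial` is a stand-in for the `Trial` object, and the `"trial_name"` config key, the name format, and the example paths are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a trial object carrying a config dict."""

    config: dict = field(default_factory=dict)


def tune_trial_name_creator(trial: ToyTrial) -> str:
    # Generate a unique name and remember it in the trial's config
    # (the "trial_name" key is an assumption, not taken from the issue).
    name = f"trial_{uuid.uuid4().hex[:8]}"
    trial.config["trial_name"] = name
    return name


def tune_dirname_creator(trial: ToyTrial) -> str:
    # Reuse the stored name so the directory matches the trial name exactly.
    return trial.config["trial_name"]


# Illustrative values only.
local_dir = Path("/tmp/results")
experiment_name = "my_experiment"

trial = ToyTrial()
trial_name = tune_trial_name_creator(trial)
trial_dir = local_dir / experiment_name / tune_dirname_creator(trial)
trainer_logdir = local_dir / experiment_name  # what the Trainer receives
```

In a real run, functions like these would presumably be handed to `tune.run` through its `trial_name_creator` and `trial_dirname_creator` arguments, with the trainer's `logdir` set to the parent of the trial directory.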