AlexMRuch opened this issue 3 years ago
Follow-up: the freezing with ddp isn't limited to the learning rate finder; it also happens with the main training run:
# Stop training when the loss metric does not improve on the validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateLogger()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # log to tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = [0, 1],
    distributed_backend = 'ddp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    early_stop_callback = early_stop_callback,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run = True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = res.suggestion(),  # use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence on network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss = QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning rate automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")

# Train model
trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)
Also, when I use ddp in that context, only 1 of my 2 GPUs is recognized by the system, even when setting gpus = 2 or gpus = [0,1]:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Number of parameters in transformer model: 23.4k
Interesting, I have not tried that yet. Let me look into that pickling error. It might be the main issue.
Could you try with the newest master? I hope some of the issues are directly fixed there. Maybe log_interval=-1 will also help.
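For concreteness, here is a minimal sketch of that suggestion (assuming the newest master is installed, e.g. pip install git+https://github.com/jdb78/pytorch-forecasting, and the training dataset and dataloaders from above are unchanged); the only change is log_interval=-1, which should disable the periodic example logging:

# same model definition as above, but with example logging disabled
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = 0.03,
    hidden_size = 16,
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,
    loss = QuantileLoss(),
    log_interval = -1,  # disable example logging instead of logging every 10 batches
    reduce_on_plateau_patience = 4,
)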
Thanks for all these updates!
With pytorch-forecasting=0.5.2, torch.cuda.device_count() returns 2 (the expected value); however, now pl.Trainer(gpus = [0,1]) and pl.Trainer(gpus = 2) both return
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Running pl.Trainer(gpus=2, distributed_backend='ddp') throws the same error. Really weird...
I think this is because of the recent upgrade to pytorch-lightning > 1.0, as their docs for Trainer have been updated: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html.
On the other hand, when I open Python, load only pytorch-lightning, and then directly run the code above, it works:
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pytorch_lightning as pl
>>> pl.Trainer(gpus=2, distributed_backend='ddp')
/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: DeprecationWarning: distributed_backend has been renamed to accelerator. Deprecated in 1.0.0, will be removed in 1.2.0
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
<pytorch_lightning.trainer.trainer.Trainer object at 0x7f7a2d7bf340>
When I try to initialize the Trainer object right at the very beginning of the notebook, it loads fine but throws a deprecation warning:
/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
and should_run_async(code)
/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: DeprecationWarning: distributed_backend has been renamed to accelerator. Deprecated in 1.0.0, will be removed in 1.2.0
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
I'll play around with the notebook and try to track down what's going on and will update you.
I don't know what kind of wizardry just happened, but after restarting my notebook the issue went away – pl.Trainer worked just fine 🤯
For good measure I deleted the whole environment and recreated it from scratch.
Now, however, when I use multiple GPUs for hyperparameter optimization, the notebook freezes. For whatever reason the model reports that both GPUs are visible, but only one of them is actually used before the optimization hangs.
In my terminal window that's running Jupyter Lab, I get
Traceback (most recent call last):
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/traitlets/config/application.py", line 844, in launch_instance
app.initialize(argv)
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 567, in initialize
self.init_sockets()
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 271, in init_sockets
self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 218, in _bind_socket
return self._try_bind_socket(s, port)
File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 194, in _try_bind_socket
s.bind("tcp://%s:%i" % (self.ip, port))
File "zmq/backend/cython/socket.pyx", line 550, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 26, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
I haven't gotten this before, and it seems to be related to ipykernel rather than directly to pytorch-forecasting, which makes it odd that it would mess up the whole pytorch-forecasting run unless I'm missing something...
Also, when I set gpus=[0] this goes away and the script runs smoothly (but only with 1 GPU, so still no luck with multi-GPU).
I'm still playing around and am trying to figure out what's going on.
Oh – nope, got the "only recognizing one GPU" issue again.
Because I had issues with using two GPUs, I ran the learning rate finder with just 1 GPU (ran fine) and then tried to use 2 GPUs for training:
# Stop training when the loss metric does not improve on the validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # log to tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = 2,  # use 2 for 2 GPUs
    distributed_backend = 'ddp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run = True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger, early_stop_callback],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = res.suggestion(),  # use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence on network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss = QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning rate automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")
But this returned
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Number of parameters in transformer model: 23.4k
Also, once I run that bit, when I then run pl.Trainer(gpus = 2) it only reports 1 GPU 🤯 Weirder yet is what happens if I do this.
^^^ Once I run pl.Trainer(gpus = X), it won't let me change the number of GPUs (X) afterwards. So if I use 1 GPU for the learning rate finder, I can't jump up to 2 GPUs for full training.
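A rough sketch of how this can be observed (my assumption, not verified in the Lightning source, is that the first Trainer pins CUDA_VISIBLE_DEVICES for the whole kernel process, so later Trainers inherit it):

import os
import torch
import pytorch_lightning as pl

print(torch.cuda.device_count())               # 2 in a fresh kernel
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # typically unset at this point

trainer_lr = pl.Trainer(gpus = [0])            # e.g. for the learning rate finder
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # now pinned for this process (my assumption)

trainer_full = pl.Trainer(gpus = 2)            # in the same kernel, only GPU 0 shows up afterwards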
Unrelated bug 🐛, but I now also get a Missing logger folder: lightning_logs/default error when I run
trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)
This folder used to be created automatically, but now I have to create it manually.
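A trivial workaround sketch until this is sorted out (assuming the default lightning_logs save_dir used above):

import os

# create the folder the TensorBoardLogger complains about before calling trainer.fit
os.makedirs("lightning_logs/default", exist_ok = True)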
Confirmed that the 2-GPU training still fails even when only training is run (the learning rate finder is skipped and 0.03 is used). Note that when I run it here, my terminal running Jupyter Lab does not throw the same ipykernel error noted above.
Hm. Can you run it as a script? Maybe it is an IPython issue after all. Sorry, I do not have 2 GPUs readily at hand to debug the issue. Have to say, your issues are really helpful though!! Much appreciated!
Yeah, I was planning to convert it to a script soon anyway, since I've done enough testing to get the notebook working with multiple variables across all states. I'll let you know how that goes.
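Roughly, the plan is just a plain train.py with a main guard, since ddp relaunches the script in new processes and (as far as I understand) that does not play well with a notebook kernel. A sketch only, reusing the same objects as the notebook cells above:

# train.py -- sketch only; `training`, `train_dataloader`, `val_dataloader`
# are assumed to be built exactly as in the notebook cells above
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss


def main():
    trainer = pl.Trainer(
        max_epochs = 500,
        gpus = 2,
        distributed_backend = 'ddp',
        gradient_clip_val = 0.1,
    )
    tft = TemporalFusionTransformer.from_dataset(
        training,
        learning_rate = 0.03,
        hidden_size = 16,
        attention_head_size = 1,
        dropout = 0.1,
        hidden_continuous_size = 8,
        output_size = 7,
        loss = QuantileLoss(),
        reduce_on_plateau_patience = 4,
    )
    trainer.fit(tft, train_dataloader = train_dataloader, val_dataloaders = val_dataloader)


if __name__ == "__main__":  # required so the processes ddp spawns do not re-run top-level code
    main()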
By the way, it seems like the issue may actually be due to pyzmq, for what it's worth.
Also, after letting the notebook sit on that * (running) indicator for what felt like 5+ minutes with no sign of life, it did pop out
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
I'm going to let it sit and see if we get to 2/2 and if it actually runs on both GPUs. Really odd that it takes so long to do the multiprocessing.
I may also post some of these issues on the pytorch-lightning channel since they're not directly related to pytorch-forecasting (e.g., there's nothing you can really do about the pl.Trainer bit).
I'll let you know if the multi-GPU eventually kicks off!
Glad to hear the posts have been helpful! Thanks for your feedback and support on these!
Okay, so the issue was having distributed_backend=ddp – it should have been dp (it runs right away on both GPUs with dp) 🤦. I misread that note in https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes (ddp is multi-GPU over multiple machines, but the note about ddp > dp threw me off, and I should have read it in more detail).
That being said, I now get this tensor shape error, which I confirmed only happens with multi-GPU runs and does not happen when gpus=1:
# Stop training when the loss metric does not improve on the validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = 2,
    distributed_backend = 'dp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run = True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger, early_stop_callback],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    #learning_rate = res.suggestion(),  # use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    learning_rate = 0.03,  # use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence on network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss = QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning rate automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")

# Train model
trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)
now throws
| Name | Type | Params
----------------------------------------------------------------------------------------
0 | loss | QuantileLoss | 0
1 | logging_metrics | ModuleList | 0
2 | input_embeddings | MultiEmbedding | 21
3 | prescalers | ModuleDict | 192
4 | static_variable_selection | VariableSelectionNetwork | 1 K
5 | encoder_variable_selection | VariableSelectionNetwork | 6 K
6 | decoder_variable_selection | VariableSelectionNetwork | 1 K
7 | static_context_variable_selection | GatedResidualNetwork | 1 K
8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 1 K
9 | static_context_initial_cell_lstm | GatedResidualNetwork | 1 K
10 | static_context_enrichment | GatedResidualNetwork | 1 K
11 | lstm_encoder | LSTM | 2 K
12 | lstm_decoder | LSTM | 2 K
13 | post_lstm_gate_encoder | GatedLinearUnit | 544
14 | post_lstm_add_norm_encoder | AddNorm | 32
15 | static_enrichment | GatedResidualNetwork | 1 K
16 | multihead_attn | InterpretableMultiHeadAttention | 1 K
17 | post_attn_gate_norm | GateAddNorm | 576
18 | pos_wise_ff | GatedResidualNetwork | 1 K
19 | pre_output_gate_norm | GateAddNorm | 576
20 | output_layer | Linear | 119
Epoch 0: 0%
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-29-14fda4f79b4a> in <module>
1 # Train model
----> 2 trainer.fit(
3 tft,
4 train_dataloader = train_dataloader,
5 val_dataloaders = val_dataloader
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
438 self.call_hook('on_fit_start')
439
--> 440 results = self.accelerator_backend.train()
441 self.accelerator_backend.teardown()
442
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in train(self)
95
96 # train or test
---> 97 results = self.train_or_test()
98
99 return results
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
64 results = self.trainer.run_test()
65 else:
---> 66 results = self.trainer.train()
67 return results
68
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
481
482 # run train epoch
--> 483 self.train_loop.run_training_epoch()
484
485 if self.max_steps and self.max_steps <= self.global_step:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
539 # TRAINING_STEP + TRAINING_STEP_END
540 # ------------------------------------
--> 541 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
542
543 # when returning -1 from train_step, we end epoch early
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
671 # calculate loss (train step + train step end)
672 # -------------------
--> 673 opt_closure_result = self.training_step_and_backward(
674 split_batch,
675 batch_idx,
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
758 """
759 # lightning module hook
--> 760 result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
761
762 if result is None:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step(self, split_batch, batch_idx, opt_idx, hiddens)
302 with self.trainer.profiler.profile('model_forward'):
303 args = self.build_train_args(split_batch, batch_idx, opt_idx, hiddens)
--> 304 training_step_output = self.trainer.accelerator_backend.training_step(args)
305 training_step_output = self.trainer.call_hook('training_step_end', training_step_output)
306
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in training_step(self, args)
108 output = self.trainer.model(*args)
109 else:
--> 110 output = self.trainer.model(*args)
111 return output
112
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in forward(self, *inputs, **kwargs)
85
86 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 87 outputs = self.parallel_apply(replicas, inputs, kwargs)
88
89 if isinstance(outputs[0], Result):
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
149
150 def parallel_apply(self, replicas, inputs, kwargs):
--> 151 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
152
153
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(modules, inputs, kwargs_tup, devices)
308 output = results[i]
309 if isinstance(output, Exception):
--> 310 raise output
311 outputs.append(output)
312 return outputs
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in _worker(i, module, input, kwargs, device)
261 # CHANGE
262 if module.training:
--> 263 output = module.training_step(*input, **kwargs)
264 fx_called = 'training_step'
265 elif module.testing:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_step(self, batch, batch_idx)
128 """
129 x, y = batch
--> 130 log, _ = self.step(x, y, batch_idx, label="train")
131 # log loss
132 self.log("train_loss", log["loss"], on_step=True, on_epoch=True, prog_bar=True)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
568 """
569 # extract data and run model
--> 570 log, out = super().step(x, y, batch_idx, label=label)
571 # calculate interpretations etc for latter logging
572 if self.log_interval(label == "train") > 0:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label)
204 )
205 else:
--> 206 loss = self.loss(prediction, y)
207
208 # logging losses
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in forward(self, *args, **kwargs)
143 """
144 # add current step
--> 145 self.update(*args, **kwargs)
146 self._forward_cache = None
147
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in wrapped_func(*args, **kwargs)
189 def wrapped_func(*args, **kwargs):
190 self._computed = None
--> 191 return update(*args, **kwargs)
192 return wrapped_func
193
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in update(self, y_pred, target)
313 weight = None
314
--> 315 losses = self.loss(y_pred, target)
316 # weight samples
317 if weight is not None:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in loss(self, y_pred, target)
450 losses = []
451 for i, q in enumerate(self.quantiles):
--> 452 errors = target - y_pred[..., i]
453 losses.append(torch.max((q - 1) * errors, q * errors).unsqueeze(-1))
454 losses = torch.cat(losses, dim=2)
RuntimeError: The size of tensor a (30) must match the size of tensor b (43) at non-singleton dimension 1
Also letting you know that this solves the ipykernel issue with pyzmq – probably because it was trying to find or send something to a second (non-existent) node on a cluster. So yay for that! Looks like this is back to something that's only related to pytorch-forecasting 😄
Confirmed this same error – post-dp switch – is thrown on the learning rate finder:
# Configure network and trainer
pl.seed_everything(407)
trainer = pl.Trainer(
    gpus = 2,  # use 2 for both GPUs
    distributed_backend = 'dp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    gradient_clip_val = 0.1  # hyperparam to prevent gradient divergence for RNNs
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate = 0.03,
    hidden_size = 16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size = 1,
    dropout = 0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size = 8,  # set to <= hidden_size
    output_size = 7,  # 7 quantiles by default
    loss = QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience = 4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

# Find optimal learning rate
res = trainer.tuner.lr_find(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader,
    max_lr = 10.,
    min_lr = 1e-6,
)
print(f"Suggested learning rate: {res.suggestion()}")
fig = res.plot(show = True, suggest = True)
fig.show()
throws
| Name | Type | Params
----------------------------------------------------------------------------------------
0 | loss | QuantileLoss | 0
1 | logging_metrics | ModuleList | 0
2 | input_embeddings | MultiEmbedding | 21
3 | prescalers | ModuleDict | 192
4 | static_variable_selection | VariableSelectionNetwork | 1 K
5 | encoder_variable_selection | VariableSelectionNetwork | 6 K
6 | decoder_variable_selection | VariableSelectionNetwork | 1 K
7 | static_context_variable_selection | GatedResidualNetwork | 1 K
8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 1 K
9 | static_context_initial_cell_lstm | GatedResidualNetwork | 1 K
10 | static_context_enrichment | GatedResidualNetwork | 1 K
11 | lstm_encoder | LSTM | 2 K
12 | lstm_decoder | LSTM | 2 K
13 | post_lstm_gate_encoder | GatedLinearUnit | 544
14 | post_lstm_add_norm_encoder | AddNorm | 32
15 | static_enrichment | GatedResidualNetwork | 1 K
16 | multihead_attn | InterpretableMultiHeadAttention | 1 K
17 | post_attn_gate_norm | GateAddNorm | 576
18 | pos_wise_ff | GatedResidualNetwork | 1 K
19 | pre_output_gate_norm | GateAddNorm | 576
20 | output_layer | Linear | 119
Finding best initial lr: 0%
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-29-6bbc95624f88> in <module>
1 # Find optimal learning rate
----> 2 res = trainer.tuner.lr_find(
3 tft,
4 train_dataloader = train_dataloader,
5 val_dataloaders = val_dataloader,
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/tuner/tuning.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
115 datamodule: Optional[LightningDataModule] = None
116 ):
--> 117 return lr_find(
118 self.trainer,
119 model,
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/tuner/lr_finder.py in lr_find(trainer, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
170
171 # Fit, lr & loss logged in callback
--> 172 trainer.fit(model,
173 train_dataloader=train_dataloader,
174 val_dataloaders=val_dataloaders,
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
438 self.call_hook('on_fit_start')
439
--> 440 results = self.accelerator_backend.train()
441 self.accelerator_backend.teardown()
442
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in train(self)
95
96 # train or test
---> 97 results = self.train_or_test()
98
99 return results
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
64 results = self.trainer.run_test()
65 else:
---> 66 results = self.trainer.train()
67 return results
68
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
481
482 # run train epoch
--> 483 self.train_loop.run_training_epoch()
484
485 if self.max_steps and self.max_steps <= self.global_step:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
539 # TRAINING_STEP + TRAINING_STEP_END
540 # ------------------------------------
--> 541 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
542
543 # when returning -1 from train_step, we end epoch early
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
671 # calculate loss (train step + train step end)
672 # -------------------
--> 673 opt_closure_result = self.training_step_and_backward(
674 split_batch,
675 batch_idx,
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
758 """
759 # lightning module hook
--> 760 result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
761
762 if result is None:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step(self, split_batch, batch_idx, opt_idx, hiddens)
302 with self.trainer.profiler.profile('model_forward'):
303 args = self.build_train_args(split_batch, batch_idx, opt_idx, hiddens)
--> 304 training_step_output = self.trainer.accelerator_backend.training_step(args)
305 training_step_output = self.trainer.call_hook('training_step_end', training_step_output)
306
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in training_step(self, args)
108 output = self.trainer.model(*args)
109 else:
--> 110 output = self.trainer.model(*args)
111 return output
112
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in forward(self, *inputs, **kwargs)
85
86 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 87 outputs = self.parallel_apply(replicas, inputs, kwargs)
88
89 if isinstance(outputs[0], Result):
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
149
150 def parallel_apply(self, replicas, inputs, kwargs):
--> 151 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
152
153
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(modules, inputs, kwargs_tup, devices)
308 output = results[i]
309 if isinstance(output, Exception):
--> 310 raise output
311 outputs.append(output)
312 return outputs
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in _worker(i, module, input, kwargs, device)
261 # CHANGE
262 if module.training:
--> 263 output = module.training_step(*input, **kwargs)
264 fx_called = 'training_step'
265 elif module.testing:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_step(self, batch, batch_idx)
128 """
129 x, y = batch
--> 130 log, _ = self.step(x, y, batch_idx, label="train")
131 # log loss
132 self.log("train_loss", log["loss"], on_step=True, on_epoch=True, prog_bar=True)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
568 """
569 # extract data and run model
--> 570 log, out = super().step(x, y, batch_idx, label=label)
571 # calculate interpretations etc for latter logging
572 if self.log_interval(label == "train") > 0:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label)
204 )
205 else:
--> 206 loss = self.loss(prediction, y)
207
208 # logging losses
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in forward(self, *args, **kwargs)
143 """
144 # add current step
--> 145 self.update(*args, **kwargs)
146 self._forward_cache = None
147
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in wrapped_func(*args, **kwargs)
189 def wrapped_func(*args, **kwargs):
190 self._computed = None
--> 191 return update(*args, **kwargs)
192 return wrapped_func
193
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in update(self, y_pred, target)
313 weight = None
314
--> 315 losses = self.loss(y_pred, target)
316 # weight samples
317 if weight is not None:
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in loss(self, y_pred, target)
450 losses = []
451 for i, q in enumerate(self.quantiles):
--> 452 errors = target - y_pred[..., i]
453 losses.append(torch.max((q - 1) * errors, q * errors).unsqueeze(-1))
454 losses = torch.cat(losses, dim=2)
RuntimeError: The size of tensor a (30) must match the size of tensor b (43) at non-singleton dimension 1
Given that consistency, my intuition tells me the 30 is how big my forecasting window is (30 days), but I'm not sure where the 43 is coming from, nor why this only happens in a multi-GPU run and not in a single-GPU run.
When I use %debug in the notebook, this is what the variables throwing the error look like:
ipdb> print(i)
0
ipdb> print(q)
0.02
ipdb> y_pred.shape
torch.Size([16, 43, 7])
ipdb> y_pred[0].shape
torch.Size([43, 7])
ipdb> y_pred[..., i].shape
torch.Size([16, 43])
ipdb> len(target)
16
ipdb> target[0].shape
torch.Size([30])
So y_pred is the one that's too long. (Still debugging)
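One quick check that might help narrow this down (a sketch only, run outside the Trainer / DataParallel wrapper, so on CPU): pull a single batch off the dataloader and compare the raw target length with the model's output length, to see whether the 43 is already there before dp splits the batch. Treat the "prediction" key as an assumption about the output dict:

# inspect one raw batch and one forward pass, before any multi-GPU scattering
x, y = next(iter(train_dataloader))

for name, value in x.items():
    if hasattr(value, "shape"):
        print(name, tuple(value.shape))

# depending on the dataset settings, y may be a tensor or a list/tuple of tensors
if isinstance(y, (list, tuple)):
    print("target:", [tuple(t.shape) for t in y if hasattr(t, "shape")])
else:
    print("target:", tuple(y.shape))

out = tft(x)
print("prediction:", tuple(out["prediction"].shape))  # assumption: output dict has a "prediction" entry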
Still exploring this issue. I wanted to unpack my notebook and make it more modular now that the pipeline is mostly built up. Everything else seems to be running smoothly now except for the multi-GPU bit. Something has to be happening to y_pred right before it goes from pytorch-forecasting into pytorch-lightning, so it should be easy to track down. I'll keep you updated. If you think of anything I should try (I can pull and install a GitHub version of pytorch-forecasting, try the edit, and do a PR), please let me know 😄 My present plan is just to print the shape of the prediction tensor as it's modified up to when it gets passed to pytorch-lightning and when it comes back from pytorch-lightning to pytorch-forecasting.
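If the raw batch looks fine, the instrumentation could be as simple as a hypothetical subclass that prints shapes from inside training_step on each replica (sketch only; ShapeLoggingTFT is not part of the library):

from pytorch_forecasting import TemporalFusionTransformer

# hypothetical debugging subclass -- prints the shapes each dp replica actually sees
class ShapeLoggingTFT(TemporalFusionTransformer):
    def training_step(self, batch, batch_idx):
        x, y = batch
        target = y[0] if isinstance(y, (list, tuple)) else y
        print(
            "replica device:", next(self.parameters()).device,
            "| target:", tuple(target.shape) if hasattr(target, "shape") else type(target),
        )
        return super().training_step(batch, batch_idx)

# drop-in replacement: from_dataset is a classmethod, so the subclass can be built the same way
# tft = ShapeLoggingTFT.from_dataset(training, ...)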
@AlexMRuch Any success?
Not yet. I have it on my calendar to debug after Thanksgiving. Have a few deadlines to hit before then, and the code is working satisfactorily on one GPU for now. Shouldn't be too difficult to pin down once I get time to run things, though. Definitely plan on figuring it out, as I'm looking forward to speeding up the hyperparameter optimization :-)
Totally understand! No rush :) Just was curious ;)
Hey hey – just wanted to update you that my projects got shifted up and we're not using time series anymore 😬 I don't think I'm going to have time to investigate this anytime soon, especially as I'm on family leave for our daughter's birth until mid March.
See also #215. At least "ddp" seems to work now.
Was a fix ever found for this? I tried using 2 GPUs with ddp and I have the above mentioned issue wherein training does not start.
When I initialize my TFT trainer to use multiple GPUs, the library is able to recognize that I used both GPUs. However, when I try to find the optimal learning rate, I get an
AttributeError: Can't pickle local object '_apply_to_outputs.<locals>.decorator_fn.<locals>.new_func'
error with the following trace:

Any idea what may be triggering this? My guess is that because I'm not distributing across multiple machines, the pickle is getting messed up. That's fine and just indicates I misunderstood that setting for distributed_backend, but moving on, I hit errors with the other distributed_backend settings as well.

Following https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes, when I hard-code distributed_backend to ddp2, I get this trace:

and when I hard-code distributed_backend to dp (which is what I would expect to work most readily), I get:

When I use ddp (as recommended for pytorch, given the speedup), the pipeline freezes, and running watch nvidia-smi from the terminal just shows the GPUs aren't moving and aren't loading any memory for processing.

This error is thrown using the same setup as I had in #85, which I got working on a single GPU, but now that I'm doing multivariate time series across all 50 states I'd really like to use both my GPUs to speed up the runtime.

Thanks!