TimothyLiuu opened 3 years ago
I got the same error but with the default QuantileLoss.
Tested on:
Thank you for your help!
Are you using multiple GPUs? For what it's worth – I'm also hitting this error, but only when using multiple GPUs and multiple targets.
I've ensured there are no nulls in my dataset, the values are normalized, and I'm using a low learning rate with clipped gradients to reduce instability. Here's what I'm noticing (a single-device workaround sketch follows the list):
Single target + CPU --> works
Multiple targets + CPU --> works
Single target + 1 GPU --> works
Multiple targets + 1 GPU --> works
Single target + multiple GPUs --> works
Multiple targets + multiple GPUs --> broken
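Given that matrix, one possible workaround until the root cause is found is to pin multi-target runs to a single device. A minimal sketch with the Lightning Trainer (callbacks/logger omitted; this avoids the bug rather than fixing it):

import pytorch_lightning as pl

# Workaround sketch, not a fix: multiple targets + 1 GPU works per the matrix
# above, so restrict training to a single device until multi-GPU is sorted out.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,  # avoid the broken multiple-GPU + multiple-target combination
)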
I use "Single target + 1 GPU ", but it ends with the error.
Same problem. Any progress?
I am having the same issue with single target + multiple GPUs, using TFT with QuantileLoss(). Any news on this? @jdb78 thank you for this awesome project!
@jdb78, I've seen several similar cases in the issues but haven't found a working solution. I use 1 GPU and 1 target and still get the same error. Sorry I cannot provide a Colab; the code looks like this:
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="usage",
    group_ids=["A", "B", "C", "D"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["A", "B", "C", "D"],
    time_varying_known_categoricals=["day"],
    time_varying_known_reals=["time_idx"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[
        "usage",
        "log_usage",
        "avg_usage_by_A",
        "avg_usage_by_B",
        "avg_usage_by_C",
    ],
    allow_missing_timesteps=True,
    categorical_encoders={
        "A": NaNLabelEncoder(add_nan=True),
        "B": NaNLabelEncoder(add_nan=True),
        "C": NaNLabelEncoder(add_nan=True),
        "D": NaNLabelEncoder(add_nan=True),
    },
)

... ...

# throws RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
res = trainer.tuner.lr_find(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
    max_lr=1,
    min_lr=1e-6,
)
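One way to narrow this down is a diagnostic sketch (not part of the original snippet; it assumes tft and train_dataloader as defined above, and that the pytorch-forecasting dataloader yields x, (target, weight)): run one batch by hand and check that the loss is still attached to the autograd graph before lr_find ever calls backward():

import torch

# Diagnostic sketch: verify the loss carries a grad_fn for a single batch.
x, (target, weight) = next(iter(train_dataloader))
out = tft(x)
loss = tft.loss(out["prediction"], target)
print(loss.requires_grad, loss.grad_fn)  # expect True and a non-None grad_fn

If requires_grad is already False here, the problem is in the model/loss graph rather than in the tuner.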
I tried setting min_prediction_length=max_prediction_length and installing the master version of pytorch-lightning, and that worked for me; maybe you can try that. Hope it helps.
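For concreteness, that suggestion amounts to something like the following (a sketch based on the dataset definition quoted above; only the changed arguments are shown):

# Sketch of the suggested workaround: make the prediction length fixed.
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="usage",
    group_ids=["A", "B", "C", "D"],
    min_prediction_length=max_prediction_length,  # was 1
    max_prediction_length=max_prediction_length,
    # ... remaining arguments unchanged ...
)

# and install pytorch-lightning from master, e.g.:
# pip install git+https://github.com/Lightning-AI/lightning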
My code kept hitting the error. I lowered gradient_clip_val in the trainer, and now the issue seems to be gone.
My trainer before fixing the error:
trainer = pl.Trainer(
    max_epochs=45,
    accelerator='mps',
    devices=1,
    limit_train_batches=50,  # comment in for training, limiting each epoch to 50 batches
    # fast_dev_run=True,  # comment in to check that the network or dataset has no serious bugs
    enable_model_summary=True,
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
My trainer after the error went away:
trainer = pl.Trainer(
    max_epochs=45,
    accelerator='mps',
    devices=1,
    limit_train_batches=50,  # comment in for training, limiting each epoch to 50 batches
    # fast_dev_run=True,  # comment in to check that the network or dataset has no serious bugs
    enable_model_summary=True,
    gradient_clip_val=0.05,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
@TimothyLiuu were you able to solve the problem after all?
We are training a TFT (Temporal Fusion Transformer) model on 4 GPUs using AWS SageMaker training jobs. Here is an overview of what we have done so far:
We are using pytorch-forecasting 0.10.3 and pytorch-lightning 1.7.7.
Any feedback would be appreciated. :)
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 289, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_forecasting/optim.py", line 143, in step
    loss = closure()
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
    self._backward_fn(step_output.closure_loss)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1450, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
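One knob worth testing under DDP is to make the strategy explicit and tolerate unused parameters. This is an assumption to verify rather than a confirmed fix: parameters that receive no gradient on some ranks (e.g. heads of a multi-output model) are a known trigger for DDP desynchronization. A sketch with the Lightning 1.7-style API:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Sketch for a 4-GPU run. find_unused_parameters=True tells DDP to tolerate
# parameters that receive no gradient in a given step, at some speed cost.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=True),
)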
I get this error when running on MPS, not CPU. The y_pred passed to my loss function is all NaNs.
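A quick way to confirm the device dependence (a diagnostic sketch; model and dataloader are placeholders for your own objects) is to run the same batch on CPU and on MPS and compare the NaN fraction of the raw predictions:

import torch

# Diagnostic sketch: compare one forward pass on CPU vs. MPS.
x, y = next(iter(dataloader))
for device in ("cpu", "mps"):
    m = model.to(device)
    xb = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in x.items()}
    print(device, torch.isnan(m(xb)["prediction"]).float().mean().item())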
Expected behavior
Hello! Thanks for the brilliant work! When using the TemporalFusionTransformer, I subclassed QuantileLoss(MultiHorizonMetric) and modified the loss function, hoping to make the model's predictions more accurate.
Actual behavior
However, the error is: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Code to reproduce the problem
The definition of MyLoss() is:
And the error is:
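For anyone hitting this with a custom loss: the usual cause of "element 0 of tensors does not require grad" is computing the loss value off the autograd graph, e.g. via .detach(), .item(), or a round-trip through numpy. A hedged sketch of a QuantileLoss subclass that keeps everything in torch (the MyLoss name and the reweighting are illustrative only, not the code from this report):

import torch
from pytorch_forecasting.metrics import QuantileLoss

class MyLoss(QuantileLoss):
    # Illustrative subclass: tweak the quantile losses using torch ops only.
    # Calling .detach()/.item() or converting to numpy here would strip the
    # grad_fn and reproduce the RuntimeError above.
    def loss(self, y_pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        base = super().loss(y_pred, target)  # stays on the autograd graph
        return 2.0 * base  # example reweighting; keep all math in torch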