tombh opened this issue 3 years ago
Have you tried with the Trainer argument `accelerator="ddp"`?
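i.e. something along these lines (just a sketch, keeping the multi-GPU setting from your reproduction code, so assuming `gpus=3`; pytorch-lightning 1.2.x API):

```python
import pytorch_lightning as pl

# Use DistributedDataParallel explicitly instead of the default multi-GPU accelerator
trainer = pl.Trainer(gpus=3, accelerator="ddp")
```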
Yes, and it's the same error.
BTW, I forgot to say thank you for this project 😃
Hm. The error mostly happens if the output contains `nan`s. It executed fine for me on the CPU. Same for you?
And thanks for liking the project :)
I'd seen in the issues here that this exact error can occur if there are `nan`s, and so that's what I explored in my own production datasets. But the example code I provided above to reproduce the problem can't have `nan`s in it, right? So yes, the above code runs perfectly on my laptop's CPU (after changing `gpus=3` to `gpus=None`). It's only when I run it on a production server with multiple GPUs that the error occurs.
Hm. I can imagine there still being `nan`s. Can you explicitly define the `target_normalizer`?
I wonder if I'm misunderstanding where to look for `nan`s? I don't suppose it's in the data returned from `create_dataframes()` above? Rather, should I be looking at the dataset returned by `TimeSeriesDataSet`, i.e. the data after it's been normalised and turned into samples? Is that what you mean by "output", and is that where I should be looking for `nan`s instead? How do I look at that? Sorry for so many questions!
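In case it helps frame the question, this is roughly how I imagined inspecting what the dataloader actually yields (assuming a `TimeSeriesDataSet` called `training_ds` as in the example above; just a sketch, not sure it's the intended way):

```python
import torch

# Pull one batch from the training dataloader and look for nans in the
# network inputs (x is a dict of tensors) and in the targets (y).
dataloader = training_ds.to_dataloader(train=True, batch_size=64, num_workers=0)
x, y = next(iter(dataloader))

for name, tensor in x.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        print(name, "has nans:", bool(torch.isnan(tensor).any()))

# y is typically (target, weight); with multiple targets the target itself is a list
target = y[0] if isinstance(y, (list, tuple)) else y
targets = target if isinstance(target, (list, tuple)) else [target]
for i, t in enumerate(targets):
    print(f"target {i} has nans:", bool(torch.isnan(t).any()))
```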
I added the default normalizers explicitly to the code example above. Is that what you meant? Like this:
```python
target_normalizer=MultiNormalizer(
    normalizers=[
        EncoderNormalizer(transformation="relu"),
        EncoderNormalizer(transformation="relu"),
    ]
),
```
What about the newest release 0.8.5? It solves an edge-case issue with normalisation.
Hello @jdb78,
Thanks for the great project. I had a similar problem with multiple targets, and two mistakes on my side turned out to be the cause: first, I had a `MultiLoss` with only a single metric; second, I had a single value for `output_size` in the TFT (it needs to be a list with one entry per target). (If I'm not just being a noob, it would be great to add these tips to the documentation ;)
My working toy code (torch==1.8.1; pytorch-forecasting==0.8.5; pytorch-lightning==1.2.10):
```python
import pandas as pd
import numpy as np
import pytorch_lightning as pl
import torch
from pytorch_forecasting.models import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss, MultiLoss
from pytorch_forecasting import TimeSeriesDataSet

# Toy data: two covariates, two targets, a single group, time index = row index
df = pd.DataFrame(np.random.randn(100, 4))
df.columns = ['x1', 'x2', 'target1', 'target2']
df['constant'] = 1
df['time_idx'] = df.index

training_ds = TimeSeriesDataSet(
    df[df.time_idx < 90],
    time_idx='time_idx',
    target=['target1', 'target2'],  # multiple targets
    group_ids=['constant'],
    min_encoder_length=3,
    max_encoder_length=3,
    min_prediction_length=1,
    max_prediction_length=1,
    time_varying_unknown_reals=['x1', 'x2', 'target1', 'target2'],
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)
validation_ds = TimeSeriesDataSet.from_dataset(
    training_ds, df[df.time_idx > 90], predict=True, stop_randomization=True)

train_dataloader = training_ds.to_dataloader(train=True, batch_size=10, num_workers=4)
val_dataloader = validation_ds.to_dataloader(train=False, batch_size=10, num_workers=4)

trainer = pl.Trainer(max_epochs=10, gpus=0, gradient_clip_val=0.1, limit_train_batches=30)
tft = TemporalFusionTransformer.from_dataset(
    training_ds, learning_rate=0.03, hidden_size=16, attention_head_size=1,
    dropout=0.1, hidden_continuous_size=8,
    output_size=[8, 8],  # one output size per target
    loss=MultiLoss(metrics=[QuantileLoss(), QuantileLoss()]),  # one metric per target
    log_interval=10, reduce_on_plateau_patience=4,
)
trainer.fit(tft,
            train_dataloader=train_dataloader,
            val_dataloaders=val_dataloader)
```
I get the same error as @tombh when using multiple targets with the TFT model and the "ddp" accelerator (2 GPUs), but if I change to a single target, it works fine. Any progress here?
Expected behavior
Just to work as normal
Actual behavior
Error is:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
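For context: this is the generic PyTorch error raised when `backward()` is called on a loss tensor that is detached from the autograd graph. A minimal stand-alone illustration, unrelated to pytorch-forecasting itself:

```python
import torch

# A tensor created directly (or detached) has requires_grad=False and no grad_fn,
# so calling backward() on it raises exactly this RuntimeError.
loss = torch.tensor(1.0)
loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...
```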
Traceback
```
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 161, in new_process
    results = trainer.train_or_test_or_predict()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 556, in train_or_test_or_predict
    results = self.run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 425, in optimizer_step
    model_ref.optimizer_step(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_forecasting/optim.py", line 131, in step
    _ = closure()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 648, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 755, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 784, in backward
    result.closure_loss = self.trainer.accelerator.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 256, in backward
    output = self.precision_plugin.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 71, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1251, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
Code to reproduce the problem