sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. When refitting with updated data. #1362

Open DaniloMendezR opened 1 year ago

DaniloMendezR commented 1 year ago

My goal is to create a 14-day demand forecast using my own data. Demand is grouped by Machine and Dish. I followed the tutorial and created a TimeSeriesDataSet with my own data. I've successfully fit and predicted with it and got good results.

Expected behavior

Adding newer data and refitting the model should lead to a better-tuned model; the pipeline would stay the same and handle the additional demand observations.

Actual behavior

When I added new data the first time, I got the error in the title. I have no idea why. The only thing I knew was that the problem was in the new data. I managed to find the Machine/Dish combination that was giving me the problem, so I removed it and the problem was fixed.

The second time I added data I had no problems.

The third time I added data the problem returned, and I don't understand the mechanism behind why it happens or how to automatically identify the problematic Machine/Dish combinations so I can remove them from the data.
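
A rough check I'm considering is to flag Machine/Dish groups with a very short or constant demand history. The sketch below is only a guess at the cause, not a confirmed diagnosis; the threshold of 1 encoder step + the 14-day horizon is an assumption, and it uses the same DataFrame and column names as the code further down.

import pandas as pd

def flag_suspect_groups(df: pd.DataFrame, min_length: int = 15) -> pd.DataFrame:
    # summarise each Machine/Dish group: number of rows and number of distinct demand values
    stats = (
        df.groupby(["Machine", "Dish"])["Demand"]
        .agg(n_obs="count", n_unique="nunique")
        .reset_index()
    )
    # flag groups shorter than encoder + prediction window, or with a constant target
    return stats[(stats["n_obs"] < min_length) | (stats["n_unique"] <= 1)]

suspects = flag_suspect_groups(croston_simple_training_df)
print(suspects)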

Code to reproduce the problem

I can't share the data since it's private, but I can share the code I used. Note that the prediction length is 14.

# imports
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder
from pytorch_forecasting.metrics import QuantileLoss

# maximum encoder length = full history span in days
max_encoder_croston_simple = (croston_simple_covariates['Date'].max() - croston_simple_covariates['Date'].min()).days
croston_simple_training = TimeSeriesDataSet(
    croston_simple_training_df,
    time_idx="time_idx",
    target="Demand",
    group_ids=["Machine", "Dish"],
    min_encoder_length=1,  # to allow for cold starts
    max_encoder_length=max_encoder_croston_simple,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    min_prediction_idx=1,
    static_categoricals=static_categoricals,
    static_reals=static_reals,
    time_varying_known_categoricals=time_varying_known_categoricals,
    variable_groups=variable_groups,  # group of categorical variables can be treated as one variable
    time_varying_known_reals=time_varying_known_reals,
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=time_varying_unknown_reals,
    target_normalizer=GroupNormalizer(
        groups=["Machine", "Dish"], transformation="softplus"
    ),  # use softplus and normalize by group
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
    allow_missing_timesteps=True,
    categorical_encoders={
        "Dish": NaNLabelEncoder(add_nan=True),
        "Machine": NaNLabelEncoder(add_nan=True),
        "holidays": NaNLabelEncoder(add_nan=True),
        "Month": NaNLabelEncoder(add_nan=True),
        "Category": NaNLabelEncoder(add_nan=True),
        "dish_temperature": NaNLabelEncoder(add_nan=True),
        "DayOfWeek": NaNLabelEncoder(add_nan=True),
        "machineActivity": NaNLabelEncoder(add_nan=True),
    },
)

batch_size = 42  # set this between 32 and 128
train_dataloader_croston_simple = croston_simple_training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader_croston_simple = croston_simple_validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=True, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
croston_simple_logger = TensorBoardLogger(save_dir="lightning_logs", name = "croston_exploration")

croston_simple_trainer = pl.Trainer(
    max_epochs=100,
    accelerator="cpu",
    enable_model_summary=True,
    gradient_clip_val=croston_simple_study.best_trial.params["gradient_clip_val"],
    limit_train_batches=50,
    # fast_dev_run=True,
    callbacks=[lr_logger, early_stop_callback],
    logger=croston_simple_logger,
)

croston_simple_tft = TemporalFusionTransformer.from_dataset(
    croston_simple_training,
    learning_rate=croston_simple_study.best_trial.params['learning_rate'],
    hidden_size=croston_simple_study.best_trial.params['hidden_size'],
    attention_head_size=croston_simple_study.best_trial.params['attention_head_size'],
    dropout=croston_simple_study.best_trial.params['dropout'],
    hidden_continuous_size=croston_simple_study.best_trial.params['hidden_continuous_size'],
    loss=QuantileLoss(),
    log_interval=5,  # log example predictions every 5 batches
    optimizer="Ranger",
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {croston_simple_tft.size()/1e3:.1f}k")

# fit network
croston_simple_trainer.fit(
    croston_simple_tft,
    train_dataloaders=train_dataloader_croston_simple,
    val_dataloaders=val_dataloader_croston_simple,
)

RuntimeError                              Traceback (most recent call last)
Cell In[15], line 2
      1 # fit network
----> 2 croston_simple_trainer.fit(
      3     croston_simple_tft,
      4     train_dataloaders=train_dataloader_croston_simple,
      5     val_dataloaders=val_dataloader_croston_simple,
      6 )

File /opt/homebrew/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:531, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    529 model = _maybe_unwrap_optimized(model)
    530 self.strategy._lightning_module = model
--> 531 call._call_and_handle_interrupt(
    532     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    533 )

File /opt/homebrew/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     40     if trainer.strategy.launcher is not None:
     41         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42     return trainer_fn(*args, **kwargs)
     44 except _TunerExitException:
     45     _call_teardown_hook(trainer)

File /opt/homebrew/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:570, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
...
--> 204 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    205     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    206     allow_unreachable=True, accumulate_grad=True)

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
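
One check I can run to try to narrow this down (not sure it's the right one) is whether the model still has trainable parameters and whether a single forward pass stays attached to the autograd graph. This reuses the objects defined above and is only a sanity check, not a fix.

# how many parameter tensors still require gradients?
params = list(croston_simple_tft.parameters())
print(sum(p.requires_grad for p in params), "of", len(params), "parameter tensors require grad")

# does one forward pass produce a prediction connected to the graph?
x, y = next(iter(train_dataloader_croston_simple))
out = croston_simple_tft(x)
print("prediction requires_grad:", out["prediction"].requires_grad)
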
manitadayon commented 1 year ago

Hey, try changing the gradient clipping parameter and also playing with the learning rate.
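
For example (placeholder values, purely illustrative, just to show where those knobs live in the code above):

croston_simple_trainer = pl.Trainer(
    max_epochs=100,
    accelerator="cpu",
    gradient_clip_val=0.1,  # e.g. sweep values between 0.01 and 1.0
    callbacks=[lr_logger, early_stop_callback],
    logger=croston_simple_logger,
)
croston_simple_tft = TemporalFusionTransformer.from_dataset(
    croston_simple_training,
    learning_rate=0.03,  # e.g. sweep values between 1e-3 and 0.1
    loss=QuantileLoss(),
    optimizer="Ranger",
)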

DaniloMendezR commented 1 year ago

I've tried this for a couple of hours, but it doesn't seem to get me anywhere. I didn't have this problem for two weeks straight, but then it showed up again.

DaniloMendezR commented 1 year ago

Hey, try changing the gradient clipping parameter and also playing with the learning rate.

I'm trying to use the optimize_hyperparameters function to find the best hyperparameters instead of guessing, but it turns out that this problem persists within the Optuna optimization. Could this be a version problem? I'm stumped.
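
For context, I'm calling it roughly like in the tutorial; the model_path and the ranges below are illustrative, not my exact values:

from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

croston_simple_study = optimize_hyperparameters(
    train_dataloader_croston_simple,
    val_dataloader_croston_simple,
    model_path="optuna_croston_simple",
    n_trials=50,
    max_epochs=20,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30, accelerator="cpu"),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,
)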

lpdbrxx commented 10 months ago

Any updates on the error? I have the same one on a similar task.

DaniloMendezR commented 10 months ago

Unfortunately I have stopped using this package; there were too many issues going on at the same time. A shame, since it was working really well for a while.