sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Issues running TDS Stallion Example with W&B logger #79

Closed emigre459 closed 4 years ago

emigre459 commented 4 years ago

I'm running into issues with the W&B logger when replicating the Towards Data Science example with the Stallion dataset (to be fair, switching to TensorBoard also made it fail, although with a completely different-sounding error, oddly enough). When I try to train the model, I get a single graphical output (attached) and then it errors out. I'm using:

- pytorch=1.4.0
- pytorch-forecasting=0.4.1
- pytorch-lightning=0.9.0
- wandb=0.10.4

I know the requirements.txt indicates (py)torch >= 1.6, but I can't get conda to find a good solution for that in my dependency tree, and this seems to be a logger issue anyhow. Here's the full traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-562ea3edbba3> in <module>
     25     tft,
     26     train_dataloader=train_dataloader,
---> 27     val_dataloaders=val_dataloader
     28 )

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1082             self.accelerator_backend = CPUBackend(self)
   1083             self.accelerator_backend.setup(model)
-> 1084             results = self.accelerator_backend.train(model)
   1085 
   1086         # on fit end callback

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu_backend.py in train(self, model)
     37 
     38     def train(self, model):
---> 39         results = self.trainer.run_pretrain_routine(model)
     40         return results

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1222 
   1223         # run a few val batches before training starts
-> 1224         self._run_sanity_check(ref_model, model)
   1225 
   1226         # clear cache before training

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _run_sanity_check(self, ref_model, model)
   1255             num_loaders = len(self.val_dataloaders)
   1256             max_batches = [self.num_sanity_val_steps] * num_loaders
-> 1257             eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
   1258 
   1259             # allow no returns from eval

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py in _evaluate(self, model, dataloaders, max_batches, test_mode)
    331                         output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
    332                 else:
--> 333                     output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
    334 
    335                 is_result_obj = isinstance(output, Result)

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py in evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode)
    685             output = model.test_step(*args)
    686         else:
--> 687             output = model.validation_step(*args)
    688 
    689         return output

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py in validation_step(self, batch, batch_idx)
    138     def validation_step(self, batch, batch_idx):
    139         x, y = batch
--> 140         log, _ = self.step(x, y, batch_idx, label="val")
    141         return log
    142 

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
    595         """
    596         # extract data and run model
--> 597         log, out = super().step(x, y, batch_idx, label=label)
    598         # calculate interpretations etc for latter logging
    599         if self.log_interval(label == "train") > 0:

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label)
    223             log["loss"] = loss
    224         if self.log_interval(label == "train") > 0:
--> 225             self._log_prediction(x, out, batch_idx, label=label)
    226         return log, out
    227 

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py in _log_prediction(self, x, out, batch_idx, label)
    281                 else:
    282                     tag += f" of item {idx} in batch {batch_idx}"
--> 283                 self.logger.experiment.add_figure(
    284                     tag,
    285                     fig,

AttributeError: 'Run' object has no attribute 'add_figure'

It seems as though pytorch_forecasting is assuming that all loggers have the same add_figure() method, but clearly that's not the case in this version of W&B/pytorch-lightning. Any thoughts on a way to rectify this? I'd also be game for a workaround to disable the default figure generation during training, although it is very nice to get those super-informative figures, so I'd rather get them working if I could!

[Attached figure: TFT_Fig]
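
For reference, here's the rough workaround I'm considering on my end (untested sketch; the shim idea and the project name are my own, not part of either library's API):

from pytorch_lightning.loggers import WandbLogger
import matplotlib.pyplot as plt
import wandb

# Hypothetical shim: emulate TensorBoard's SummaryWriter.add_figure() on the
# W&B Run object so pytorch-forecasting's figure-logging call no longer raises.
wandb_logger = WandbLogger(project="stallion-tft")  # placeholder project name

def _add_figure(tag, figure, global_step=None, close=True):
    # Log the matplotlib figure as a W&B image under the same tag.
    wandb.log({tag: wandb.Image(figure)}, step=global_step)
    if close:
        plt.close(figure)

wandb_logger.experiment.add_figure = _add_figure
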
jdb78 commented 4 years ago

If you want to switch logging off, just pass log_interval=-1 (see the sketch below). If you are aware of a unified logging interface, I would be keen to use it. Alternatively, one could:

  1. write such a logging interface, e.g. one that could be passed as part of the __init__ arguments to the models.
  2. override the methods that implement logging. Most of them are already separate, such as _log_prediction().

From a user perspective option 1 would be preferable. Any thoughts?
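
To switch logging off in the tutorial setup, something along these lines should do (rough sketch; the hyperparameters other than log_interval are just placeholders from the tutorial, and `training` is the TimeSeriesDataSet):

from pytorch_forecasting import TemporalFusionTransformer

# Disable the built-in prediction-figure logging via log_interval=-1;
# most other hyperparameters are omitted for brevity.
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    log_interval=-1,  # a negative value switches the plots off
)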

For TensorBoard, keep in mind that you should uninstall tensorflow or use the workaround in #58. To install pytorch >= 1.6 with conda, you have to use the pytorch channel, i.e. conda install pytorch -c pytorch.

emigre459 commented 4 years ago

OK, cool. Once I can get my example up and running, I might take a whack at the unified logging interface, as I'm not aware of such a thing right now. When I disable logging as you suggested by setting log_interval=-1, I get the same error I saw earlier when trying to log via TensorBoard instead of W&B (I checked, and my environment does not have TensorFlow installed, but does have tensorboard=2.2.0). Here's the new traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-562ea3edbba3> in <module>
     25     tft,
     26     train_dataloader=train_dataloader,
---> 27     val_dataloaders=val_dataloader
     28 )

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1082             self.accelerator_backend = CPUBackend(self)
   1083             self.accelerator_backend.setup(model)
-> 1084             results = self.accelerator_backend.train(model)
   1085 
   1086         # on fit end callback

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu_backend.py in train(self, model)
     37 
     38     def train(self, model):
---> 39         results = self.trainer.run_pretrain_routine(model)
     40         return results

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1237 
   1238         # CORE TRAINING LOOP
-> 1239         self.train()
   1240 
   1241     def _run_sanity_check(self, ref_model, model):

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    392                 # RUN TNG EPOCH
    393                 # -----------------
--> 394                 self.run_training_epoch()
    395 
    396                 if self.max_steps and self.max_steps <= self.global_step:

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    477         # run epoch
    478         for batch_idx, (batch, is_last_batch) in self.profiler.profile_iterable(
--> 479                 enumerate(_with_is_last(train_dataloader)), "get_train_batch"
    480         ):
    481             # stop epoch if we limited the number of training batches

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/profiler/profilers.py in profile_iterable(self, iterable, action_name)
     76             try:
     77                 self.start(action_name)
---> 78                 value = next(iterator)
     79                 self.stop(action_name)
     80                 yield value

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in _with_is_last(iterable)
   1320     See `https://stackoverflow.com/a/1630350 <https://stackoverflow.com/a/1630350>`_"""
   1321     it = iter(iterable)
-> 1322     last = next(it)
   1323     for val in it:
   1324         # yield last and has next

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py in __getitem__(self, idx)
    965                 x_cont=data_cont,
    966                 encoder_length=encoder_length,
--> 967                 encoder_target=target[:encoder_length],
    968                 encoder_time_idx_start=time[0],
    969                 groups=groups,

TypeError: only integer tensors of a single element can be converted to an index

It seems to occur right after Lightning does the validation sanity checks, as part of building the first training batch. I'm not sure why the W&B approach seemed able to complete 10-20 batches (and display the first visual from them) before having issues, while the no-logging/TensorBoard approaches error out immediately, but I'm sure it's all related. Unfortunately, it's hard for me to tell from the code what all goes into setting encoder_length, although it's clearly a combination of the min and max encoder length parameters. Since those are just integers (0 and 24, per the tutorial), I'm not sure what the issue may be.
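
For context, my dataset construction follows the tutorial and looks roughly like this (trimmed sketch; data and training_cutoff come from the tutorial's preprocessing, and the feature-column arguments are omitted):

from pytorch_forecasting import TimeSeriesDataSet

# Trimmed version of the tutorial's dataset setup; the encoder/prediction
# lengths below are the plain Python ints mentioned above.
max_prediction_length = 6
max_encoder_length = 24

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],  # data/training_cutoff per the tutorial
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    min_encoder_length=0,
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
)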

Also, thanks for the tip on latest-and-greatest pytorch installs! I had forgotten that the pytorch channel wasn't in my current conda config. Sadly, updating to 1.6.0 didn't solve the problem. Maybe this explains it at least in part?

jdb78 commented 4 years ago

It looks like encoder_length is a tensor with more than one element, or maybe of type float? Can you run with a debugger and confirm that? If any of the passed length parameters are not integers, this might happen. I will add some validation tests.
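
Something like this in a notebook cell should confirm it (sketch; assuming your TimeSeriesDataSet instance is called training):

# All of these should come back as plain Python ints, not floats or
# multi-element tensors.
for name in ("min_encoder_length", "max_encoder_length",
             "min_prediction_length", "max_prediction_length"):
    value = getattr(training, name, None)
    print(name, type(value), value)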

emigre459 commented 4 years ago

Just submitted PR #82 to address this. Going to try and get W&B logging to work next!

emigre459 commented 4 years ago

@jdb78 Just found that #95 (which seems to be in release 0.5.0 if I'm not mistaken) doesn't seem to fix it on my end, so I'm re-opening this issue. Will try and work on an updated PR soon.

jdb78 commented 4 years ago

Are you on pytorch 1.6? Could you post the exact error?

emigre459 commented 4 years ago

I think I was having an issue with my dependencies. Re-testing doesn't turn up the error, so it should be good after all. My mistake!