If you want to switch logging off, just pass log_interval=-1. If you are aware of a unified logging interface, I would be keen to use it. Alternatively, one could add __init__ arguments to the models to control _log_prediction(). From a user perspective, option 1 would be preferable. Any thoughts?
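For illustration, here is a minimal sketch of passing log_interval=-1 when building the model; the toy DataFrame and hyperparameters below are placeholders, not taken from the tutorial:

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# Toy data, only to make the example self-contained.
data = pd.DataFrame({
    "time_idx": list(range(30)) * 2,
    "value": [float(i) for i in range(30)] * 2,
    "group": ["a"] * 30 + ["b"] * 30,
})

training = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="value",
    group_ids=["group"],
    max_encoder_length=10,
    max_prediction_length=5,
    time_varying_unknown_reals=["value"],
)

# A negative log_interval switches off prediction/figure logging.
tft = TemporalFusionTransformer.from_dataset(training, log_interval=-1)
```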
For tensorboard, keep in mind that you should uninstall tensorflow or use the workaround in #58. To install pytorch >= 1.6 with conda, you have to use the pytorch channel, i.e. conda install pytorch -c pytorch.
OK, cool. Once I can get my example up and running, I might take a whack at the unified logging interface, as I'm not aware of such a thing right now. When I disable logging as you've said by setting log_interval=-1, I get the same error I saw earlier when trying to log via TensorBoard instead of W&B (I checked and my environment does not have TensorFlow installed, but it does have tensorboard=2.2.0). Here's the new traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-562ea3edbba3> in <module>
25 tft,
26 train_dataloader=train_dataloader,
---> 27 val_dataloaders=val_dataloader
28 )
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
46 if entering is not None:
47 self.state = entering
---> 48 result = fn(self, *args, **kwargs)
49
50 # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
1082 self.accelerator_backend = CPUBackend(self)
1083 self.accelerator_backend.setup(model)
-> 1084 results = self.accelerator_backend.train(model)
1085
1086 # on fit end callback
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu_backend.py in train(self, model)
37
38 def train(self, model):
---> 39 results = self.trainer.run_pretrain_routine(model)
40 return results
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
1237
1238 # CORE TRAINING LOOP
-> 1239 self.train()
1240
1241 def _run_sanity_check(self, ref_model, model):
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
392 # RUN TNG EPOCH
393 # -----------------
--> 394 self.run_training_epoch()
395
396 if self.max_steps and self.max_steps <= self.global_step:
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
477 # run epoch
478 for batch_idx, (batch, is_last_batch) in self.profiler.profile_iterable(
--> 479 enumerate(_with_is_last(train_dataloader)), "get_train_batch"
480 ):
481 # stop epoch if we limited the number of training batches
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/profiler/profilers.py in profile_iterable(self, iterable, action_name)
76 try:
77 self.start(action_name)
---> 78 value = next(iterator)
79 self.stop(action_name)
80 yield value
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in _with_is_last(iterable)
1320 See `https://stackoverflow.com/a/1630350 <https://stackoverflow.com/a/1630350>`_"""
1321 it = iter(iterable)
-> 1322 last = next(it)
1323 for val in it:
1324 # yield last and has next
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
343
344 def __next__(self):
--> 345 data = self._next_data()
346 self._num_yielded += 1
347 if self._dataset_kind == _DatasetKind.Iterable and \
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
383 def _next_data(self):
384 index = self._next_index() # may raise StopIteration
--> 385 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
386 if self._pin_memory:
387 data = _utils.pin_memory.pin_memory(data)
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/opt/conda/envs/DIU_NORAD/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py in __getitem__(self, idx)
965 x_cont=data_cont,
966 encoder_length=encoder_length,
--> 967 encoder_target=target[:encoder_length],
968 encoder_time_idx_start=time[0],
969 groups=groups,
TypeError: only integer tensors of a single element can be converted to an index
It seems to occur right after Lightning does the validation sanity checks, as part of building the first training batch. I'm not sure why the W&B approach seemed able to complete 10-20 batches (and display the first visual from these) before having issues, yet the no-logs/TensorBoard approaches error out immediately, but I'm sure it's all related. Unfortunately, it's hard for me to tell from the code what goes into setting encoder_length, although it's clearly a combination of the min and max encoder length parameters. Since those are just integers (0 and 24, per the tutorial), I'm not sure what the issue may be.
Also, thanks for the tip on latest-and-greatest pytorch installs! I had forgotten that the pytorch channel wasn't in my current conda config. Sadly, updating to 1.6.0 didn't solve the problem. Maybe this explains it at least in part?
It looks like encoder_length is a tensor with more than one element, or maybe of type float? Can you run with a debugger and confirm that? If any of the passed length parameters are not integers, this might happen. I will add some validation tests.
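For what it's worth, that error message can be reproduced in isolation when the slice bound is a multi-element or float tensor (a quick sketch, independent of pytorch_forecasting):

```python
import torch

target = torch.arange(10)

# A zero-dim integer tensor works fine as a slice bound:
print(target[: torch.tensor(5)])  # tensor([0, 1, 2, 3, 4])

# A multi-element tensor (or a float tensor) raises the same TypeError
# seen in the traceback above.
for bad_length in (torch.tensor([5, 6]), torch.tensor(5.0)):
    try:
        target[:bad_length]
    except TypeError as e:
        print(e)  # only integer tensors of a single element can be converted to an index
```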
Just submitted PR #82 to address this. Going to try and get W&B logging to work next!
@jdb78 Just found that #95 (which seems to be in release 0.5.0 if I'm not mistaken) doesn't seem to fix it on my end, so I'm re-opening this issue. Will try and work on an updated PR soon.
Are you on pytorch 1.6? Could you post the exact error?
I think I was having an issue with my dependencies. Re-testing doesn't turn up the error, so should be good after all. My mistake!
I'm seeing issues trying to run the W&B logger when replicating the Towards Data Science example with the Stallion dataset (to be fair, switching to TensorBoard made it fail too, oddly for a completely different-sounding reason). When I try to train the model, I get a single graphical output (attached) and then it errors out. I'm using:
I know the requirements.txt indicates (py)torch >= 1.6, but I can't get conda to find a good solution for that in my dependency tree, and this seems to be a logger issue anyhow. Here's the full traceback:
It seems as though pytorch_forecasting is assuming that all loggers have the same add_figure() method, but clearly that's not the case in this version of W&B/pytorch-lightning. Any thoughts on a way to rectify this? I'd also be game for a workaround to disable the default figure generation during training, although it is very nice to get those super-informative figures, so I'd rather get them working if I could!
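As one possible direction (purely a sketch, not how the library currently handles it), the figure call could be guarded by the capabilities of the active logger, e.g. with a hypothetical helper along these lines:

```python
import matplotlib.pyplot as plt

def log_figure_safely(logger, tag: str, fig: plt.Figure, global_step: int) -> None:
    """Hypothetical helper: log a matplotlib figure only if the underlying
    experiment object supports it, instead of assuming add_figure() exists."""
    experiment = getattr(logger, "experiment", None)
    if hasattr(experiment, "add_figure"):
        # TensorBoard-style SummaryWriter
        experiment.add_figure(tag, fig, global_step=global_step)
    elif hasattr(experiment, "log"):
        # W&B run object; wandb.Image accepts a matplotlib figure
        import wandb
        experiment.log({tag: wandb.Image(fig)}, step=global_step)
    # otherwise skip figure logging silently
```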