sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Error: cannot allocate memory #183

Closed vlavorini closed 3 years ago

vlavorini commented 3 years ago

Hello, I'm trying to follow the tutorial but with my own data. The dataset size is 14 MB.

When I run the learning rate finder, I get this error:

RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 39424204800 bytes. Error code 12 (Cannot allocate memory)

But the dataset is actually quite small, and the server I am using has 96 GB of RAM.

What am I missing?

Full traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-a92b5627800b> in <module>
      5     val_dataloaders=val_dataloader,
      6     max_lr=10.0,
----> 7     min_lr=1e-6,
      8 )
      9 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/tuner/tuning.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
    128             mode,
    129             early_stop_threshold,
--> 130             datamodule,
    131         )
    132 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/tuner/lr_finder.py in lr_find(trainer, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
    170                 train_dataloader=train_dataloader,
    171                 val_dataloaders=val_dataloaders,
--> 172                 datamodule=datamodule)
    173 
    174     # Prompt if we stopped early

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    443         self.call_hook('on_fit_start')
    444 
--> 445         results = self.accelerator_backend.train()
    446         self.accelerator_backend.teardown()
    447 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py in train(self)
     57 
     58         # train or test
---> 59         results = self.train_or_test()
     60         return results
     61 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     64             results = self.trainer.run_test()
     65         else:
---> 66             results = self.trainer.train()
     67         return results
     68 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    492 
    493                 # run train epoch
--> 494                 self.train_loop.run_training_epoch()
    495 
    496                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    559             # TRAINING_STEP + TRAINING_STEP_END
    560             # ------------------------------------
--> 561             batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    562 
    563             # when returning -1 from train_step, we end epoch early

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    726 
    727                         # optimizer step
--> 728                         self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    729 
    730                     else:

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure, *args, **kwargs)
    468             # optimizer step lightningModule hook
    469             self.trainer.accelerator_backend.optimizer_step(
--> 470                 optimizer, batch_idx, opt_idx, train_step_and_backward_closure, *args, **kwargs
    471             )
    472 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in optimizer_step(self, optimizer, batch_idx, opt_idx, lambda_closure, *args, **kwargs)
    122             using_lbfgs=is_lbfgs,
    123             *args,
--> 124             **kwargs,
    125         )
    126 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs, *args, **kwargs)
   1378             optimizer.step(*args, **kwargs)
   1379         else:
-> 1380             optimizer.step(closure=optimizer_closure, *args, **kwargs)
   1381 
   1382     def optimizer_zero_grad(

~/miniconda3/envs/py37/lib/python3.7/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     65                 instance._step_count += 1
     66                 wrapped = func.__get__(instance, cls)
---> 67                 return wrapped(*args, **kwargs)
     68 
     69             # Note that the returned function here is no longer a bound method,

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/optim.py in step(self, closure)
    129             closure: A closure that reevaluates the model and returns the loss.
    130         """
--> 131         _ = closure()
    132         loss = None
    133         # note - below is commented out b/c I have other work that passes back

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
    721                                 opt_idx,
    722                                 optimizer,
--> 723                                 self.trainer.hiddens
    724                             )
    725                             return None if result is None else result.loss

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    811         """
    812         # lightning module hook
--> 813         result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
    814         self._curr_step_result = result
    815 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py in training_step(self, split_batch, batch_idx, opt_idx, hiddens)
    318         with self.trainer.profiler.profile("model_forward"):
    319             args = self.build_train_args(split_batch, batch_idx, opt_idx, hiddens)
--> 320             training_step_output = self.trainer.accelerator_backend.training_step(args)
    321             self._check_training_step_output(training_step_output)
    322 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py in training_step(self, args)
     65                 output = self.trainer.model.training_step(*args)
     66         else:
---> 67             output = self.trainer.model.training_step(*args)
     68         return output
     69 

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py in training_step(self, batch, batch_idx)
    154         """
    155         x, y = batch
--> 156         log, _ = self.step(x, y, batch_idx, label="train")
    157         # log loss
    158         self.log("train_loss", log["loss"], on_step=True, on_epoch=True, prog_bar=True)

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
    520         """
    521         # extract data and run model
--> 522         log, out = super().step(x, y, batch_idx, label=label)
    523         # calculate interpretations etc for latter logging
    524         if self.log_interval(label == "train") > 0:

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label, **kwargs)
    220             loss = loss * (1 + monotinicity_loss)
    221         else:
--> 222             out = self(x, **kwargs)
    223             out["prediction"] = self.transform_output(out)
    224 

~/miniconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in forward(self, x)
    484             v=attn_input,
    485             mask=self.get_attention_mask(
--> 486                 encoder_lengths=encoder_lengths, decoder_length=timesteps - max_encoder_length
    487             ),
    488         )

~/miniconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/sub_modules.py in forward(self, q, k, v, mask)
    431             qs = self.q_layers[i](q)
    432             ks = self.k_layers[i](k)
--> 433             head, attn = self.attention(qs, ks, vs, mask)
    434             head_dropout = self.dropout(head)
    435             heads.append(head_dropout)

~/miniconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/miniconda3/envs/py37/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/sub_modules.py in forward(self, q, k, v, mask)
    388         if self.scale:
    389             dimension = torch.sqrt(torch.tensor(k.shape[-1]).to(torch.float32))
--> 390             attn = attn / dimension
    391 
    392         if mask is not None:

RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 39424204800 bytes. Error code 12 (Cannot allocate memory)
LukeOliv commented 3 years ago

Every time I've had that error, it has been because there wasn't enough memory free on the PC. The memory demands are pretty big.

jdb78 commented 3 years ago

What are your network size, batch size, max encoder length, and max prediction length?

vlavorini commented 3 years ago

So fast, thank you!

Number of parameters in network: 43.6k
Batch size: 128
Max prediction length: 8760

And the memory is almost entirely free when I execute the code.
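
For context, a quick back-of-envelope estimate shows how a 14 MB dataset can still trigger a ~39 GB allocation: the self-attention score matrix grows with batch size times prediction length times total sequence length, independent of the dataset's size on disk. A minimal sketch, assuming float32 scores, a score matrix of shape (batch, decoder_length, encoder_length + decoder_length), and a max_encoder_length of 30 (the encoder length was not reported in this thread):

```python
# Rough estimate of the attention score matrix built during one forward pass.
# Only batch_size=128 and max_prediction_length=8760 come from this thread;
# the matrix shape and max_encoder_length=30 are assumptions for illustration.
batch_size = 128
max_prediction_length = 8760
max_encoder_length = 30           # assumed, not reported above
timesteps = max_encoder_length + max_prediction_length
bytes_per_float32 = 4

attn_bytes = batch_size * max_prediction_length * timesteps * bytes_per_float32
print(f"{attn_bytes:,} bytes ≈ {attn_bytes / 1e9:.1f} GB")
# ~39 GB, the same order of magnitude as the failed allocation
```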

jdb78 commented 3 years ago

Your prediction length is too large for training (though maybe not for inference). 43.6k * (8760 + max_encoder_length) * 128 is a very large number. Either resample your time series to a higher level of time aggregation, or only use such long prediction lengths for prediction. I recommend the former, because then you are consistent between training and prediction.
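
To make the resampling route concrete, here is a minimal sketch assuming an hourly series in a long-format pandas DataFrame. The column names ("timestamp", "series", "value"), the toy data, the daily target frequency, and the chosen lengths are all illustrative assumptions, not values from this thread:

```python
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# toy hourly series standing in for the real data (two years, one group)
idx = pd.date_range("2018-01-01", periods=2 * 8760, freq="H")
df = pd.DataFrame({
    "timestamp": idx,
    "series": "A",
    "value": np.random.rand(len(idx)),
})

# aggregate to daily means: an 8760-step hourly horizon becomes ~365 daily steps
daily = (
    df.set_index("timestamp")
      .groupby("series")["value"]
      .resample("D")
      .mean()
      .reset_index()
)
daily["time_idx"] = daily.groupby("series").cumcount()

training = TimeSeriesDataSet(
    daily,
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=60,        # illustrative: two months of history
    max_prediction_length=365,    # one year ahead at daily resolution
)
```

The point is simply that the same real-world horizon (one year) costs 365 attention positions at daily resolution instead of 8760 at hourly resolution, which shrinks the attention memory by orders of magnitude.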

vlavorini commented 3 years ago

I've reduced the prediction length to what I really need, and of course it works now. Thank you!