sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Multi GPU Memory keeps increasing while training TFT #486

Open · fabsen opened this issue 3 years ago

fabsen commented 3 years ago

Expected behavior

I am following the TFT tutorial but want to train on multiple GPUs.

Actual behavior

RAM usage increases drastically over time until we get a memory error ("Cannot allocate memory ...").

Changing to log_interval=-1 gets rid of the problem. Training on a single GPU also does not increase RAM usage.

Code to reproduce the problem

Steps that differ from the tutorial:

  1. Omit the "learning rate finder" part
  2. Add/replace these two arguments in the pl.Trainer: gpus=[0, 1], accelerator='ddp' (see the sketch below)
  3. Increase max_epochs and the early-stopping patience so that training does not stop early

/edit: For clarification: RAM usage keeps increasing, not VRAM (which is okay).
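
For reference, a minimal sketch of just the pieces that change relative to the tutorial, assuming the tutorial's training TimeSeriesDataSet and dataloaders (training, train_dataloader, val_dataloader) are already built as in the docs. Trainer argument names are those of the PyTorch Lightning release current at the time; newer releases use accelerator="gpu", devices=[0, 1], strategy="ddp" instead:

```python
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer

trainer = pl.Trainer(
    max_epochs=100,        # step 3: large enough that training does not stop early
    gpus=[0, 1],           # step 2: train on two GPUs
    accelerator="ddp",     # step 2: distributed data parallel
    gradient_clip_val=0.1,
)

tft = TemporalFusionTransformer.from_dataset(
    training,              # the tutorial's training TimeSeriesDataSet (assumed in scope)
    learning_rate=0.03,
    hidden_size=16,
    log_interval=10,       # leaks host RAM in this multi-GPU setup; log_interval=-1 avoids it
)

trainer.fit(tft, train_dataloader, val_dataloader)
```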

jdb78 commented 3 years ago

Oh, this is interesting. Probably the figures are not properly closed. Thanks for pointing this out. I wonder if this is an issue related to the PyTorch Lightning TensorBoardLogger.
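
For illustration only, a hypothetical sketch of the suspected pattern (the helper name and writer handling are not pytorch-forecasting's actual logging code): a matplotlib figure created at every logged step and never closed stays alive, so host RAM grows with the number of logging steps; closing the figure after handing it to the logger releases it.

```python
import matplotlib.pyplot as plt

def log_figure(writer, tag, fig, step):
    """Hypothetical helper: log a figure to a TensorBoard SummaryWriter, then release it."""
    writer.add_figure(tag, fig, global_step=step)
    plt.close(fig)  # without an explicit close, figures accumulate across log steps
```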

alexcolpitts96 commented 2 years ago

This still seems to be an issue. I had been training in a Docker container and thus not seeing the plots.

After training completed outside a container, my system would almost crash from the sheer number of open figures. I will take a look at fixing the plot generation issue.

sayanb-7c6 commented 2 years ago

@fabsen Thank you for the log_interval=-1 solution. I faced the same issue while training in DDP mode on 4x NVIDIA V100. This was a major hurdle for scalability. Libraries I'm using:

pytorch-forecasting==0.9.0
pytorch-lightning==1.6.5
torch==1.11.0
torchmetrics==0.5.0
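
For reference, how the workaround mentioned above is applied, assuming the tutorial's training TimeSeriesDataSet is in scope; a non-positive log_interval disables the prediction-plot logging that produces the figures:

```python
from pytorch_forecasting import TemporalFusionTransformer

tft = TemporalFusionTransformer.from_dataset(
    training,          # the tutorial's training TimeSeriesDataSet (assumed in scope)
    learning_rate=0.03,
    hidden_size=16,
    log_interval=-1,   # do not log prediction plots, so no figures pile up
)
```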

galigutta commented 1 year ago

I experienced the same issue today after upgrading my environment after a while. I was already using log_interval=-1.

I am not using DDP mode either. I have a multi-GPU setup but only use one GPU at a time via os.environ["CUDA_VISIBLE_DEVICES"] (see the sketch below).

Library versions are:

pytorch-forecasting==1.0.0
pytorch-lightning==2.1.1
torch==2.0.1
torchmetrics==1.2.0
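
A minimal sketch of that single-GPU setup; the GPU index is illustrative, and the key point is that CUDA_VISIBLE_DEVICES must be set before torch initializes CUDA for it to take effect:

```python
import os

# Expose only one physical GPU to this process; set this before torch touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # expected: 1
```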

furkanbr commented 10 months ago

> I experienced the same issue today after trying to upgrade my environment after a while. I was already using log_interval = -1
>
> I am not using DDP mode either. I do have a multiple GPU setup but am only using 1 GPU at a time using os.environ["CUDA_VISIBLE_DEVICES"]
>
> library versions are:
>
> pytorch-forecasting==1.0.0 pytorch-lightning==2.1.1 torch==2.0.1 torchmetrics==1.2.0

I am facing the same problem. I tried log_interval=-1 but it did not make any difference. Were you able to solve it?