huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Schedulers cause memory accumulation across folds in cross-validation? #1134

Closed JohnGiorgi closed 4 years ago

JohnGiorgi commented 5 years ago

❓ Questions & Help

I am facing a strange issue when using the schedulers available in this library within a cross-validation loop. Basically, in each fold I initialize a new model, optimizer, and scheduler. GPU memory accumulates across folds until I eventually get a CUDA out-of-memory error.

The simplest example I could come up with to reproduce the error is:

import torch
from pytorch_transformers import WarmupConstantSchedule, WarmupCosineSchedule, WarmupLinearSchedule, WarmupCosineWithHardRestartsSchedule

# In my actual project, this is a for loop over the k-folds of k-fold cross-validation.
# In this example I use a while just to demonstrate the OOM error.
while True:
    net = torch.nn.Linear(10000, 10000)
    net = net.cuda()

    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
    scheduler = WarmupCosineWithHardRestartsSchedule(optimizer, 1, 1000)

    # I also tried all the other schedulers. Same issue.
    # scheduler = WarmupConstantSchedule(optimizer, 1)
    # scheduler = WarmupCosineSchedule(optimizer, 1, 1000)
    # scheduler = WarmupLinearSchedule(optimizer, 1, 1000)

    del net, optimizer, scheduler

This will run until it (very quickly) uses up all 12 GB on my Titan XP GPU. To make sure the scheduler initialization was truly the cause, I also tested:

import torch
from pytorch_transformers import WarmupCosineWithHardRestartsSchedule

while True:
    net = torch.nn.Linear(10000, 10000)
    net = net.cuda()

    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

    del net, optimizer

This time I did not see the memory accumulation or the OOM error.

My question is: why does creating a scheduler cause GPU memory to accumulate across folds, and how can I fully release the model, optimizer, and scheduler between folds?

Thanks a lot.

TIANRENK commented 5 years ago

I am facing the same issue. When I use WarmupLinearSchedule, I get a CUDA out-of-memory error at the 7th training epoch.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rlouf commented 4 years ago

Running import gc, then gc.collect(), and emptying the GPU's cache should solve the issue temporarily. See #1742