rosinality / stylegan2-pytorch

Implementation of Analyzing and Improving the Image Quality of StyleGAN (StyleGAN 2) in PyTorch

Memory leaks after running overnight #100

Open sweihub opened 4 years ago

sweihub commented 4 years ago

Hi There

I ran stylegan2-pytorch overnight and the memory increased from an initial 3.5 GB to 22.6 GB, while the speed dropped from 4 s/it to 40 s/it, a roughly 10x slowdown. Could you check for a memory leak?

Thanks!

rosinality commented 4 years ago

I haven't seen memory/performance problems like this. Could you specify the conditions or parameters you used for training?

sweihub commented 4 years ago

Hi, thanks for the prompt reply. I am fighting with the memory leak; here are my environment and command line. The leak is in computer RAM, not GPU memory; GPU usage looks stable.

Training command: python train.py --iter 999999 --batch 16 --size 512 /data/db/portrait

What's weird is that when I use tracemalloc to trace the memory issue, the Python side looks normal, and I also call torch.cuda.empty_cache() at each epoch. Here is the output of the tracemalloc module:

[ Top 10 differences ]
/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py:110: size=7804 KiB (+7804 KiB), count=221927 (+221927), average=36 B
:580: size=300 KiB (+300 KiB), count=3266 (+3266), average=94 B
/usr/lib/python3.8/tracemalloc.py:111: size=74.1 KiB (+74.1 KiB), count=948 (+948), average=80 B
/usr/lib/python3.8/tracemalloc.py:125: size=68.7 KiB (+68.7 KiB), count=977 (+977), average=72 B
:219: size=44.2 KiB (+44.2 KiB), count=417 (+417), average=109 B
/usr/lib/python3.8/tracemalloc.py:472: size=38.6 KiB (+38.6 KiB), count=979 (+979), average=40 B
/home/swei/Pod/stylegan2-portrait/op/fused_act.py:97: size=38.2 KiB (+38.2 KiB), count=294 (+294), average=133 B
/home/swei/Pod/stylegan2-portrait/op/upfirdn2d.py:152: size=27.3 KiB (+27.3 KiB), count=199 (+199), average=140 B
/home/swei/Pod/stylegan2-portrait/op/upfirdn2d.py:97: size=22.5 KiB (+22.5 KiB), count=344 (+344), average=67 B
/home/swei/Pod/stylegan2-portrait/op/fused_act.py:58: size=15.1 KiB (+15.1 KiB), count=288 (+288), average=54 B

![1597465009012](https://user-images.githubusercontent.com/5334841/90310148-6767eb80-df21-11ea-9428-1e92eb066fa0.jpg)
sweihub commented 4 years ago

My tracemalloc instrumentation code:

diff --git a/train.py b/train.py
index 533091b..8ee93e1 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import tracemalloc

 try:
     import wandb
@@ -297,6 +298,14 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, devic
                 )
             )

+            # trace memory issue
+            torch.cuda.empty_cache()
+            snapshot2 = tracemalloc.take_snapshot()
+            top_stats = snapshot2.compare_to(snapshot1, 'lineno')
+            print("[ Top 10 differences ]")
+            for stat in top_stats[:10]:
+                print(stat)
+
             if wandb and args.wandb:
                 wandb.log(
                     {
@@ -464,4 +473,6 @@ if __name__ == "__main__":
     if get_rank() == 0 and wandb is not None and args.wandb:
         wandb.init(project="stylegan 2")

+    tracemalloc.start()
+    snapshot1 = tracemalloc.take_snapshot()
     train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
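
If the per-line stats are not specific enough, the same comparison can be grouped by full traceback so each growing allocation site shows its call stack. A minimal standalone sketch (separate from the diff above; the placement of the snapshots is illustrative):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation
snapshot1 = tracemalloc.take_snapshot()

# ... run a few training iterations here ...

snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, "traceback")

print("[ Top 5 allocation sites by traceback ]")
for stat in top_stats[:5]:
    print(f"{stat.size_diff / 1024:.1f} KiB in {stat.count_diff} new blocks")
    for line in stat.traceback.format():
        print(line)
```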
sweihub commented 4 years ago

The memory increases by about 1 MB every second, so I have to work around it: exit and restart from a wrapper script (see the sketch after the diff below).

diff --git a/train.py b/train.py
index 533091b..bbe7058 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import psutil

 try:
     import wandb
@@ -28,6 +29,13 @@ from distributed import (
 )
 from non_leaking import augment

+def memory_leaks():
+    process = psutil.Process(os.getpid())
+    rss = process.memory_info().rss / (1024.0 * 1024 * 1024)
+    if rss > 5.0:
+        print("current memory usage: %0.2f GB" % rss)
+        return True
+    return False

 def data_sampler(dataset, shuffle, distributed):
     if distributed:
@@ -325,7 +333,7 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, devic
                         range=(-1, 1),
                     )

+            quit = any_key_pressed() or memory_leaks()
             if i % 10000 == 0 or quit:
                 torch.save(
                     {
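
For reference, the exit-and-restart workaround could look roughly like the wrapper below. The --ckpt flag and the checkpoint/*.pt naming are assumptions about how train.py resumes and saves; adjust them to the actual script:

```python
# restart_train.py - relaunch train.py whenever it exits
# (e.g. after the memory_leaks() check above saves a checkpoint and quits).
import glob
import os
import subprocess
import sys
import time

CMD = [sys.executable, "train.py", "--iter", "999999", "--batch", "16",
       "--size", "512", "/data/db/portrait"]

while True:
    # resume from the newest checkpoint if one exists (path pattern assumed)
    ckpts = sorted(glob.glob("checkpoint/*.pt"), key=os.path.getmtime)
    cmd = CMD + (["--ckpt", ckpts[-1]] if ckpts else [])
    print("launching:", " ".join(cmd))
    ret = subprocess.call(cmd)
    print(f"train.py exited with code {ret}; restarting")
    time.sleep(10)  # avoid a tight restart loop on immediate failures
```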
rosinality commented 4 years ago

Hmm, maybe it is a PyTorch 1.6 & custom operator related problem. I will check it.

sweihub commented 4 years ago

Update: it might be related to dataset size / the data loader / LMDB ...

I decreased the dataset size from 200,000 images to 10,000, and now the memory usage is stable at 4.73 GB. Should we set a memory limit on the LMDB? Even though my computer has 128 GB of RAM, the more memory the process takes the worse it performs; I guess the slowdown kicks in once it reaches around 10 GB or more.
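
LMDB itself has no RAM cap per environment (map_size only bounds the address-space mapping), but the python-lmdb open flags below are the usual knobs for keeping its footprint down. This is a sketch of standard options, not necessarily what this repo's dataset.py uses; the path and key are illustrative:

```python
import lmdb

env = lmdb.open(
    "/data/db/portrait",
    readonly=True,
    lock=False,
    readahead=False,  # don't drag neighbouring pages into the OS page cache
    meminit=False,    # skip zero-initialising buffers
    max_readers=32,
)

with env.begin(write=False) as txn:
    # key scheme depends on how the database was prepared
    sample = txn.get(b"00000000")
```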

rosinality commented 4 years ago

Yes, maybe it is related to that. I have used lmdb for datasets larger than that without problems, but lmdb can have subtle interactions with multiprocessing. So I will try to use a better mechanism.
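
One commonly suggested pattern for the lmdb + multiprocessing interaction (a sketch of the general idea, not the mechanism adopted in this repo) is to open the LMDB environment lazily inside each DataLoader worker instead of in __init__, so the handle is never shared across the fork:

```python
import lmdb
from torch.utils.data import Dataset


class LMDBImageDataset(Dataset):
    """Sketch: open the LMDB environment lazily, once per worker process."""

    def __init__(self, path):
        self.path = path
        self.env = None  # created on first __getitem__ call in each worker

        # open briefly in the main process just to count entries
        env = lmdb.open(path, readonly=True, lock=False)
        with env.begin(write=False) as txn:
            self.length = txn.stat()["entries"]
        env.close()

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if self.env is None:
            self.env = lmdb.open(
                self.path, readonly=True, lock=False,
                readahead=False, meminit=False,
            )
        with self.env.begin(write=False) as txn:
            # key scheme is illustrative; match however the db was written
            data = txn.get(str(index).encode("utf-8"))
        return data
```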