rosinality / stylegan2-pytorch

Implementation of Analyzing and Improving the Image Quality of StyleGAN (StyleGAN 2) in PyTorch

Memory leaks after running overnight #100

Open sweihub opened 4 years ago

sweihub commented 4 years ago

Hi There

I ran stylegan2-pytorch overnight and the memory increased from an initial 3.5 GB to 22.6 GB, while the speed dropped from 4 s/it to 40 s/it, a roughly 10x slowdown. Could you check for a memory leak?

Thanks!

rosinality commented 4 years ago

I haven't seen memory/performance problems like this. Could you specify the conditions or parameters you used for training?

sweihub commented 4 years ago

Hi, thanks for the prompt reply. I am fighting with the memory leak; here are my environment and command line. The leak is in computer RAM, not GPU memory; GPU usage looks stable.

Training command: python train.py --iter 999999 --batch 16 --size 512 /data/db/portrait

What's weird is that when I use tracemalloc to trace the memory issue, the Python side looks normal, and I also call torch.cuda.empty_cache() at each epoch. Here is the output of the tracemalloc module:

[ Top 10 differences ]
/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py:110: size=7804 KiB (+7804 KiB), count=221927 (+221927), average=36 B
:580: size=300 KiB (+300 KiB), count=3266 (+3266), average=94 B
/usr/lib/python3.8/tracemalloc.py:111: size=74.1 KiB (+74.1 KiB), count=948 (+948), average=80 B
/usr/lib/python3.8/tracemalloc.py:125: size=68.7 KiB (+68.7 KiB), count=977 (+977), average=72 B
:219: size=44.2 KiB (+44.2 KiB), count=417 (+417), average=109 B
/usr/lib/python3.8/tracemalloc.py:472: size=38.6 KiB (+38.6 KiB), count=979 (+979), average=40 B
/home/swei/Pod/stylegan2-portrait/op/fused_act.py:97: size=38.2 KiB (+38.2 KiB), count=294 (+294), average=133 B
/home/swei/Pod/stylegan2-portrait/op/upfirdn2d.py:152: size=27.3 KiB (+27.3 KiB), count=199 (+199), average=140 B
/home/swei/Pod/stylegan2-portrait/op/upfirdn2d.py:97: size=22.5 KiB (+22.5 KiB), count=344 (+344), average=67 B
/home/swei/Pod/stylegan2-portrait/op/fused_act.py:58: size=15.1 KiB (+15.1 KiB), count=288 (+288), average=54 B

![1597465009012](https://user-images.githubusercontent.com/5334841/90310148-6767eb80-df21-11ea-9428-1e92eb066fa0.jpg)
sweihub commented 4 years ago

My tracemalloc instrumentation code:

diff --git a/train.py b/train.py
index 533091b..8ee93e1 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import tracemalloc

 try:
     import wandb
@@ -297,6 +298,14 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, devic
                 )
             )

+            # trace memory issue
+            torch.cuda.empty_cache()
+            snapshot2 = tracemalloc.take_snapshot()
+            top_stats = snapshot2.compare_to(snapshot1, 'lineno')
+            print("[ Top 10 differences ]")
+            for stat in top_stats[:10]:
+                print(stat)
+
             if wandb and args.wandb:
                 wandb.log(
                     {
@@ -464,4 +473,6 @@ if __name__ == "__main__":
     if get_rank() == 0 and wandb is not None and args.wandb:
         wandb.init(project="stylegan 2")

+    tracemalloc.start()
+    snapshot1 = tracemalloc.take_snapshot()
     train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
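
If the per-line stats are not specific enough, the same comparison can be grouped by full traceback so each growing allocation site shows its call stack. A minimal standalone sketch (separate from the diff above; the placement of the snapshots is illustrative):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation
snapshot1 = tracemalloc.take_snapshot()

# ... run a few training iterations here ...

snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, "traceback")

print("[ Top 5 allocation sites by traceback ]")
for stat in top_stats[:5]:
    print(f"{stat.size_diff / 1024:.1f} KiB in {stat.count_diff} new blocks")
    for line in stat.traceback.format():
        print(line)
```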
sweihub commented 4 years ago

The memory increases by about 1 MB every second, so I have to work around it: exit and restart from a wrapper script (see the sketch after the diff below).

diff --git a/train.py b/train.py
index 533091b..bbe7058 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import psutil

 try:
     import wandb
@@ -28,6 +29,13 @@ from distributed import (
 )
 from non_leaking import augment

+def memory_leaks():
+    process = psutil.Process(os.getpid())
+    rss = process.memory_info().rss / (1024.0 * 1024 * 1024)
+    if rss > 5.0:
+        print("current memory usage: %0.2f GB" % rss)
+        return True
+    return False

 def data_sampler(dataset, shuffle, distributed):
     if distributed:
@@ -325,7 +333,7 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, devic
                         range=(-1, 1),
                     )

+            quit = any_key_pressed() or memory_leaks()
             if i % 10000 == 0 or quit:
                 torch.save(
                     {
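
For reference, the exit-and-restart workaround could look roughly like the wrapper below. The --ckpt flag and the checkpoint/*.pt naming are assumptions about how train.py resumes and saves; adjust them to the actual script:

```python
# restart_train.py - relaunch train.py whenever it exits
# (e.g. after the memory_leaks() check above saves a checkpoint and quits).
import glob
import os
import subprocess
import sys
import time

CMD = [sys.executable, "train.py", "--iter", "999999", "--batch", "16",
       "--size", "512", "/data/db/portrait"]

while True:
    # resume from the newest checkpoint if one exists (path pattern assumed)
    ckpts = sorted(glob.glob("checkpoint/*.pt"), key=os.path.getmtime)
    cmd = CMD + (["--ckpt", ckpts[-1]] if ckpts else [])
    print("launching:", " ".join(cmd))
    ret = subprocess.call(cmd)
    print(f"train.py exited with code {ret}; restarting")
    time.sleep(10)  # avoid a tight restart loop on immediate failures
```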
rosinality commented 4 years ago

Hmm, maybe it is a PyTorch 1.6 & custom operator related problem. I will check it.

sweihub commented 4 years ago

Update: it might be related to dataset size / the data loader / LMDB ...

I decreased the dataset size from 200,000 images to 10,000, and now the memory usage is stable at 4.73 GB. Should we set a memory limit on the LMDB? Even though my computer has 128 GB of RAM, the more memory the process takes the worse it performs; I guess the slowdown kicks in once it reaches around 10 GB or more.
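
LMDB itself has no RAM cap per environment (map_size only bounds the address-space mapping), but the python-lmdb open flags below are the usual knobs for keeping its footprint down. This is a sketch of standard options, not necessarily what this repo's dataset.py uses; the path and key are illustrative:

```python
import lmdb

env = lmdb.open(
    "/data/db/portrait",
    readonly=True,
    lock=False,
    readahead=False,  # don't drag neighbouring pages into the OS page cache
    meminit=False,    # skip zero-initialising buffers
    max_readers=32,
)

with env.begin(write=False) as txn:
    # key scheme depends on how the database was prepared
    sample = txn.get(b"00000000")
```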

rosinality commented 4 years ago

Yes, maybe it is related to that. I have used lmdb for datasets larger than that without problems, but lmdb can have subtle interactions with multiprocessing. So I will try to use a better mechanism.
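
One commonly suggested pattern for the lmdb + multiprocessing interaction (a sketch of the general idea, not the mechanism adopted in this repo) is to open the LMDB environment lazily inside each DataLoader worker instead of in __init__, so the handle is never shared across the fork:

```python
import lmdb
from torch.utils.data import Dataset


class LMDBImageDataset(Dataset):
    """Sketch: open the LMDB environment lazily, once per worker process."""

    def __init__(self, path):
        self.path = path
        self.env = None  # created on first __getitem__ call in each worker

        # open briefly in the main process just to count entries
        env = lmdb.open(path, readonly=True, lock=False)
        with env.begin(write=False) as txn:
            self.length = txn.stat()["entries"]
        env.close()

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if self.env is None:
            self.env = lmdb.open(
                self.path, readonly=True, lock=False,
                readahead=False, meminit=False,
            )
        with self.env.begin(write=False) as txn:
            # key scheme is illustrative; match however the db was written
            data = txn.get(str(index).encode("utf-8"))
        return data
```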