sweihub opened this issue 4 years ago
I haven't seen memory/performance problems like this. Could you specify the conditions or parameters you used for training?
Hi, thanks for the prompt reply. I am fighting with the memory leak issue; here are my environment and command line. The leak is in the computer's RAM, not the GPU; GPU memory seems stable.
Training parameters
python train.py --iter 999999 --batch 16 --size 512 /data/db/portrait
What's weird is that I used tracemalloc to trace the memory issue, but the Python side seems normal, and I also call torch.cuda.empty_cache() at each epoch. Here is the output of the tracemalloc module.
[ Top 10 differences ]
/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py:110: size=7804 KiB (+7804 KiB), count=221927 (+221927), average=36 B
My tracemalloc instrumentation code:
diff --git a/train.py b/train.py
index 533091b..8ee93e1 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import tracemalloc
 
 try:
     import wandb
@@ -297,6 +298,14 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device):
                 )
             )
 
+            # trace memory issue
+            torch.cuda.empty_cache()
+            snapshot2 = tracemalloc.take_snapshot()
+            top_stats = snapshot2.compare_to(snapshot1, 'lineno')
+            print("[ Top 10 differences ]")
+            for stat in top_stats[:10]:
+                print(stat)
+
             if wandb and args.wandb:
                 wandb.log(
                     {
@@ -464,4 +473,6 @@ if __name__ == "__main__":
     if get_rank() == 0 and wandb is not None and args.wandb:
         wandb.init(project="stylegan 2")
 
+    tracemalloc.start()
+    snapshot1 = tracemalloc.take_snapshot()
     train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
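In case it helps anyone dig deeper, a drill-down like the following can print the full allocation tracebacks behind that sampler.py line. This is only a sketch, not part of my patch, and it assumes tracemalloc.start(25) was used so that enough frames are recorded per allocation:

# Sketch: drill into the sampler.py allocations reported above.
# Requires tracemalloc.start(25) instead of plain tracemalloc.start(),
# so that up to 25 frames are stored for each allocation.
import tracemalloc

snapshot = tracemalloc.take_snapshot()
sampler_only = snapshot.filter_traces(
    [tracemalloc.Filter(inclusive=True, filename_pattern="*/sampler.py")]
)
for stat in sampler_only.statistics("traceback")[:3]:
    print(stat)
    for line in stat.traceback.format():
        print(line)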
The memory increases by about 1 MB every second, so I have to work around it by exiting and restarting with a bash script.
diff --git a/train.py b/train.py
index 533091b..bbe7058 100755
--- a/train.py
+++ b/train.py
@@ -10,6 +10,7 @@ from torch.utils import data
 import torch.distributed as dist
 from torchvision import transforms, utils
 from tqdm import tqdm
+import psutil
 
 try:
     import wandb
@@ -28,6 +29,13 @@ from distributed import (
 )
 from non_leaking import augment
 
+def memory_leaks():
+    process = psutil.Process(os.getpid())
+    rss = process.memory_info().rss / (1024.0 * 1024 * 1024)
+    if rss > 5.0:
+        print("current memory usage: %0.2f GB" % rss)
+        return True
+    return False
 
 def data_sampler(dataset, shuffle, distributed):
     if distributed:
@@ -325,7 +333,7 @@ def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device):
                         range=(-1, 1),
                     )
 
+            quit = any_key_pressed() or memory_leaks()
             if i % 10000 == 0 or quit:
                 torch.save(
                     {
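For reference, the bash workaround boils down to a loop like this (sketched here in Python; it assumes checkpoints land in checkpoint/ and that train.py's --ckpt flag is used to resume, as in this repo):

# Sketch of the restart workaround: rerun train.py whenever it exits
# (it saves a checkpoint and quits once memory_leaks() trips), then
# resume from the newest checkpoint. Paths/flags assumed, not verified.
import glob
import os
import subprocess

while True:
    checkpoints = sorted(glob.glob("checkpoint/*.pt"), key=os.path.getmtime)
    cmd = ["python", "train.py", "--iter", "999999",
           "--batch", "16", "--size", "512"]
    if checkpoints:
        cmd += ["--ckpt", checkpoints[-1]]  # resume from the newest one
    cmd.append("/data/db/portrait")
    subprocess.run(cmd)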
Hmm, maybe it is a PyTorch 1.6 & custom operator related problem. I will look into it.
Update: it might be related to the dataset size / data loader / an LMDB issue ...
I decreased the dataset size from 200,000 images to 10,000, and now the memory usage is stable at 4.73 GB. Should we set a memory limit for the LMDB? Even though my computer has 128 GB of RAM, a process that takes more memory performs worse; I guess once it reaches 10 GB or more, the performance degradation kicks in.
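A quick check I could run (a sketch; psutil's memory_full_info() is slower than memory_info() and needs OS support, e.g. Linux): compare USS against RSS. If USS stays flat while RSS climbs, the growth is mostly the shared, reclaimable page cache of the LMDB mmap rather than a true leak:

# Sketch: compare RSS (includes shared LMDB mmap pages) against USS
# (memory unique to this process) to tell page-cache growth from a leak.
import os
import psutil

process = psutil.Process(os.getpid())
info = process.memory_full_info()
print("rss: %.2f GB, uss: %.2f GB" % (info.rss / 2**30, info.uss / 2**30))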
Yes, maybe it is related to that. I have used LMDB with datasets larger than that without problems, but LMDB can have subtle interactions with multiprocessing, so I will try to use better mechanisms.
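One such mechanism might be to open the LMDB environment lazily in each DataLoader worker instead of in __init__, so a single handle is never shared across forked processes. A rough sketch (illustrative names and key scheme, not the current dataset.py):

# Sketch: open the LMDB environment lazily, once per worker process,
# so the handle is never shared across forked DataLoader workers.
import lmdb
from torch.utils.data import Dataset


class LazyLMDBDataset(Dataset):
    def __init__(self, path, length):
        self.path = path
        self.length = length
        self.env = None  # opened on first access, once per process

    def _open(self):
        # readahead=False keeps the OS from pulling large runs of the
        # mmap into the page cache on random access.
        self.env = lmdb.open(
            self.path,
            readonly=True,
            lock=False,
            readahead=False,
            meminit=False,
        )

    def __getitem__(self, index):
        if self.env is None:
            self._open()
        with self.env.begin(write=False) as txn:
            return txn.get(str(index).encode("utf-8"))

    def __len__(self):
        return self.length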
Hi There
I ran stylegan2-pytorch overnight; the memory increased from an initial 3.5 GB to 22.6 GB, and the performance dropped from 4 s/it to 40 s/it, a 10x deterioration. Would you look into the memory leak issue?
Thanks!