GLivshits opened this issue 2 years ago
@GLivshits thanks for reporting this.
What's special about GaussianBlur is that it doesn't natively handle PIL images: it converts from PIL to Tensor and back. We'll have to check whether there is a leak somewhere in that conversion, but this is hard and might not be an issue on our side. It would help a lot if you could help us narrow this down a bit. Can you replace the PIL read with something like:
```python
img = torchvision.io.read_image(self.images[idx])
```
You won't need a ToTensor() call in your transforms. Everything else remains the same. Do you still observe a memory leak?
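(For context, the PIL path inside GaussianBlur amounts to roughly the following round-trip; this is a simplified sketch, not the actual torchvision source:)

```python
import torchvision.transforms.functional as F
from PIL import Image

img = Image.open("test-image.jpg").convert("RGB")

# GaussianBlur has no native PIL kernel, so internally it roughly does:
t = F.pil_to_tensor(img)                          # PIL -> uint8 tensor (C, H, W)
t = F.gaussian_blur(t, kernel_size=7, sigma=1.0)  # blur via the tensor conv path
out = F.to_pil_image(t)                           # tensor -> PIL again
```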
Replaced PIL.Image.open with torchvision.io.read_image (it outputs a uint8 tensor); still leaks.
```python
import torch
import torchvision
from torchvision import transforms

class TestDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),
            transforms.Resize((256, 256)),
            transforms.Normalize([0.5] * 3, [0.5] * 3),
        ])

    def __getitem__(self, idx):
        # read_image returns a uint8 (C, H, W) tensor; convert to float in [0, 1]
        img = torchvision.io.read_image(self.images[idx]).type(torch.float32).div(255.)
        out = self.tr_aug(img)
        return out

dataset = TestDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

while True:
    for batch in dl:
        pass
```
@GLivshits I tried to reproduce the issue by measuring the memory with your script + psutil:
```python
import os
import psutil
import torch
from torchvision import transforms
from PIL import Image

class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose(
            [
                transforms.GaussianBlur(7, (1, 5)),
                transforms.Resize((256, 256), antialias=True),
                transforms.ToTensor(),
                transforms.Normalize([0.5] * 3, [0.5] * 3),
            ]
        )

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        out = self.tr_aug(img)
        return out

images = ["test-image.jpg" for _ in range(1000)]
dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

p = psutil.Process(os.getpid())
epoch = 0
while epoch < 100:
    mem_usage = p.memory_info().rss / 1024 / 1024
    print(epoch, "- mem_usage:", mem_usage)
    for batch in dl:
        pass
    epoch += 1
```
I did 2 experiments:
1) with GaussianBlur, as in the code above;
2) without GaussianBlur, i.e. with

```python
self.tr_aug = transforms.Compose(
    [
        transforms.Resize((256, 256), antialias=True),
        transforms.ToTensor(),
        transforms.Normalize([0.5] * 3, [0.5] * 3),
    ]
)
```
My torch / torchvision versions: '1.13.0.dev20220704+cpu', '0.14.0a0'.
In both logs I see memory consumption growing. Can you detail how you identified GaussianBlur as the cause of the memory leak?
@vfdev-5 I'm just watching htop. The thing is that if your images are all the same size, everything works fine (I also tried loading a dataset of a single image, and there is no memory leak). But if the images come in multiple sizes, the leak appears. It seems some memory is reserved for the blurring operation, and whenever a tensor of a previously unseen size comes in, more memory leaks.
Even when a single image is used at different sizes, there is still a ~4 GB RAM overhead.
@vfdev-5 I've excluded every other augmentation and swapped the order of blur and resize, and found the matching leaky configuration. The code provided is already a localized version of the problem.
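(A hypothetical way to isolate this outside the DataLoader, based on the varying-size observation above, would be to feed GaussianBlur bare tensors of changing sizes in a single-process loop and watch RSS; the sizes below are made up for illustration:)

```python
import os
import psutil
import torch
from torchvision import transforms

blur = transforms.GaussianBlur(7, (1, 5))
# cycle through several spatial sizes to mimic a dataset of differently sized images
sizes = [(3, 480, 640), (3, 512, 512), (3, 600, 800), (3, 720, 1280)]
p = psutil.Process(os.getpid())

for i in range(100000):
    img = torch.rand(sizes[i % len(sizes)])
    blur(img)
    if i % 1000 == 0:
        print(i, "- mem_usage:", p.memory_info().rss / 1024 / 1024)
```

If RSS climbs here, the leak is in the transform itself; if it stays flat, that points at the interaction with DataLoader workers instead.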
🐛 Describe the bug
Hello. When using num_workers > 0 in the DataLoader and GaussianBlur placed BEFORE Resize in the transforms (with the dataset containing images of different sizes), a memory leak appears. The larger num_workers, the faster memory grows (I ran out of 128 GB of RAM within 300 iterations with batch_size=32 and num_workers=16). To reproduce, initialize `images` with an array of file paths to images:
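(The original repro script isn't preserved in this extract; judging from the follow-up comments it was essentially the following PIL-based version, reconstructed here rather than quoted verbatim:)

```python
import torch
from torchvision import transforms
from PIL import Image

class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        # GaussianBlur BEFORE Resize: the leaky configuration
        self.tr_aug = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),
            transforms.Resize((256, 256)),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),
        ])

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        return self.tr_aug(img)

# `images` is a user-supplied list of file paths to differently sized images
dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=16, pin_memory=False)

while True:
    for batch in dl:
        pass
```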
Versions
torch: 1.12.0+cu116
torchvision: 0.13.0+cu116
PIL: 9.0.0
Ubuntu: 20.04.4 LTS
cc @vfdev-5 @datumbox