pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Memory leak on GaussianBlur #6437

Open · GLivshits opened this issue 2 years ago

GLivshits commented 2 years ago

🐛 Describe the bug

Hello. When using num_workers > 0 for the DataLoader and placing GaussianBlur before Resize in the transforms (the images in the dataset have different sizes), a memory leak appears. The larger num_workers is, the faster memory grows (I ran out of 128 GB of RAM within 300 iterations with batch_size=32 and num_workers=16). To reproduce (initialize images with a list of file paths to images):

import torch
import glob
from torchvision import transforms
from PIL import Image
class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([transforms.GaussianBlur(7, (1, 5)),
                                          transforms.Resize((256, 256), antialias=True),
                                          transforms.ToTensor(),
                                          transforms.Normalize([0.5]*3, [0.5]*3) ])

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        out = self.tr_aug(img)
        return out

dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)
while True:
    for batch in dl:
        pass

Versions

torch: 1.12.0+cu116
torchvision: 0.13.0+cu116
PIL: 9.0.0
Ubuntu: 20.04.4 LTS

cc @vfdev-5 @datumbox

datumbox commented 2 years ago

@GLivshits, thanks for reporting this.

What's special about GaussianBlur is that it doesn't handle PIL images natively; it converts from PIL to Tensor and back. We'll have to check whether there is a leak somewhere during that conversion, but this is hard to pin down and might not be an issue on our side. It would help a lot if you could help us narrow this down a bit. Can you replace the PIL read with something like:

img = torchvision.io.read_file(self.images[idx])

You won't need a ToTensor() call in your transforms. Everything else remains the same. Do you still observe a memory leak?
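
For reference, a minimal single-process sketch of this kind of check (the synthetic image sizes and loop count are made up, not from this issue) that exercises the PIL-to-Tensor round-trip GaussianBlur performs for PIL inputs, measuring resident memory with psutil:

import os
import random

import psutil
from PIL import Image
from torchvision import transforms

blur = transforms.GaussianBlur(7, (1, 5))
p = psutil.Process(os.getpid())

for i in range(1000):
    # Synthetic RGB image of a random size, mimicking a dataset with mixed resolutions.
    w, h = random.randint(200, 1200), random.randint(200, 1200)
    img = Image.new("RGB", (w, h))
    # For PIL input, GaussianBlur converts PIL -> Tensor, blurs, and converts back.
    out = blur(img)
    if i % 100 == 0:
        print(i, "RSS MB:", p.memory_info().rss / 1024 / 1024)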

GLivshits commented 2 years ago

Replaced PIL.Image.open with torchvision.io.read_image (which outputs a uint8 tensor); it still leaks.

import torch
import torchvision
from torchvision import transforms

class TestDataset(torch.utils.data.Dataset):

    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([transforms.GaussianBlur(7, (1, 5)),
                                          transforms.Resize((256, 256)),
                                          transforms.Normalize([0.5]*3, [0.5]*3)])

    def __getitem__(self, idx):
        img = torchvision.io.read_image(self.images[idx]).type(torch.float32).div(255.)
        out = self.tr_aug(img)
        return out

dataset = TestDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size = 16, num_workers = 8, pin_memory = False)
while True:
    for batch in dl:
        pass

vfdev-5 commented 2 years ago

@GLivshits I tried to reproduce the issue by measuring the memory with your script + psutil:

import os
import psutil

import torch
from torchvision import transforms
from PIL import Image

class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose(
            [
                transforms.GaussianBlur(7, (1, 5)),
                transforms.Resize((256, 256), antialias=True),
                transforms.ToTensor(),
                transforms.Normalize([0.5]*3, [0.5]*3) 
            ]
        )

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        out = self.tr_aug(img)
        return out

images = ["test-image.jpg" for _ in range(1000)]

dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

p = psutil.Process(os.getpid())

epoch = 0
while epoch < 100:
    mem_usage = p.memory_info().rss / 1024 / 1024
    print(epoch, "- mem_usage:", mem_usage)
    for batch in dl:
        pass
    epoch += 1

I ran two experiments: 1) with GaussianBlur, as in the code above

Output ``` 0 - mem_usage: 204.5625 1 - mem_usage: 208.80078125 2 - mem_usage: 208.8359375 3 - mem_usage: 208.85546875 4 - mem_usage: 208.8671875 5 - mem_usage: 208.87890625 6 - mem_usage: 208.88671875 7 - mem_usage: 208.91015625 8 - mem_usage: 208.91015625 9 - mem_usage: 208.921875 10 - mem_usage: 208.921875 11 - mem_usage: 208.92578125 12 - mem_usage: 208.9296875 13 - mem_usage: 208.9375 14 - mem_usage: 208.9375 15 - mem_usage: 208.9375 16 - mem_usage: 208.94140625 17 - mem_usage: 208.9453125 18 - mem_usage: 208.9453125 19 - mem_usage: 208.94921875 20 - mem_usage: 208.94921875 21 - mem_usage: 208.95703125 22 - mem_usage: 208.9609375 23 - mem_usage: 208.9609375 24 - mem_usage: 208.96484375 25 - mem_usage: 208.96875 26 - mem_usage: 208.97265625 27 - mem_usage: 208.98828125 28 - mem_usage: 209.0 29 - mem_usage: 209.0 30 - mem_usage: 209.0 31 - mem_usage: 209.0 32 - mem_usage: 209.0 33 - mem_usage: 209.0 34 - mem_usage: 209.0 35 - mem_usage: 209.00390625 36 - mem_usage: 209.00390625 37 - mem_usage: 209.015625 38 - mem_usage: 209.015625 39 - mem_usage: 209.015625 40 - mem_usage: 209.015625 41 - mem_usage: 209.015625 42 - mem_usage: 209.01953125 43 - mem_usage: 209.01953125 44 - mem_usage: 209.0234375 45 - mem_usage: 209.0234375 46 - mem_usage: 209.03515625 47 - mem_usage: 209.03515625 48 - mem_usage: 209.03515625 49 - mem_usage: 209.03515625 50 - mem_usage: 209.03515625 51 - mem_usage: 209.03515625 52 - mem_usage: 209.03515625 53 - mem_usage: 209.03515625 54 - mem_usage: 209.03515625 55 - mem_usage: 209.0390625 56 - mem_usage: 209.0390625 57 - mem_usage: 209.04296875 58 - mem_usage: 209.04296875 59 - mem_usage: 209.04296875 60 - mem_usage: 209.04296875 61 - mem_usage: 209.04296875 62 - mem_usage: 209.04296875 63 - mem_usage: 209.04296875 64 - mem_usage: 209.04296875 65 - mem_usage: 209.04296875 66 - mem_usage: 209.04296875 67 - mem_usage: 209.04296875 68 - mem_usage: 209.04296875 69 - mem_usage: 209.04296875 70 - mem_usage: 209.04296875 71 - mem_usage: 209.04296875 72 - mem_usage: 209.04296875 73 - mem_usage: 209.046875 74 - mem_usage: 209.046875 75 - mem_usage: 209.046875 76 - mem_usage: 209.046875 77 - mem_usage: 209.046875 78 - mem_usage: 209.046875 79 - mem_usage: 209.046875 80 - mem_usage: 209.046875 81 - mem_usage: 209.046875 82 - mem_usage: 209.046875 83 - mem_usage: 209.046875 84 - mem_usage: 209.046875 85 - mem_usage: 209.046875 86 - mem_usage: 209.0546875 87 - mem_usage: 209.05859375 88 - mem_usage: 209.0625 89 - mem_usage: 209.0625 90 - mem_usage: 209.0625 91 - mem_usage: 209.0625 92 - mem_usage: 209.0625 93 - mem_usage: 209.0625 94 - mem_usage: 209.0625 95 - mem_usage: 209.0625 96 - mem_usage: 209.0625 97 - mem_usage: 209.0625 98 - mem_usage: 209.0625 99 - mem_usage: 209.0625 ```

2) without GaussianBlur, i.e. with

        self.tr_aug = transforms.Compose(
            [
                transforms.Resize((256, 256), antialias=True),
                transforms.ToTensor(),
                transforms.Normalize([0.5]*3, [0.5]*3) 
            ]
        )
Output ``` 0 - mem_usage: 204.5703125 1 - mem_usage: 209.0546875 2 - mem_usage: 209.09765625 3 - mem_usage: 209.109375 4 - mem_usage: 209.1171875 5 - mem_usage: 209.12109375 6 - mem_usage: 209.1328125 7 - mem_usage: 209.15625 8 - mem_usage: 209.19140625 9 - mem_usage: 209.1953125 10 - mem_usage: 209.203125 11 - mem_usage: 209.203125 12 - mem_usage: 209.20703125 13 - mem_usage: 209.20703125 14 - mem_usage: 209.2109375 15 - mem_usage: 209.2109375 16 - mem_usage: 209.21484375 17 - mem_usage: 209.234375 18 - mem_usage: 209.23828125 19 - mem_usage: 209.23828125 20 - mem_usage: 209.24609375 21 - mem_usage: 209.25 22 - mem_usage: 209.25 23 - mem_usage: 209.25 24 - mem_usage: 209.25390625 25 - mem_usage: 209.26171875 26 - mem_usage: 209.26171875 27 - mem_usage: 209.265625 28 - mem_usage: 209.265625 29 - mem_usage: 209.265625 30 - mem_usage: 209.265625 31 - mem_usage: 209.265625 32 - mem_usage: 209.265625 33 - mem_usage: 209.27734375 34 - mem_usage: 209.2890625 35 - mem_usage: 209.2890625 36 - mem_usage: 209.2890625 37 - mem_usage: 209.296875 38 - mem_usage: 209.30078125 39 - mem_usage: 209.3046875 40 - mem_usage: 209.30859375 41 - mem_usage: 209.30859375 42 - mem_usage: 209.3125 43 - mem_usage: 209.3203125 44 - mem_usage: 209.328125 45 - mem_usage: 209.33203125 46 - mem_usage: 209.33203125 47 - mem_usage: 209.33203125 48 - mem_usage: 209.33203125 49 - mem_usage: 209.33203125 50 - mem_usage: 209.33203125 51 - mem_usage: 209.33203125 52 - mem_usage: 209.33203125 53 - mem_usage: 209.33203125 54 - mem_usage: 209.33203125 55 - mem_usage: 209.33203125 56 - mem_usage: 209.33203125 57 - mem_usage: 209.33203125 58 - mem_usage: 209.33203125 59 - mem_usage: 209.33203125 60 - mem_usage: 209.33203125 61 - mem_usage: 209.33203125 62 - mem_usage: 209.3359375 63 - mem_usage: 209.3359375 64 - mem_usage: 209.3359375 65 - mem_usage: 209.3359375 66 - mem_usage: 209.3359375 67 - mem_usage: 209.3359375 68 - mem_usage: 209.3359375 69 - mem_usage: 209.3359375 70 - mem_usage: 209.3359375 71 - mem_usage: 209.3359375 72 - mem_usage: 209.3359375 73 - mem_usage: 209.3359375 74 - mem_usage: 209.3359375 75 - mem_usage: 209.3359375 76 - mem_usage: 209.3359375 77 - mem_usage: 209.3359375 78 - mem_usage: 209.3359375 79 - mem_usage: 209.3359375 80 - mem_usage: 209.3359375 81 - mem_usage: 209.33984375 82 - mem_usage: 209.33984375 83 - mem_usage: 209.33984375 84 - mem_usage: 209.33984375 85 - mem_usage: 209.33984375 86 - mem_usage: 209.33984375 87 - mem_usage: 209.33984375 88 - mem_usage: 209.33984375 89 - mem_usage: 209.33984375 90 - mem_usage: 209.33984375 91 - mem_usage: 209.33984375 92 - mem_usage: 209.33984375 93 - mem_usage: 209.33984375 94 - mem_usage: 209.33984375 95 - mem_usage: 209.33984375 96 - mem_usage: 209.33984375 97 - mem_usage: 209.33984375 98 - mem_usage: 209.33984375 99 - mem_usage: 209.34375 ```

My torch / torchvision versions: 1.13.0.dev20220704+cpu / 0.14.0a0

I see that memory consumption grows in both logs. Can you detail how you identified GaussianBlur as the cause of the memory leak?

GLivshits commented 2 years ago

@vfdev-5 I'm just watching htop. The thing is that if all your images are the same size, everything works fine (I also tried a dataset made of a single image and there is no memory leak). But if the images come in multiple sizes, the leak appears. It seems like some memory is reserved for the blurring operation per input size, and whenever a tensor with a new size arrives, more memory is reserved and the leak shows up.
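
To make that hypothesis testable without image files, here is a rough sketch (the synthetic random sizes, dataset length, and worker count are assumptions) that pushes variably sized tensors through the same blur-before-resize pipeline in DataLoader workers and sums RSS over the main process and its children:

import os
import random

import psutil
import torch
from torchvision import transforms

class RandomSizeDataset(torch.utils.data.Dataset):
    def __init__(self, length=2000, vary_size=True):
        self.length = length
        self.vary_size = vary_size
        self.tr = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),            # blur before resize
            transforms.Resize((256, 256), antialias=True),
        ])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Either a new random size per sample or a fixed size, for comparison.
        h, w = (random.randint(200, 1200), random.randint(200, 1200)) if self.vary_size else (512, 512)
        img = torch.rand(3, h, w)                          # synthetic image tensor
        return self.tr(img)

dl = torch.utils.data.DataLoader(RandomSizeDataset(vary_size=True), batch_size=16, num_workers=8)
p = psutil.Process(os.getpid())
for i, batch in enumerate(dl):
    if i % 20 == 0:
        # Sum memory of the main process and all DataLoader workers.
        total = p.memory_info().rss + sum(c.memory_info().rss for c in p.children(recursive=True))
        print(i, "total RSS MB:", total / 1024 / 1024)

If the per-size hypothesis holds, memory should keep growing with vary_size=True and stay roughly flat with vary_size=False.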

GLivshits commented 2 years ago

Even when a single image is used at different sizes, there is still a ~4 GB RAM overhead.

GLivshits commented 2 years ago

@vfdev-5 I've excluded every other augmentation and swapped the order of blur and resize, and the leak only appears when blur comes before resize. The code provided is already a minimal, localized version of the problem.
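
Based on that observation, one possible workaround (a reading of this thread, not a confirmed fix) is to reorder the original pipeline so Resize runs first and GaussianBlur always sees a fixed 256x256 input:

    def init_base_transform(self):
        # Resize before blur, so the blur always operates on a fixed-size image.
        self.tr_aug = transforms.Compose([transforms.Resize((256, 256), antialias=True),
                                          transforms.GaussianBlur(7, (1, 5)),
                                          transforms.ToTensor(),
                                          transforms.Normalize([0.5]*3, [0.5]*3)])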