libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Memory Leak in Ffcv Loader? #345

Closed OliverTED closed 1 year ago

OliverTED commented 1 year ago

I think you have a memory leak in ffcv:

import threading

# loader is an ffcv.loader.Loader constructed beforehand
for _ in range(10000):
    it = iter(loader)
    print(threading.active_count())  # shows an ever-increasing thread count
    # it.close(); it.join()  # these lines would fix it again

To my current understanding, this is the reason: the EpochIterator is a Thread object. The thread is only stopped once the EpochIterator is garbage collected, but a Thread object is only ever garbage collected after it has stopped. Thus the manual close() is necessary, and a join() is needed as well for the more extreme stress test above.
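
To illustrate the Thread/garbage-collection interaction in plain Python (a minimal sketch, independent of ffcv):

    import gc
    import threading
    import weakref

    # A running Thread is referenced internally by the threading module,
    # so dropping our last reference does not free it:
    stop = threading.Event()
    t = threading.Thread(target=stop.wait)
    t.start()
    ref = weakref.ref(t)

    del t
    gc.collect()
    print(ref() is None)  # False: still alive while the thread runs

    t = ref()  # re-acquire a strong reference before stopping the thread
    stop.set()
    t.join()
    del t
    gc.collect()
    print(ref() is None)  # True (on CPython): collectable once stopped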

MinghuiChen43 commented 1 year ago

I've come across the same memory leak while using ffcv: the GPU memory just keeps growing with each epoch during training.

MinghuiChen43 commented 1 year ago

Oh, I just found a similar issue that has already been solved (https://github.com/libffcv/ffcv/issues/206#issuecomment-1234249649). We shouldn't create an iterator within the loop when using the ffcv loader.
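
In other words, the intended pattern looks roughly like this (a sketch; the Loader arguments and num_epochs are placeholders):

    from ffcv.loader import Loader

    loader = Loader(...)  # construct the loader once (arguments elided)

    for epoch in range(num_epochs):
        for batch in loader:  # iterate the loader object directly each
            ...               # epoch instead of calling iter() yourself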

andrewilyas commented 1 year ago

Hi all,

Sorry for the late response to this issue! @MinghuiChen43 is correct that creating an iterator within the loop is not supported (and you shouldn't have to, as just iterating through the loader without calling iter should work). Closing this now, but feel free to re-open in case other issues pop up!

OliverTED commented 1 year ago

Hi! Sorry, but I think there is a misunderstanding here, or at least the current state is not optimal.

For us, the problem above came up during normal training. I traced the leak back to the iterators in ffcv; the code above is just a minimal reproduction.

As far as I understand, the default way of writing e.g. LightningDataModules creates a new DataLoader per epoch. It's easy to recreate the dataloader during training, and even easier to create another iterator for the dataset. It would not 'feel' right to have an IterableDataset that leaks memory each time you iterate it, where the user has to remember to iterate only once.
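
For example, a hypothetical sketch of that pattern (with reload_dataloaders_every_n_epochs=1, Lightning calls train_dataloader() again for every epoch):

    import pytorch_lightning as pl
    from ffcv.loader import Loader

    class FFCVDataModule(pl.LightningDataModule):
        def train_dataloader(self):
            # Called again on every reload, so each epoch constructs a
            # fresh Loader and, once iterated, a fresh EpochIterator.
            return Loader(...)  # arguments elided

    trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1)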

And if that really is the intended usage, it should at least be documented ("Attention: this iterable dataset should be iterated only once, as each iterator leaks memory. Create only one ffcv Loader and only one iterator from it.").

But on the other hand, fixing it is quite simple, and I can create a pull request.

Here is a quick fix for other people who have the same problem:

    import ffcv
    from ffcv.loader.epoch_iterator import EpochIterator

    loader = ffcv.loader.Loader(*args, **kwargs)

    # There is a memory leak in ffcv. The following is a workaround:
    # wrap each epoch iterator so that its worker thread is closed and
    # joined as soon as the wrapper is garbage collected.
    class FFCVClosingIterator:
        def __init__(self, it):
            self.it: EpochIterator = it

        def __next__(self):
            return next(self.it)

        def __iter__(self):
            return self

        def __del__(self):
            # Explicitly stop the EpochIterator thread and wait for it,
            # instead of relying on garbage collection (which never
            # happens for a still-running Thread object).
            self.it.close()
            self.it.join()

    class FFCVClosingIterable:
        def __init__(self, it):
            self.it = it

        def __iter__(self):
            return FFCVClosingIterator(iter(self.it))

    loader = FFCVClosingIterable(loader)

This should resolve the memory leak in all use cases.
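
With the wrapper in place, the stress test from the top of this issue should no longer accumulate threads:

    import threading

    for _ in range(10000):
        it = iter(loader)  # loader is the wrapped FFCVClosingIterable
        print(threading.active_count())  # should now stay roughly constant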