libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Memory Leaks #206

Closed rmrafailov closed 1 year ago

rmrafailov commented 2 years ago

Thank you very much for this project! I've been trying to integrate it into my training pipeline, but have been running into some issues. I created a custom dataset consisting of a combination of images, arrays, and scalars. I then create a loader in the following way:

loader = Loader('custom.beton', 
                batch_size=256, 
                num_workers=12,
                os_cache=False,
                batches_ahead=3,
                order=OrderOption.QUASI_RANDOM)

and time the loader:

import time

start = time.time()

for i in range(1000):
    batch = next(iter(loader))

end = time.time()
print(end-start)
  1. The sampling is very fast, but RAM fills up pretty quickly and the process crashes. The dataset is relatively small (2 GB) and should easily fit into memory.
  2. One thing that helped was to set IS_CUDA = False in epoch_iterator.py. This works well, but keeps throwing an assertion error:

Exception in thread Thread-8648:
Traceback (most recent call last):
  File "/home/rafael/anaconda3/envs/ffcv/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/rafael/anaconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 79, in run
    result = self.run_pipeline(b_ix, ixes, slot, events[slot])
  File "/home/rafael/anaconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 108, in run_pipeline
    self.memory_context.start_batch(b_ix)
  File "/home/rafael/anaconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 54, in start_batch
    self.executor.load_batch(batch)
  File "/home/rafael/anaconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/schedule.py", line 114, in load_batch
    assert current_batch == self.next_batch

  3. Setting os_cache = True fixes the issue above, but the memory blow-up happens again and the process crashes.

Any guidance on what is actually happening here?
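
(For context, a file like custom.beton is typically produced beforehand with DatasetWriter. The sketch below shows one plausible way to write a dataset with image, array, and scalar fields; the field names, array shape, dtype, and max_resolution are illustrative assumptions, not taken from the report above.)

import numpy as np

from ffcv.fields import FloatField, NDArrayField, RGBImageField
from ffcv.writer import DatasetWriter

# `dataset` is assumed to be an indexed dataset whose samples are
# (image, array, scalar) tuples, in the same order as the fields below.
writer = DatasetWriter(
    'custom.beton',
    fields=dict(
        image=RGBImageField(max_resolution=256),
        array=NDArrayField(dtype=np.dtype('float32'), shape=(10,)),
        scalar=FloatField(),
    ),
)
writer.from_indexed_dataset(dataset)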

pmeier commented 1 year ago

I'm hitting the same plain assert, but I don't think it has anything to do with memory. Here is an MWE to reproduce it; it shows no noticeable change in memory usage:

$ wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
$ unzip tiny-imagenet-200.zip

import pathlib
import time

import PIL.Image
from torchvision.datasets import ImageFolder

from ffcv.fields import IntField, RGBImageField
from ffcv.fields.basics import IntDecoder
from ffcv.fields.decoders import SimpleRGBImageDecoder
from ffcv.loader import Loader
from ffcv.writer import DatasetWriter

input_path = str(pathlib.Path.cwd() / "tiny-imagenet-200" / "train")
output_path = str(pathlib.Path.cwd() / "tiny-imagenet-train.beton")

dataset = ImageFolder(input_path)

# smoke tests
assert len(dataset) == 100_000
image, label = dataset[0]
assert isinstance(image, PIL.Image.Image)
assert isinstance(label, int)

writer = DatasetWriter(
    output_path,
    fields=dict(
        image=RGBImageField(),
        label=IntField(),
    ),
)

writer.from_indexed_dataset(dataset)

loader = Loader(
    output_path,
    batch_size=1,
    pipelines=dict(
        image=[SimpleRGBImageDecoder()],
        label=[IntDecoder()],
    ),
    # without this, the error goes away
    os_cache=False,
)

next(iter(loader))

# quick and dirty way to get the full stack trace from threads
time.sleep(1)

pmeier commented 1 year ago

I think I found the issue. The problem is that current_batch stays at 0 after the first call

https://github.com/libffcv/ffcv/blob/f25386557e213711cc8601833add36ff966b80b2/ffcv/memory_managers/process_cache/schedule.py#L113-L114

while self.next_batch gets incremented:

https://github.com/libffcv/ffcv/blob/f25386557e213711cc8601833add36ff966b80b2/ffcv/memory_managers/process_cache/schedule.py#L129-L131

Going up the trace, we see that current_batch is b_ix in self.run_pipeline:

https://github.com/libffcv/ffcv/blob/f25386557e213711cc8601833add36ff966b80b2/ffcv/loader/epoch_iterator.py#L72-L79

b_ix will only be incremented if PyTorch is compiled with CUDA support:

https://github.com/libffcv/ffcv/blob/f25386557e213711cc8601833add36ff966b80b2/ffcv/loader/epoch_iterator.py#L90-L101

That was not the case for my env and thus the internal assert triggered. @rmrafailov can you confirm that you also used a CPU-only PyTorch version?
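
To make the mismatch concrete, here is a simplified paraphrase of the two counters (not FFCV's actual code): the scheduler-side next_batch advances on every load_batch call, while the batch index handed down from the epoch iterator only advances on the CUDA branch, so a CPU-only build trips the assert on the second batch.

IS_CUDA = False    # e.g. a CPU-only PyTorch build

next_batch = 0     # scheduler-side counter, incremented on every load_batch call
current_batch = 0  # value passed down from the epoch iterator (b_ix)

for step in range(3):
    # schedule.py effectively does: assert current_batch == self.next_batch
    if current_batch != next_batch:
        print(f"step {step}: assert would fail ({current_batch} != {next_batch})")
        break
    next_batch += 1
    if IS_CUDA:
        current_batch += 1  # only advanced on the CUDA branch of run_pipeline

# With IS_CUDA == False, current_batch stays 0 and step 1 already hits the mismatch.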

pmeier commented 1 year ago

@rmrafailov Looking at your code, the memory leak could come from the fact that you are creating a new iterator on every step, so whatever resources each iterator allocates presumably never get released. Compare

for _ in range(1000):
    batch = next(iter(loader))

with

it = iter(loader)

for _ in range(1000):
    batch = next(it)
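
Applied to the original timing snippet, that would look roughly like this (a sketch; iterating over the loader directly reuses a single iterator, and the enumerate guard stops after 1000 batches or at the end of the epoch, whichever comes first):

import time

start = time.time()

# Reuse one iterator instead of creating a new one per step.
for i, batch in enumerate(loader):
    if i == 999:
        break

end = time.time()
print(end - start)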