I'm hitting the same plain `assert`, but I don't think it has anything to do with memory. Here is an MWE to reproduce it that shows no noticeable change in memory usage:
```sh
$ wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
$ unzip tiny-imagenet-200.zip
```
```python
import pathlib
import time

import PIL.Image
from torchvision.datasets import ImageFolder

from ffcv.fields import IntField, RGBImageField
from ffcv.fields.basics import IntDecoder
from ffcv.fields.decoders import SimpleRGBImageDecoder
from ffcv.loader import Loader
from ffcv.writer import DatasetWriter

input_path = str(pathlib.Path.cwd() / "tiny-imagenet-200" / "train")
output_path = str(pathlib.Path.cwd() / "tiny-imagenet-train.beton")

dataset = ImageFolder(input_path)

# smoke tests
assert len(dataset) == 100_000
image, label = dataset[0]
assert isinstance(image, PIL.Image.Image)
assert isinstance(label, int)

writer = DatasetWriter(
    output_path,
    fields=dict(
        image=RGBImageField(),
        label=IntField(),
    ),
)
writer.from_indexed_dataset(dataset)

loader = Loader(
    output_path,
    batch_size=1,
    pipelines=dict(
        image=[SimpleRGBImageDecoder()],
        label=[IntDecoder()],
    ),
    # without this, the error goes away
    os_cache=False,
)

next(iter(loader))

# quick and dirty way to get the full stack trace from threads
time.sleep(1)
```
I think I found the issue. The problem is that `current_batch` stays `0` after the first call, while `self.next_batch` keeps getting incremented. Going up the trace, we see that `current_batch` is `b_idx` in `self.run_pipeline`, and it is only incremented if PyTorch is compiled with CUDA support.
That was not the case for my environment, which is why the internal assert triggered (`torch.cuda.is_available()` returns `False` there). @rmrafailov, can you confirm that you were also using a CPU-only PyTorch build?
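To make the failure mode concrete, here is a minimal sketch of the suspected interplay. The names mirror the trace, but this is not FFCV's actual source; in particular, the `torch.cuda.is_available()` guard stands in for whatever CUDA-only branch advances the index:

```python
import torch

class EpochIteratorSketch:
    """Hypothetical illustration of the index mismatch, not FFCV code."""

    def __init__(self):
        self.current_batch = 0  # advanced by the pipeline
        self.next_batch = 0     # advanced by the consumer

    def run_pipeline(self, b_idx):
        # ... decoding / host-to-device transfer would happen here ...
        if torch.cuda.is_available():
            # On a CPU-only build this branch never runs,
            # so current_batch is stuck at 0.
            self.current_batch = b_idx + 1

    def __next__(self):
        b_idx = self.next_batch
        self.run_pipeline(b_idx)
        self.next_batch += 1
        # Invariant: the pipeline has caught up with the consumer.
        # On CPU-only PyTorch, next_batch grows while current_batch
        # stays 0, so this assert fires.
        assert self.next_batch == self.current_batch
        return b_idx

it = EpochIteratorSketch()
next(it)  # AssertionError on a CPU-only build, passes with CUDA
```

Under this sketch, the invariant holds on a CUDA build and breaks on a CPU-only one, matching the behavior above.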
@rmrafailov Looking at your code, the memory leak could come from the fact that you are creating the iterator over and over. Compare
```python
for _ in range(1000):
    batch = next(iter(loader))
```
with
```python
it = iter(loader)
for _ in range(1000):
    batch = next(it)
```
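If every `iter(loader)` call sets up fresh per-epoch state (buffers, worker threads) that is never exhausted and released, the first pattern accumulates that state. For epoch-based training, the usual pattern is to let the `for` loop manage one iterator per epoch:

```python
num_epochs = 10  # illustrative value

for epoch in range(num_epochs):
    # The inner loop creates one iterator per epoch and runs it to
    # exhaustion, so its resources can be reclaimed afterwards.
    for batch in loader:
        pass  # training step goes here
```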
Thank you very much for this project! I've been trying to use it in my training pipeline, but I have been running into some issues. I created a custom dataset consisting of a combination of images, arrays, and scalars. I then create a loader in the following way:

and time the loader:
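(The two snippets did not survive here. As a rough illustration only, a loader over such mixed fields plus a timing loop might look like the sketch below; the field names, batch size, and file name are my assumptions, not the original code.)

```python
import time

from ffcv.fields.basics import FloatDecoder
from ffcv.fields.decoders import NDArrayDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader

# Hypothetical .beton with an image, an array, and a scalar per sample.
loader = Loader(
    "mixed-dataset.beton",  # assumed file name
    batch_size=32,
    pipelines=dict(
        image=[SimpleRGBImageDecoder()],
        array=[NDArrayDecoder()],
        scalar=[FloatDecoder()],
    ),
    os_cache=False,
)

# Time one full pass over the loader.
start = time.time()
for batch in loader:
    pass
print(f"epoch time: {time.time() - start:.2f}s")
```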
Setting `os_cache=True` fixes the issue above, but then the memory blow-up happens again and the process crashes.
Any guidance on what is actually happening here?