Closed SerezD closed 1 month ago
Using order=OrderOption.RANDOM will attempt to load all of the data into memory in order to do a "perfect" shuffle. Using OrderOption.QUASI_RANDOM instead keeps a much smaller amount of the data in memory while still allowing for shuffling, although it doesn't work (for now) in distributed settings. Does using this other option for the order resolve the issue?
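The distinction can be sketched in plain Python: a quasi-random order only shuffles within fixed-size chunks, so only one chunk's worth of samples needs to be resident at a time. This is an illustrative sketch of the idea, not FFCV's actual implementation:

```python
import random

def quasi_random_order(n, chunk_size, seed=0):
    """Shuffle indices chunk-by-chunk: only `chunk_size` samples need to be
    buffered at once, unlike a full ("perfect") shuffle of all n indices."""
    rng = random.Random(seed)
    order = []
    for start in range(0, n, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n)))
        rng.shuffle(chunk)       # randomness only within the chunk
        order.extend(chunk)      # chunks themselves stay in sequence
    return order

print(quasi_random_order(10, 4, seed=1))
```

The trade-off is weaker shuffling (nearby samples stay nearby) in exchange for a bounded memory footprint.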
Hi @charlesjhill, thank you for your suggestion and sorry for the late reply.
I have now run some tests in a new environment, with the latest ffcv and torch versions. Surprisingly, I am not able to reproduce the behavior described above.
Here is the code used to create the beton files (the only difference between the two versions is the max_resolution argument):
from torch.utils.data import Dataset
from PIL import Image
from ffcv.fields import RGBImageField
from ffcv.writer import DatasetWriter
import pathlib


# custom torch Image Dataset object
class ImageDataset(Dataset):
    def __init__(self, folder: str):
        """
        :param folder: path to images
        """
        self.samples = sorted(list(pathlib.Path(folder).rglob('*.png')) +
                              list(pathlib.Path(folder).rglob('*.jpg')) +
                              list(pathlib.Path(folder).rglob('*.bmp')) +
                              list(pathlib.Path(folder).rglob('*.JPEG')))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # path to string
        image_path = self.samples[idx].absolute().as_posix()
        image = Image.open(image_path).convert('RGB')
        return (image,)


# main part
final_path = '~/Documents/datasets/imagenet/ffcv/bugged_train.beton'  # final path for beton file
data_folder = '~/Documents/datasets/imagenet/train/'  # path to images

# create dataset
dataset = ImageDataset(folder=data_folder)

# create writer [VERSION 1]
# writer = DatasetWriter(final_path, {
#     'image': RGBImageField(write_mode='jpg', max_resolution=256),
# }, num_workers=8)

# create writer [VERSION 2]
writer = DatasetWriter(final_path, {
    'image': RGBImageField(write_mode='jpg'),
}, num_workers=8)

writer.from_indexed_dataset(dataset)
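For context, max_resolution caps the stored image size at write time, which is why version 1 is so much smaller on disk. A rough sketch of the effect, assuming the longer side is scaled down while preserving the aspect ratio (hypothetical helper, not FFCV's internal code):

```python
def capped_size(w, h, max_resolution):
    """Downscale so the longer side is at most max_resolution, keeping the
    aspect ratio. Sketch of what RGBImageField(max_resolution=...) implies
    for stored sizes; not FFCV's actual resizing code."""
    longest = max(w, h)
    if longest <= max_resolution:
        return w, h  # already small enough, stored as-is
    scale = max_resolution / longest
    return round(w * scale), round(h * scale)

print(capped_size(1920, 1080, 256))  # -> (256, 144)
```

Without the cap, every image is stored at its original resolution, so a handful of very large originals can dominate both the file size and the loader's buffer requirements.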
Then, I am loading images (testing the two versions):
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage
from ffcv.fields.rgb_image import CenterCropRGBImageDecoder
import torch
import time
path = '~/Documents/datasets/imagenet/ffcv/bugged_train.beton'  # path to the beton file

batch_size = 128  # tried different values
os_cache = False  # tried both True and False
order = OrderOption.RANDOM  # tried OrderOption.RANDOM and OrderOption.QUASI_RANDOM
image_size = 256
loader = Loader(path,
batch_size=batch_size,
num_workers=8,
order=order,
pipelines={
'image':
[
CenterCropRGBImageDecoder((image_size, image_size), ratio=1.),
ToTensor(),
ToDevice(torch.device(0), non_blocking=True),
ToTorchImage(),
]
},
os_cache=os_cache)
print(f'Testing with Batch Size = {batch_size}')
print(f'Testing with Order = {order}')
start = time.time()
for i, batch in enumerate(loader):
    images = batch[0]
    print(f'{i}: {images.shape}')
    if i == 15:
        break
print(f'Duration: {time.time() - start}')
When running the version created with max_resolution=256, this is the output:
Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
0: torch.Size([128, 3, 256, 256])
1: torch.Size([128, 3, 256, 256])
2: torch.Size([128, 3, 256, 256])
3: torch.Size([128, 3, 256, 256])
4: torch.Size([128, 3, 256, 256])
5: torch.Size([128, 3, 256, 256])
6: torch.Size([128, 3, 256, 256])
7: torch.Size([128, 3, 256, 256])
8: torch.Size([128, 3, 256, 256])
9: torch.Size([128, 3, 256, 256])
10: torch.Size([128, 3, 256, 256])
11: torch.Size([128, 3, 256, 256])
12: torch.Size([128, 3, 256, 256])
13: torch.Size([128, 3, 256, 256])
14: torch.Size([128, 3, 256, 256])
15: torch.Size([128, 3, 256, 256])
Duration: 11.110697746276855
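As a rough sanity check, that duration works out to roughly 184 images/s over the 16 batches read:

```python
# throughput implied by the timing above
batches, batch_size, duration = 16, 128, 11.110697746276855
throughput = batches * batch_size / duration  # images decoded per second
print(f'{throughput:.0f} images/s')  # -> 184 images/s
```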
Previously, running the same code with the "bugged" beton would cause my machine to freeze. Now, it raises a RuntimeError:
Testing with Batch Size = 128
Testing with Order = OrderOption.RANDOM
Traceback (most recent call last):
File "/.../ffcv_debug/run_loader.py", line 35, in <module>
for i, batch in enumerate(loader):
^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/loader/loader.py", line 226, in __iter__
return EpochIterator(self, selected_order)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/loader/epoch_iterator.py", line 65, in __init__
self.memory_allocations = self.loader.graph.allocate_memory(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 370, in allocate_memory
allocated_buffer = tuple(
^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/graph.py", line 371, in <genexpr>
allocate_query(q, batch_size, batches_ahead) for q in memory_allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python3.11/site-packages/ffcv/pipeline/allocation_query.py", line 35, in allocate_query
result = ch.empty(*final_shape,
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 116988345600 bytes. Error code 12 (Cannot allocate memory)
Apparently, some behavior changed, and there is now an extra check that prevents allocating too much RAM. However, I no longer have access to the previous versions of the code, so I can't double-check what specifically changed between the two versions.
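The failed allocation is consistent with how the loader sizes its decode buffers: without max_resolution, the per-image buffer must be large enough for the largest image in the dataset, so a single enormous original inflates every pre-allocated batch. A back-of-the-envelope sketch (the formula and the batches_ahead value are assumptions for illustration, not FFCV's actual code):

```python
def decode_buffer_bytes(batch_size, max_side, channels=3, batches_ahead=3):
    """Worst-case RAM for pre-allocated decode buffers: uint8 pixels, one byte
    per channel value, buffers sized for the largest image in the dataset.
    Illustrative only; FFCV's real allocation logic differs in detail."""
    return batch_size * batches_ahead * channels * max_side * max_side

print(decode_buffer_bytes(128, 256))    # capped dataset: about 75 MB
print(decode_buffer_bytes(128, 17000))  # one huge original: hundreds of GB
```

This would explain why the 256-capped beton loads comfortably while the uncapped one asks for over 100 GB up front.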
Anyway, I guess the issue can be closed.
Hello, I have created two different versions of the imagenet .beton file (the creation code is shown above). The only difference between the two versions is the max_resolution parameter. The two datasets are correctly created: version1.beton is approx. 20 GB, while version2.beton is approx. 80 GB.

At loading time:
- With version 1, everything works fine. Depending on the batch size and os_cache params, training may be faster or slower, but everything seems ok.
- With version 2 (80 GB), if I set the batch size as high as possible, training is very slow and may even freeze the machine completely. By monitoring resources, I noticed high CPU RAM usage (up to 100% right before freezing). I have tried both os_cache = True and os_cache = False, with the latter freezing the machine even before training starts. With os_cache = True, usually a couple of batches are loaded and the first epoch steps complete before freezing.

I have reproduced the bug on two different machines, with different operating systems, GPUs and hardware, so I don't think it is machine-related.