libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

os_cache=False does not work. #146

Closed · erow closed 2 years ago

erow commented 2 years ago

The following code does not work:

from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Cutout
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder

# Random resized crop
decoder = RandomResizedCropRGBImageDecoder((224, 224))

# Data decoding and augmentation
image_pipeline = [decoder, ToTensor(), ToTorchImage(), ToDevice(0)]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(0)]

# Pipeline for each data field
pipelines = {
    'image': image_pipeline,
    'label': label_pipeline
}

write_path = '/data/imagenet/train.ffcv'
num_workers = 4
bs = 64
loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
                os_cache=False,
                order=OrderOption.RANDOM, pipelines=pipelines)

The error message:

MemoryError                               Traceback (most recent call last)
/home/nbic/erow/pretrained/src/data/ffcv_dataset.py in <module>
    125 num_workers = 4
    126 bs=64
--> 127 loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
    128     os_cache = False,
    129                 order=OrderOption.RANDOM, pipelines=pipelines)

File ~/.conda/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/loader.py:134, in Loader.__init__(self, fname, batch_size, num_workers, os_cache, order, distributed, seed, indices, pipelines, custom_fields, drop_last, batches_ahead, recompile)
    132 self.batches_ahead = batches_ahead
    133 self.seed: int = seed
--> 134 self.reader: Reader = Reader(self.fname, custom_fields)
    135 self.num_workers: int = num_workers
    136 self.drop_last: bool = drop_last

File ~/.conda/envs/ffcv/lib/python3.9/site-packages/ffcv/reader.py:15, in Reader.__init__(self, fname, custom_handlers)
     13 self.read_field_descriptors()
     14 self.read_metadata()
---> 15 self.read_allocation_table()

File ~/.conda/envs/ffcv/lib/python3.9/site-packages/ffcv/reader.py:67, in Reader.read_allocation_table(self)
     65 def read_allocation_table(self):
     66     offset = self.header['alloc_table_ptr']
---> 67     alloc_table = np.fromfile(self._fname, dtype=ALLOC_TABLE_TYPE,
     68                               offset=offset)
     69     alloc_table.setflags(write=False)
     70     self.alloc_table = alloc_table

MemoryError: Unable to allocate 385. GiB for an array with shape (17242170709,) and data type [('sample_id', '<u8'), ('ptr', '<u8'), ('size', '<u8')]
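
(For scale: 17,242,170,709 records × 24 bytes per (sample_id, ptr, size) entry works out to roughly 385 GiB, which matches the requested allocation.)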

My environment is Python 3.9 + torch 1.7.1 (CUDA 10.1) with 64 GB of RAM. Is this a wrong setting, or a bug?

GuillaumeLeclerc commented 2 years ago

Hi!

We have a dedicated paragraph in the documentation covering this exact use case: https://docs.ffcv.io/parameter_tuning.html#scenario-large-scale-datasets.

TL;DR: you shouldn't use RANDOM. Perfectly uniform sampling of a dataset with small samples requires keeping almost the entire dataset in RAM.
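
For reference, here is a minimal sketch of the setup the linked docs recommend for datasets that don't fit in RAM: keep os_cache=False but switch the order to QUASI_RANDOM, which shuffles in a way that limits how much of the file must be resident at once. It reuses write_path, bs, num_workers, and pipelines from the report above:

from ffcv.loader import Loader, OrderOption

# QUASI_RANDOM trades perfectly uniform shuffling for bounded memory use,
# so it can be combined with os_cache=False on datasets larger than RAM.
loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
                os_cache=False,
                order=OrderOption.QUASI_RANDOM,
                pipelines=pipelines)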

erow commented 2 years ago

This looks like a bug in the error reporting. I found that the disk had run out of space, so the written file is incomplete. However, I never saw any alert about it. It would be better to validate the file before loading it.
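
As a stopgap, a minimal pre-load sanity check is sketched below. It relies only on details visible in the traceback above: the header stores an alloc_table_ptr offset, and each allocation-table record is three little-endian uint64s (sample_id, ptr, size), i.e. 24 bytes. How you obtain alloc_table_ptr is left to the caller; FFCV's Reader reads it internally as self.header['alloc_table_ptr'], and there is no public accessor, so treat this as a hypothetical helper:

import os

def looks_truncated(path, alloc_table_ptr):
    # A header offset at or past EOF means the write was cut short.
    file_size = os.path.getsize(path)
    if alloc_table_ptr >= file_size:
        return True
    # The region after the offset must hold a whole number of
    # 24-byte (sample_id, ptr, size) records.
    return (file_size - alloc_table_ptr) % 24 != 0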