Advice on increasing dataloading speed given ample resources - os_cache no effect?

Hi there - thank you for releasing a much needed data loading library for the pytorch community.

I've read over all the information in the performance guide and am empirically finding approximately equal performance across a wide variety of settings. I'd love to hear if there is anything I might be missing.

Context:

Dataset consists of about 2e5 images (640x448, single channel)
Hardware at disposal consists of A100's, 1.9TB RAM, 256 CPUs.
Runs are non-distributed, single runs on single A100.
torch Dataset uses a number of image augmentations, primarily from Albumentations library

Observations:

After writing images as raw, this yielded a 133GB train 24GB val set. Raw encoding improved speed slightly.
os_cache had no effect on timings
order (OrderOption.RANDOM, OrderOption.QUASI_RANDOM) had no effect on timing
batches_ahead had no effect on timings
repeated above with smaller datasets and observed same outcome
among num_workers (4, 16, 64), 16 workers is marginally better

Curiosities:

Since my augmentations are performed before dataset writing, these are entirely encapsulated in the beton files, correct? Thus there should not be any bottlenecks due to slow torch (i.e. non numba) code.
Is there an OS level limit I should be aware of w.r.t. RAM allocation? Ideally I'd like to cache the entire dataset in main memory.

Overall I would say my training speed is by no means bad or worse, but I did expect a much larger improvement with os_cache in particular. Appreciate any advice, thanks!

Extra: My deecoding pipeline is very basic.

   pipeline = {
        "x_img": [
            SimpleRGBImageDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
            # channel_last = True set by default, but maintains NCHW shape
            ToTorchImage(),
            NormalizeImage(mean, std, type=np.float32),
            Grayscale(),
        ],
        "x_2d": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "hbox_norm": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "box_norm": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "q": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "trans": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "so3": [
            NDArrayDecoder(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
        ],
        "annot": [BytesDecoder()],
    }

libffcv / ffcv

Advice on increasing dataloading speed given ample resources - os_cache no effect? #207