libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
Apache License 2.0

Reading .beton is extremely slow on HDD even if RAM is already filled with identical dataset #250

Open numpee opened 1 year ago

numpee commented 1 year ago

Current FFCV version: v0.0.4 compiled from source.

I have two identical FFCV .beton files (ImageNet train dataset): one on an NVMe SSD and the other on an HDD.

Using the .beton file on the SSD, I can load the full dataset in under 6 minutes, at around 7 it/s (batch_size=512, data loading only, no model forward/backward pass). I have set os_cache=True in the Loader initialization, and I have checked with free -mh that the dataset is cached in RAM after this pass. From my understanding, a subsequent pass through the same .beton file should load samples directly from the cache.

However, when I run a data loading pass using the .beton file on my HDD, the entire dataset appears to be loaded into the cache again. From the HDD, data loading runs at around 1.2 it/s and takes ~30 minutes in total, which is the same time it takes to load the .beton file after freeing the cache.
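For reference, the "data loading only" numbers above can be reproduced with a timing loop along these lines. This is a minimal sketch, not the exact script used: measure_throughput is a hypothetical helper name, and it accepts any iterable, so it makes no assumptions about the FFCV API beyond the loader being iterable.

```python
import time
from typing import Iterable, Tuple


def measure_throughput(loader: Iterable) -> Tuple[int, float, float]:
    """Iterate once over `loader`, discarding every batch.

    Returns (number of batches, elapsed seconds, batches per second).
    No model work is done, so the result reflects data loading only.
    """
    start = time.perf_counter()
    n_batches = 0
    for _ in loader:
        n_batches += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    rate = n_batches / elapsed if elapsed > 0 else float("inf")
    return n_batches, elapsed, rate
```

Calling this once with the loader built on /nvme/imagenet.beton and once with the loader built on /hdd/imagenet.beton (both with os_cache=True) gives the it/s and total-time figures to compare.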

Does the difference in path (SSD: /nvme/imagenet.beton, HDD: /hdd/imagenet.beton) affect how the samples are stored in the cache, even when the two .beton files are byte-for-byte identical?
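Some context for the question: the Linux page cache tracks cached pages per file (device and inode), not per content, so two copies with identical bytes are still distinct files as far as the OS is concerned. The sketch below shows one way to check whether two paths refer to the same underlying file; file_identity and same_underlying_file are hypothetical helper names, and only the standard library is used.

```python
import os
from typing import Tuple


def file_identity(path: str) -> Tuple[int, int]:
    """Return the (device, inode) pair that identifies a file to the OS."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_underlying_file(path_a: str, path_b: str) -> bool:
    """True only if both paths resolve to the same file on the same device.

    Two separate copies with identical contents return False.
    """
    return os.path.samefile(path_a, path_b)
```

If same_underlying_file("/nvme/imagenet.beton", "/hdd/imagenet.beton") is False, which is expected for copies on two different drives, the pages cached while reading one file would not be expected to serve reads of the other.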