libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Reading .beton is extremely slow on HDD even if RAM is already filled with identical dataset #250

Open numpee opened 1 year ago

numpee commented 1 year ago

Current FFCV version: v0.0.4 compiled from source.

I have two identical FFCV .beton files (ImageNet train dataset): one on an NVMe SSD and the other on an HDD.

Using the .beton file on the SSD, I can load the full dataset in less than 6 minutes, at around 7 it/s (batch_size=512, data loading only, with no model forward/backward pass). I have set os_cache=True in the Loader initialization, and free -mh confirms that the dataset is cached in RAM. From my understanding, a subsequent pass through the same .beton file should then read samples directly from the cache. However, when I run a data loading pass using the .beton file on my HDD, it seems as though the entire dataset is loaded into the cache again. From the HDD, data loading runs at around 1.2 it/s and takes ~30 minutes in total, which is the same time it takes to load the .beton file after freeing the cache.
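For reference, the it/s figures above come from a data-loading-only pass. A minimal sketch of such a measurement is below; `measure_throughput` is a hypothetical helper name, and it assumes only that the object passed in is iterable (for example an FFCV Loader built with os_cache=True), not any particular FFCV API.

```python
import time
from typing import Iterable, Tuple


def measure_throughput(batches: Iterable) -> Tuple[int, float, float]:
    """Iterate once over `batches` without doing any model work.

    Returns (number of iterations, elapsed seconds, iterations per second).
    """
    start = time.perf_counter()
    count = 0
    for _ in batches:  # data loading only: each batch is fetched and discarded
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length measurement on very fast iterables.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate
```

Running this once per .beton path, with the cache either warm or freshly dropped, gives numbers that are directly comparable between the SSD and HDD copies.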

Does the difference in path (SSD: /nvme/imagenet.beton, HDD: /hdd/imagenet.beton) affect how the samples are stored in the cache, even when the two .beton files are identical?
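One thing that can be checked from user space is whether the two paths refer to the same file on disk or to two separate copies. As far as I understand, the Linux page cache tracks pages per file (inode) and offset rather than by content, so two byte-identical copies would be cached independently; treat that as an assumption to verify, not a statement about FFCV internals. The sketch below uses only the standard library, and `same_underlying_file` is a hypothetical helper name.

```python
import os


def same_underlying_file(path_a: str, path_b: str) -> bool:
    """Return True if both paths resolve to the same device and inode.

    Byte-identical copies on different drives have different (device, inode)
    pairs, so this returns False for them even though the contents match.
    """
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```

For the two paths above, a call such as same_underlying_file("/nvme/imagenet.beton", "/hdd/imagenet.beton") would show whether the files are distinct as far as the operating system is concerned.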