libffcv / ffcv-imagenet

Train ImageNet *fast* in 500 lines of code with FFCV
Apache License 2.0

Imagenet dataset preparation size #1

Closed aniketrege closed 2 years ago

aniketrege commented 2 years ago

In an attempt to replicate results as a sanity test, I ran the data preparation script as ./write_imagenet.sh 500 0.50 90 in its default configuration on the ImageNet dataset. From the documentation at https://docs.ffcv.io/benchmarks.html, initializing the writer with RGBImageField(write_mode=proportion, compress_probability=0.5, max_resolution=512, jpeg_quality=90) should generate a dataset of size 202.04 GB. However, when I ran this myself, I got a train dataset of 337 GB and a val dataset of 15 GB.

I am wondering whether the compress_probability used for the documentation at https://docs.ffcv.io/benchmarks.html was higher than 0.5, which would lead to a smaller dataset than the one I got. It's unclear why I end up with a roughly 40% larger dataset using similar configuration values.

I'm also a bit confused by the comment below: as I understand it, prob=0.5 means JPEG encoding for 50% of the images and raw pixel values for the other 50% (not 90%?).

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded, 90% raw pixel values
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90
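
For reference, a minimal sketch of how a writer with these settings can be constructed (the paths and the use of torchvision's ImageFolder are assumptions for illustration; the actual script drives this through write_imagenet.py's own config system):

    # Hedged sketch, not the actual write_imagenet.py: paths are placeholders
    # and the dataset is assumed to be in a standard ImageFolder layout.
    from torchvision.datasets import ImageFolder

    from ffcv.writer import DatasetWriter
    from ffcv.fields import IntField, RGBImageField

    dataset = ImageFolder('/path/to/imagenet/train')  # hypothetical path

    writer = DatasetWriter('/path/to/train_500_0.50_90.ffcv', {  # hypothetical output path
        # Mirrors ./write_imagenet.sh 500 0.50 90: resize so no side exceeds
        # 500px, JPEG-encode ~50% of images at quality 90, store the rest raw.
        'image': RGBImageField(write_mode='proportion',
                               max_resolution=500,
                               compress_probability=0.5,
                               jpeg_quality=90),
        'label': IntField(),
    })
    writer.from_indexed_dataset(dataset)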
bhattg commented 2 years ago

@lengstrom @andrewilyas Hello, thanks for the great framework! Any updates on this issue? It would be very helpful :-)

lengstrom commented 2 years ago

Good catch on the README; we are looking into the dataset size issue!

Riretta commented 2 years ago

I have the same problem :(

lengstrom commented 2 years ago

The benchmarks page is wrong; we will update it. ImageNet comes out to 339 GB / 16 GB (train/val) when I run the script.
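
For a rough sense of scale, a back-of-the-envelope sketch (the train-image count is ImageNet-1k's; the per-image raw and JPEG averages below are assumptions, not measurements) shows that a 50%-raw configuration lands in the hundreds of GB, which is consistent with the 339 GB reported above and not with the 202 GB previously listed on the benchmarks page:

    # Back-of-the-envelope estimate of the train .ffcv size under
    # write_mode='proportion' with compress_probability=0.5.
    n_train = 1_281_167            # ImageNet-1k training images

    # Assumed averages, not measurements: a raw uint8 RGB image capped at a
    # 500px longest side is on the order of 500*350*3 bytes; a quality-90
    # JPEG of the same image is on the order of tens of KB.
    avg_raw_bytes = 500 * 350 * 3  # ~0.53 MB per raw image (assumption)
    avg_jpg_bytes = 70_000         # ~0.07 MB per JPEG image (assumption)

    est_bytes = 0.5 * n_train * (avg_raw_bytes + avg_jpg_bytes)
    print(f"~{est_bytes / 1e9:.0f} GB")  # -> ~381 GB under these assumptions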

mzhaoshuai commented 2 years ago

https://github.com/libffcv/ffcv/blob/bfd9b3d85e31360fada2ecf63bea5602e4774ba3/ffcv/fields/rgb_image.py#L337

        write_mode = self.write_mode
        as_jpg = None

        if write_mode == 'smart':
            as_jpg = encode_jpeg(image, self.jpeg_quality)
            write_mode = 'raw'
            if self.smart_threshold is not None:
                if image.nbytes > self.smart_threshold:
                    write_mode = 'jpg'
        elif write_mode == 'proportion':
            if np.random.rand() < self.proportion:
                write_mode = 'jpg'
            else:
                write_mode = 'raw'

The default write mode in https://github.com/libffcv/ffcv-imagenet/blob/main/write_imagenet.py is smart, and smart_threshold is None. So the script ends up writing everything in raw mode? @lengstrom
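
To make the dispatch concrete, here is a standalone restatement of the quoted branching (the helper name is mine, not FFCV's). With write_imagenet.py's defaults (write_mode='smart', smart_threshold=None) every image resolves to raw, whereas proportion with proportion=0.5 splits roughly 50/50 between JPEG and raw:

    import numpy as np

    def resolve_write_mode(write_mode, image_nbytes, smart_threshold=None, proportion=0.5):
        """Standalone restatement of the branching quoted above (not FFCV code)."""
        if write_mode == 'smart':
            # With smart_threshold=None this always falls through to 'raw'.
            write_mode = 'raw'
            if smart_threshold is not None and image_nbytes > smart_threshold:
                write_mode = 'jpg'
        elif write_mode == 'proportion':
            write_mode = 'jpg' if np.random.rand() < proportion else 'raw'
        return write_mode

    # 'smart' with the default smart_threshold=None: everything comes out raw.
    print(resolve_write_mode('smart', image_nbytes=500 * 375 * 3))      # -> 'raw'

    # 'proportion' with proportion=0.5: roughly half jpg, half raw.
    modes = [resolve_write_mode('proportion', 0) for _ in range(10_000)]
    print(sum(m == 'jpg' for m in modes) / len(modes))                  # ~0.5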

lengstrom commented 2 years ago

The write mode that should have been used is proportion; I am pretty sure this is what was used to generate these datasets.

mzhaoshuai commented 2 years ago

> The write mode that should have been used is proportion; I am pretty sure this is what was used to generate these datasets.

Thx for your quick reply. https://github.com/libffcv/ffcv-imagenet/blob/e97289fdacb4b049de8dfefefb250cc35abb6550/write_imagenet.py#L17

I see the script's default is smart, so is this just a typo? Did you actually use proportion when creating the FFCV-format dataset?

mzhaoshuai commented 2 years ago

My mistake, I see the line now. Thanks for your reply. https://github.com/libffcv/ffcv-imagenet/blob/e97289fdacb4b049de8dfefefb250cc35abb6550/write_imagenet.sh#L12