tensorflow/datasets

[GSoC] Better fake data compression script #1666

Open · Conchylicultor opened this issue 4 years ago

Conchylicultor commented 4 years ago

Context: Since #1661, we have a script replace_fake_images.py which compresses all images inside our fake data directory by replacing the random-noise images with uniform-color images of the same shape/dtype.
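A minimal sketch of the idea (not the actual script; file names and sizes are illustrative): random noise is nearly incompressible, while a uniform-color image of the same shape/dtype compresses to almost nothing.

```python
import numpy as np
from PIL import Image

# Random noise: PNG compression can barely shrink it.
noise = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
# Uniform color with the same shape and dtype: compresses extremely well.
uniform = np.full_like(noise, 128)

Image.fromarray(noise).save('noise.png')      # close to the raw size
Image.fromarray(uniform).save('uniform.png')  # a few hundred bytes
```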

Goal: However, when applied to our fake directory, some files seem to increase in size instead of getting smaller (for the list of modified files, see #1634). For example, imagenet2012/ILSVRC2012_img_train.tar grows by about 50 KB. Some JPEG images are also badly compressed by the script (e.g. diabetic_retinopathy_detection/sample/1_left.jpeg).

The goal of this feature request would be:

Additionally, other improvements could be done in parallel by different people:

Eshan-Agarwal commented 4 years ago

@Conchylicultor I think we can decrease the DPI of the images so that the image sizes decrease.

Eshan-Agarwal commented 4 years ago

Can we change the image format? For example, converting all images to JPEG would reduce the size further.

Conchylicultor commented 4 years ago

The goal of the fake data is to simulate the original dataset. This puts some constraints on the transformations:

Eshan-Agarwal commented 4 years ago

@Conchylicultor I observe that the files whose size increases are .zip or .tar archives: when you inspect these re-compressed .tar and .zip files, they contain nested folders, but the image sizes inside those folders are reduced compared to the originals.

Eshan-Agarwal commented 4 years ago

The archives are just not re-compressed in a better way.

Conchylicultor commented 4 years ago

> I observe that the files whose size increases are .zip or .tar archives: when you inspect these re-compressed .tar and .zip files, they contain nested folders, but the image sizes inside those folders are reduced compared to the originals.

Yes, this is the issue. The new zip file should have a size smaller than or equal to the original. If the zip file contains smaller files, why would the total zip size be bigger?

Eshan-Agarwal commented 4 years ago

@Conchylicultor Maybe because it creates nested folders, or maybe it compresses to a larger size depending on the parameters passed for compression. I am checking on it and will send you a PR soon with better compression.

ManishAradwad commented 4 years ago

> Some JPEG images are also badly compressed by the script (e.g. diabetic_retinopathy_detection/sample/1_left.jpeg)

@Conchylicultor How did you assess the quality of the compressed image?

And could it be that the images whose size increases after compression actually have dimensions smaller than 256x256 before compression?

Eshan-Agarwal commented 4 years ago

I think no actual compression is being done: the script only creates an archive, which stores the data at the same size the images have after compression.
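A minimal sketch of what that would mean in code (assuming the script writes archives with zipfile's defaults; file names are placeholders):

```python
import zipfile

# zipfile.ZipFile defaults to ZIP_STORED, which archives members
# without compressing them at all.
with zipfile.ZipFile('archive.zip', 'w') as zf:
    zf.write('image.png')  # stored as-is, not deflated
```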

Eshan-Agarwal commented 4 years ago

@Conchylicultor Some audio files also increased in size, e.g. tensorflow_datasets/testing/test_data/fake_examples/savee/AudioData.zip grew by 6.58 MB, which is huge for a .zip. Setting the compression to ZIP_DEFLATED solves the problem for all kinds of .zip files, but for .tar there is no parameter like ZIP_DEFLATED (see the sketch after this comment).

Can we convert .tar to another format, or keep it the same?
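A hedged sketch of both points (rezip_dir and the paths are hypothetical names, not from the actual script): ZIP_DEFLATED enables real compression for zips, while for tars gzip can be layered on top of the archive instead.

```python
import os
import shutil
import zipfile

def rezip_dir(src_dir, zip_path):
    """Re-zip src_dir with DEFLATE compression and relative paths."""
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # A relative arcname keeps absolute paths out of the archive.
                zf.write(path, arcname=os.path.relpath(path, src_dir))

# .tar has no ZIP_DEFLATED equivalent, but gzip can be applied on top:
shutil.make_archive('fake_data', 'gztar', 'fake_data_dir')  # -> fake_data.tar.gz
```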

MadhavEsDios commented 4 years ago

Hey @Conchylicultor, I just started working on this and have submitted a pull request. To address the concerns that you raised:

  1. Increased size of zip/tar files: I have replaced the zip re-compression loop

```python
# Compress the .zip file again
with zipfile.ZipFile(zip_filepath, 'w') as zip_file:
  for file_dir, _, files in os.walk(temp_dir):
    for file in files:
      file_path = os.path.join(file_dir, file)
      zip_file.write(file_path)
```

with just

```python
shutil.make_archive(tar_filepath[:-4], 'gztar', temp_dir)
```

Advantages of doing this: 1) the code is easily readable; 2) the code is easy to maintain; 3) previously the code generated zip/tar files with absolute paths, which also contributed to the increased sizes despite compression; this line solves that, as shutil uses os.chdir to preserve the directory structure without absolute paths; 4) shutil is part of the Python 3 standard library, so there are no additional dependencies.

  2. Tar file size has increased (e.g. for imagenet2012): It was indeed surprising that, despite the file contents being smaller, the compressed tar file was bigger than the original. I spent some time debugging this issue but did not find any easily explainable problem. I figured that only the final size of the file matters, not its compression style. Thus, instead of the final file being a plain tar, I have modified it to be a gztar, with really good results (this can be verified with my pull request; a size-comparison sketch follows at the end of this comment). So I have replaced

```python
# Converting into tarfile again to decrease the space taken by the file
with tarfile.open(tar_filepath, 'w' + extension) as tar:
  tar.add(temp_dir, recursive=True)
```

with

```python
shutil.make_archive(tar_filepath[:-4], 'gztar', temp_dir)
```

  3. The badly compressed diabetic_retinopathy_detection/sample/1_left.jpeg image: For this, I tried playing around with the compression options offered by PIL. Tweaking the quality parameter (default 75) down to 50 did not make any difference. I ultimately set the optimize parameter to True, which solved the issue.
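A minimal sketch of that last change (the input path is just the example from this thread; the output name and quality value are illustrative):

```python
from PIL import Image

# Re-save a JPEG with Pillow's optimize flag: the encoder makes an
# extra pass to pick optimal Huffman tables, which usually shrinks
# the file at the cost of extra encoding time.
img = Image.open('diabetic_retinopathy_detection/sample/1_left.jpeg')
img.save('1_left_optimized.jpeg', quality=75, optimize=True)
```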

The only downside of this solution is that it takes longer to execute than the previous implementation (possibly because of the optimize parameter in the save function).
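As referenced in point 2 above, a hypothetical way to compare the two archive styles ('fake_dir' and the output names are placeholders, not paths from the repository):

```python
import os
import shutil
import tarfile

# Plain tar: archives the files without any compression.
with tarfile.open('plain.tar', 'w') as tar:
    tar.add('fake_dir', recursive=True)

# gztar: the same archive, but gzip-compressed (writes packed.tar.gz).
shutil.make_archive('packed', 'gztar', 'fake_dir')

print(os.path.getsize('plain.tar'), os.path.getsize('packed.tar.gz'))
```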