Open Conchylicultor opened 4 years ago
@Conchylicultor I think we can decrease the DPI of the images so that the image sizes will decrease.
Can we change the images' dtype? For example, converting all images to .jpg would reduce the size better.
The goal of the fake data is to simulate the original dataset. This puts some constraints on the transformations:
@Conchylicultor I observe that the images whose size increases are in .zip or .tar archives: when you inspect these re-compressed .tar and .zip files, they contain folders of folders, but the image sizes inside these folders are reduced compared to the originals.
Images are not re-compressed in a better way
> @Conchylicultor I observe that the images whose size increases are in .zip or .tar archives: when you inspect these re-compressed .tar and .zip files, they contain folders of folders, but the image sizes inside these folders are reduced compared to the originals.
Yes, this is the issue. The new zip file should have a size smaller than or equal to the original size. If the zip file contains smaller files, why would the total zip size be bigger?
@Conchylicultor Maybe because it creates folders of folders, or maybe it compresses to a larger size depending on the parameters passed for compression. I am looking into it and will send you a PR soon with better compression.
Some JPEG images are also badly compressed by the script (e.g. diabetic_retinopathy_detection/sample/1_left.jpeg)
@Conchylicultor How did you assess the quality of the compressed image?
And could it be that the images whose size increases after compression actually have dimensions smaller than 256x256 (before compression)?
I think no compression is actually done: it only creates an archive, which packs the data at the same size the images already have.
@Conchylicultor Some audio file sizes also increased, e.g. tensorflow_datasets/testing/test_data/fake_examples/savee/AudioData.zip grew by 6.58 MB, which is huge for a .zip. Setting the compression to ZIP_DEFLATED solves the problem for all kinds of .zip files, but for .tar there is no equivalent of ZIP_DEFLATED. Can we convert .tar to another format, or should we keep it the same?
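As a sketch of the difference that flag makes (the filename and payload here are made up), `zipfile`'s default `ZIP_STORED` mode only archives entries without compressing them, while `ZIP_DEFLATED` actually compresses each entry:

```python
import io
import zipfile

# Made-up payload: highly compressible, like a uniform-color fake image.
payload = b"\x00" * 100_000

def zip_size(compression):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("fake_image.raw", payload)
    return len(buf.getvalue())

stored = zip_size(zipfile.ZIP_STORED)      # the default: archive without compressing
deflated = zip_size(zipfile.ZIP_DEFLATED)  # actually compresses each entry

print(stored, deflated)
```

With `ZIP_STORED`, the archive is slightly *larger* than the input (entry headers add overhead), which matches the size increases observed above.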
Hey @Conchylicultor, I just started working on this and have submitted a pull request. To address the concerns that you raised:
Increased size of zip / tar files:
I have replaced

```python
# Compressed the .zip file again
with zipfile.ZipFile(zip_filepath, 'w') as zip_file:
  for file_dir, _, files in os.walk(temp_dir):
    for file in files:
      file_path = os.path.join(file_dir, file)
      zip_file.write(file_path)
```

with just

```python
shutil.make_archive(tar_filepath[:-4], 'gztar', temp_dir)
```
Advantages of doing this:
1) Code is easily readable
2) Easy to maintain code
3) Previously the code generated zip/tar files containing absolute paths, which also contributed to the increased sizes despite compression. This line solves it, as shutil uses os.chdir to preserve the directory structure without using absolute paths.
4) shutil is part of the Python standard library, so there are no additional dependencies
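A quick sanity check of point 3 (the directory layout and filenames here are hypothetical): `shutil.make_archive` stores entries relative to its `root_dir` argument, so no absolute paths end up in the archive:

```python
import os
import shutil
import tarfile
import tempfile

# Hypothetical layout standing in for an extracted fake-data directory.
temp_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(temp_dir, "train", "images"))
with open(os.path.join(temp_dir, "train", "images", "fake.jpeg"), "wb") as f:
    f.write(b"\x00" * 1024)

# Entries are stored relative to root_dir (temp_dir), not as absolute paths.
archive_path = shutil.make_archive(
    os.path.join(tempfile.mkdtemp(), "archive"), "gztar", temp_dir)
with tarfile.open(archive_path) as tar:
    names = tar.getnames()

print(names)
```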
Tar file size has increased (e.g. imagenet2012):
It was indeed surprising that despite the file contents being smaller in size, the compressed tar file was bigger compared to the original. I spent some time debugging this issue, however, I did not find any easily explainable problem.
I figured that it is only the final size of the file that matters and not its compression style.
Thus, instead of the final file being a plain tar, I have modified it to be a gztar, with really good results. (This can be verified using my pull request.)
So, I have replaced

```python
# Converting into tarfile again to decrease the space taken by the file
with tarfile.open(tar_filepath, 'w' + extension) as tar:
  tar.add(temp_dir, recursive=True)
```

with

```python
shutil.make_archive(tar_filepath[:-4], 'gztar', temp_dir)
```
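A rough sketch of why the gztar switch helps (the file contents here are a made-up stand-in for the uniform-color fake images): a plain tar only concatenates files with block padding, while gztar gzips the whole stream:

```python
import os
import shutil
import tempfile

# Made-up stand-in for a directory of uniform-color fake images.
src = tempfile.mkdtemp()
with open(os.path.join(src, "fake.raw"), "wb") as f:
    f.write(b"\x00" * 500_000)

out = tempfile.mkdtemp()
plain = shutil.make_archive(os.path.join(out, "plain"), "tar", src)  # archive only
gz = shutil.make_archive(os.path.join(out, "small"), "gztar", src)   # archive + gzip

print(os.path.getsize(plain), os.path.getsize(gz))
```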
The badly compressed `diabetic_retinopathy_detection/sample/1_left.jpeg` image: For this, I tried playing around with the compression options offered by PIL. Tweaking the quality parameter (default=75) down to 50 did not bring about any difference. I ultimately set the optimize parameter to True, which solved this issue.
The only downside of this solution is that it takes longer to execute than the previous implementation (possibly because of the optimize parameter in the save function).
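For reference, a minimal sketch of the PIL save options discussed above (requires Pillow; the uniform-color image is only an assumption about what the script generates). `optimize=True` makes the JPEG encoder do an extra pass to build optimal Huffman tables, which trades encoding time for size:

```python
import io

from PIL import Image  # requires Pillow

# Assumption: the script produces uniform-color images like this one.
img = Image.new("RGB", (512, 512), color=(128, 128, 128))

def jpeg_size(**save_kwargs):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", **save_kwargs)
    return len(buf.getvalue())

default_size = jpeg_size(quality=75)                   # PIL's default quality
optimized_size = jpeg_size(quality=75, optimize=True)  # extra Huffman-table pass

print(default_size, optimized_size)
```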
Context: Since #1661, we have a script replace_fake_images.py which compresses all images inside our fake data directory by replacing the random-noise images with a uniform-color image of the same shape/dtype.
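As an illustration of why that replacement helps (using zlib here as a stand-in for the image and archive codecs): random noise is essentially incompressible, while a uniform buffer compresses to almost nothing:

```python
import os
import zlib

noise = os.urandom(100_000)    # random-noise pixels: essentially incompressible
uniform = b"\x80" * 100_000    # uniform-color pixels: highly compressible

noise_size = len(zlib.compress(noise))
uniform_size = len(zlib.compress(uniform))

print(noise_size, uniform_size)
```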
Goal: However, when applied to our fake directory, some files seem to increase in size instead of getting smaller (for the list of modified files, see #1634). For example, imagenet2012/ILSVRC2012_img_train.tar increases by +50KB. Some JPEG images are also badly compressed by the script (e.g. diabetic_retinopathy_detection/sample/1_left.jpeg).
The goal of this feature request would be:
Additionally, other improvements could be done in parallel by different people: