uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

How to reduce parquet size #747

Open journey-wang opened 2 years ago

journey-wang commented 2 years ago

Hi Everyone,

I've stored 899 images (about 48 MB) into a Petastorm parquet dataset, but I got almost 240 MB of parquet files. Please help me figure out why the parquet files are so large and how to reduce their size.

The code I used is from https://github.com/uber/petastorm/issues/497

```
root@br1609hpc30:~# find flower_photos/dandelion/ | wc -l
899
root@br1609hpc30:~# du -sh flower_photos/dandelion/
48M     flower_photos/dandelion/
root@br1609hpc30:~# du -sh /tmp/petastorm_ingest_test/
240M    /tmp/petastorm_ingest_test/
```

Best regards.
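A plausible cause (an assumption, not confirmed in this thread): if the images are stored as *decoded* pixel arrays rather than as compressed image bytes, each row holds width × height × channels bytes, which is far larger than the original PNG/JPEG file. A back-of-the-envelope sketch, using assumed image dimensions (the actual flower_photos images vary in size):

```python
# Rough size arithmetic for the reported dataset. The image dimensions
# below are an assumption for illustration, not taken from the thread.
n_images = 899
dataset_mb = 48
avg_file_kb = dataset_mb * 1024 / n_images    # ~54.7 KB per image on disk

h, w, channels = 333, 500, 3                  # assumed typical dimensions
decoded_kb = h * w * channels / 1024          # ~487.8 KB per decoded RGB array

print(f"avg file on disk: {avg_file_kb:.1f} KB")
print(f"decoded array:    {decoded_kb:.1f} KB")
print(f"inflation factor: {decoded_kb / avg_file_kb:.1f}x")
```

Parquet's generic compression (e.g. snappy) claws back part of that inflation, but much less than an image-specific codec would, which could plausibly land the store around the observed 240 MB (a 5x inflation over the 48 MB of source files).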

selitvin commented 2 years ago

Tried reproducing the issue in this PR: https://github.com/uber/petastorm/pull/749

Got:

```
Parquet size 89105.625 KB
png file size: 88.3056640625 KB
Size per parquet row: 89.105625 KB
```

I.e., the size of the parquet store matches the expectation; no significant overhead is observed.