A couple of general questions:
The objective of this function is to enable users to correctly run the optimize function on specific partitions based on the analysis done using the output of this function, right?
What would be the output of this function for non-partitioned tables?
Yes, it is targeted towards the optimize function on specific partitions.
If it's run on a non-partitioned table, the output looks like this:
```
DeltaHelpers.deltaPartitionWiseFileSizeDistribution(writePath).show(false)

+---------------+--------------------+------------------+-----------------+-------------+-------------+------------------------------------------------+
|partitionValues|num_of_parquet_files|mean_size_of_files|stddev           |min_file_size|max_file_size|Percentile[10th, 25th, Median, 75th, 90th, 95th]|
+---------------+--------------------+------------------+-----------------+-------------+-------------+------------------------------------------------+
|[]             |2                   |1167.5            |31.81980515339464|1145         |1190         |[1145, 1145, 1145, 1190, 1190, 1190]            |
+---------------+--------------------+------------------+-----------------+-------------+-------------+------------------------------------------------+
```
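For readers following along, here is a minimal sketch of how such a partition-wise distribution could be derived from the Delta transaction log. It assumes Delta's internal `org.apache.spark.sql.delta.DeltaLog` API is on the classpath and that `AddFile` entries expose `partitionValues` and `size`; the helper in the library may well be implemented differently.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.delta.DeltaLog

// Hypothetical sketch -- not the library implementation.
def fileSizeDistributionSketch(spark: SparkSession, path: String): DataFrame = {
  // Each AddFile in the current snapshot carries partitionValues and size (bytes).
  // DeltaLog is Delta's internal API and may change between versions.
  val files = DeltaLog.forTable(spark, path).update().allFiles.toDF()

  files
    // Map columns cannot be grouped on directly, so turn the partition map
    // into an array of (key, value) structs first.
    .withColumn("partitionValues", map_entries(col("partitionValues")))
    .groupBy("partitionValues")
    .agg(
      count("size").as("num_of_parquet_files"),
      mean("size").as("mean_size_of_files"),
      stddev("size").as("stddev"),
      min("size").as("min_file_size"),
      max("size").as("max_file_size"),
      expr("percentile(size, array(0.10, 0.25, 0.50, 0.75, 0.90, 0.95))")
        .as("Percentile[10th, 25th, Median, 75th, 90th, 95th]")
    )
}
```

In a non-partitioned table every file has an empty partition map, so all files collapse into the single `[]` group shown in the output above.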
@joydeepbroy-zeotap LGTM. Once you remove the small detail in the docs, feel free to merge.
@brayanjuls @MrPowers Wanted to start the discussion here around `delta_file_sizes`.

The current implementation,

avg size = total size in bytes / num of files

in both `jodie` and `mack` may not portray a true picture of the average file sizes. Production Delta Tables are almost always partitioned, and file sizes vary widely across those partitions, so adding up all the sizes and dividing by the total number of files hides that variation. For example, a partition with thousands of tiny files and another with a handful of large, well-compacted files average out to a size that describes neither. The true picture only emerges when we traverse down to the leaf partitions and compute the mean file size along with the standard deviation, min and max. Since we need these statistics for making decisions around compaction, having them at the partition level shows which partitions actually need compaction. The same analysis can be done with the number of records hosted in each parquet file. Generally there is a one-to-one relationship between `size` and `num_records`, but sometimes a low `num_records` can come with a huge `file_size` due to complex datatypes like maps and arrays in the schema.

This PR aims at providing such insights to the user at each partition level as well as at the overall Delta Table level. To provide context, here is an example from one of our prod tables:
As you can see, the distribution of file sizes and num records varies widely across partitions and cannot be represented by a single global average. I have not added tests yet but will do so in the coming days; until then this stays in draft, but I wanted to open up the discussion. Also, please suggest good method names.
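For the `num_records` side of the analysis, a similarly hedged, hypothetical sketch is below. It assumes `numRecords` can be pulled out of each `AddFile`'s `stats` JSON (available when the table collects file-level statistics); the names and the exact log access may differ from what this PR implements.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.delta.DeltaLog

// Hypothetical sketch of the num_records counterpart -- not the PR's code.
def numRecordsDistributionSketch(spark: SparkSession, path: String): DataFrame = {
  val files = DeltaLog.forTable(spark, path).update().allFiles.toDF()

  files
    // AddFile.stats is a JSON string; numRecords is present when the table
    // collects file-level statistics.
    .withColumn("num_records", get_json_object(col("stats"), "$.numRecords").cast("long"))
    .withColumn("partitionValues", map_entries(col("partitionValues")))
    .groupBy("partitionValues")
    .agg(
      count("num_records").as("num_of_parquet_files"),
      mean("num_records").as("mean_num_records_in_files"),
      stddev("num_records").as("stddev"),
      min("num_records").as("min_num_records"),
      max("num_records").as("max_num_records"),
      expr("percentile(num_records, array(0.10, 0.25, 0.50, 0.75, 0.90, 0.95))")
        .as("Percentile[10th, 25th, Median, 75th, 90th, 95th]")
    )
}
```

Comparing this distribution with the file-size one per partition is what surfaces the "few records, huge files" case caused by map/array-heavy schemas.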
The file sizes are a little off; working on fixing it so that sizes are shown in MB.