julioasotodv / spark-df-profiling

Create HTML profiling reports from Apache Spark DataFrames
MIT License

Optimize corr_matrix() performance #35

Closed XD-DENG closed 4 years ago

XD-DENG commented 4 years ago

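A correlation matrix has two important properties:

1. It is symmetric: corr(X, Y) equals corr(Y, X).
2. Every diagonal entry is 1: corr(X, X) equals 1.

These two properties reduce the number of DF.corr() calls from n^2 to (n^2-n)/2, which cuts it by more than half: the n diagonal entries need no call, and only one of each symmetric pair among the remaining n^2-n entries has to be computed.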

Say there are 100 columns. The current implementation in the master branch needs to call DF.corr() 10,000 times, while my code reduces this number to 4,950.
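To make the counting concrete, here is a minimal sketch of the idea, not the code from this PR. The function name `corr_matrix_upper` and the `corr` callable parameter are hypothetical. With a Spark DataFrame `df` one would presumably pass `df.corr` and a list of its numeric column names.

```python
from itertools import combinations


def corr_matrix_upper(columns, corr):
    """Build a full correlation matrix with one corr() call per unordered pair.

    `columns` is a list of column names and `corr` is any callable taking two
    column names and returning a float (hypothetical interface for this sketch).
    """
    matrix = {a: {b: None for b in columns} for a in columns}

    # Diagonal: assumed to be 1.0 without calling corr().
    # Caveat: for a constant column the true correlation is undefined.
    for col in columns:
        matrix[col][col] = 1.0

    # Off-diagonal: compute each unordered pair once and mirror it.
    for a, b in combinations(columns, 2):
        value = corr(a, b)
        matrix[a][b] = value
        matrix[b][a] = value

    return matrix


# Hypothetical usage with a Spark DataFrame `df` (numeric columns only):
# matrix = corr_matrix_upper(numeric_column_names, df.corr)
```

For n columns the loop over `combinations(columns, 2)` runs n(n-1)/2 times, which gives the 4,950 calls mentioned above when n is 100.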

Visualize the Idea

Blue cells indicate the entries for which invoking DF.corr() is required.

Green cells indicate the entries for which we don't have to invoke DF.corr() due to the properties of correlation matrix.

XD-DENG commented 4 years ago

Hi @julioasotodv, mind taking a look? Let me know if any clarification is needed. Cheers.

julioasotodv commented 4 years ago

Indeed, you are totally right.

Merging. Thank you very much!