shakedzy / dython

A set of data tools in Python
http://shakedzy.xyz/dython/
MIT License
496 stars 102 forks

Add option to drop nan values in each pair of columns independently #130

Closed matbb closed 1 year ago

matbb commented 2 years ago

Describe the new feature:

When NaN values in a third column interfere with the calculation of the correlation between two columns, dropping samples for each pair of columns independently gives more insight. Example columns:

c1 = [1, 2, 3, 4, nan]
c2 = [1, 2, 3, nan, 5]
c3 = [nan, nan, nan, 2, 1]

In a dataframe containing all three columns, with the NaN strategy set to drop samples, the correlation between c1 and c2 is not 1 (by placing NaN values in c3, one can construct a dataframe that gives correlation coefficients that are inappropriate to varying degrees). When the NaN strategy is set to a replacement value, the coefficient is also not the intuitively correct value (1), and can be arbitrarily far from it.

Proposal: allow dropping samples for each pair of columns independently, before the correlation of that pair is calculated.
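To illustrate the proposal, here is a minimal standard-library sketch of pairwise dropping applied to the example columns above. It is not dython's implementation: the helper names `pearson` and `pairwise_drop_corr` are hypothetical, and Pearson correlation is assumed as the measure for this numeric example.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; nan if fewer than 2 samples or zero variance."""
    n = len(xs)
    if n < 2:
        return math.nan
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def pairwise_drop_corr(columns):
    """Hypothetical helper: for each pair of columns, drop only the samples
    where either member of that pair is NaN, then correlate what remains."""
    result = {}
    for a, b in combinations(columns, 2):
        kept = [
            (x, y)
            for x, y in zip(columns[a], columns[b])
            if not (math.isnan(x) or math.isnan(y))
        ]
        xs = [x for x, _ in kept]
        ys = [y for _, y in kept]
        result[(a, b)] = pearson(xs, ys)
    return result


nan = math.nan
columns = {
    "c1": [1, 2, 3, 4, nan],
    "c2": [1, 2, 3, nan, 5],
    "c3": [nan, nan, nan, 2, 1],
}

# Dropping whole rows that contain any NaN: every row of this example
# has a NaN in some column, so no complete rows remain.
complete_rows = [
    row for row in zip(*columns.values())
    if not any(math.isnan(v) for v in row)
]

# Dropping per pair: c1 and c2 share three complete samples.
corr = pairwise_drop_corr(columns)
```

In this sketch `corr[("c1", "c2")]` is computed from the three samples both columns share, while the pairs involving c3 have only one shared sample each and therefore yield NaN. A real implementation would have to decide how to report pairs with too few overlapping samples.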

What is the current outcome?

The computed values do not match the intuitively expected result. The difference can be significant, depending on how the NaN values in the other columns correlate with the values in the observed pair of columns.

Is it backward-compatible?

Yes, the change is only applied when nan_strategy is set to the new option.