gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU execution, and overlapping structure, and includes visualizations
Apache License 2.0

Should we remove duplicates? #9

Open ianchute opened 5 years ago

ianchute commented 5 years ago

I have a huge number of duplicates in my data and I'm wondering if it is fine to remove them. I'm aware that removing duplicates changes the distribution of the data, and that many machine learning models would not handle that well. My problem is that my data is huge, so anything that makes CorEx faster (like duplicate removal) would be of great benefit.

So, is it fine to remove duplicates? If yes, what are the adverse effects?
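For the speed concern, one common approach is to collapse exact duplicate rows before fitting. This is a minimal sketch, not from the thread: the toy matrix is made up, and whether you feed the counts back in as sample weights is left to you.

```python
import numpy as np

# Toy data standing in for a (samples x variables) matrix;
# rows 0, 1, and 3 are exact duplicates.
X = np.array([[0, 1, 1],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 1]])

# Collapse exact duplicate rows. `counts` records how many times each
# unique row appeared, so no information about the empirical
# distribution is lost, only redundant copies.
X_unique, counts = np.unique(X, axis=0, return_counts=True)

print(X_unique.shape)  # far fewer rows to pass to CorEx's fit()
print(counts)          # multiplicities of each unique row
```

You would then fit CorEx on `X_unique` instead of `X`; the trade-off is that rare and common rows now contribute equally unless you reweight them.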

gregversteeg commented 5 years ago

Yes, I think you should remove duplicates. CorEx looks for clusters of variables with high mutual information. Duplicate columns have the highest mutual information possible, so they will dominate the signal and possibly wash out more interesting relationships. I would look at it this way: duplicates reflect something artificial about the data processing, so by taking them out we can discover the intrinsic relationships in the data.

Another way to look at this is hierarchically. CorEx will lump together duplicates in the first layer and associate these duplicate columns with a single factor. Then you might be able to find weaker relationships between the factor representing the duplicates and other factors in the second layer. By taking out the duplicates, you are essentially adding a "layer 0" that manually extracts this known and uninteresting source of dependence.
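The column-side preprocessing described above can be sketched as follows. This is a hand-rolled helper for illustration, not part of bio_corex; the example matrix is made up, and `keep` lets you map factors back to the original variable indices afterwards.

```python
import numpy as np

def drop_duplicate_columns(X):
    """Return X with exact duplicate columns removed, plus the kept indices."""
    seen = set()
    keep = []
    for j in range(X.shape[1]):
        key = X[:, j].tobytes()  # hashable fingerprint of the column
        if key not in seen:
            seen.add(key)
            keep.append(j)
    return X[:, keep], keep

# Columns 0 and 1 are identical, so only one copy survives.
X = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
X_dedup, keep = drop_duplicate_columns(X)
print(keep)  # indices of the retained columns
```

Fitting CorEx on `X_dedup` removes the artificially perfect mutual-information cluster up front, which is exactly the manual "layer 0" described above.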