gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU execution, and overlapping structure, and includes visualizations
Apache License 2.0

Should we remove duplicates? #9

Open ianchute opened 5 years ago

ianchute commented 5 years ago

I have a huge number of duplicates in my data and I'm wondering if it is fine to remove them. I'm aware that removing duplicates changes the distribution of the data, and that many machine learning models would not handle that well. My problem is that my data is huge, so anything that makes CorEx faster (like duplicate removal) would be of great benefit.

So, is it fine to remove duplicates? If yes, what are the adverse effects?
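For the speed concern, one common approach is to collapse exact duplicate rows before fitting. This is a minimal sketch, not from the thread: the toy matrix is made up, and whether you feed the counts back in as sample weights is left to you.

```python
import numpy as np

# Toy data standing in for a (samples x variables) matrix;
# rows 0, 1, and 3 are exact duplicates.
X = np.array([[0, 1, 1],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 1]])

# Collapse exact duplicate rows. `counts` records how many times each
# unique row appeared, so no information about the empirical
# distribution is lost, only redundant copies.
X_unique, counts = np.unique(X, axis=0, return_counts=True)

print(X_unique.shape)  # far fewer rows to pass to CorEx's fit()
print(counts)          # multiplicities of each unique row
```

You would then fit CorEx on `X_unique` instead of `X`; the trade-off is that rare and common rows now contribute equally unless you reweight them.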

gregversteeg commented 5 years ago

Yes, I think you should remove duplicates. CorEx looks for clusters of variables with high mutual information. Duplicate columns have the highest mutual information possible, so they will dominate the signal and possibly wash out more interesting relationships. I would look at it this way: duplicates reflect something artificial about the data processing, so by taking them out we can discover the intrinsic relationships in the data.

Another way to look at this is hierarchically. CorEx will lump together duplicates in the first layer and associate these duplicate columns with a single factor. Then you might be able to find weaker relationships between the factor representing the duplicates and other factors in the second layer. By taking out the duplicates, you are essentially adding a "layer 0" that manually extracts this known and uninteresting source of dependence.
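The column-side preprocessing described above can be sketched as follows. This is a hand-rolled helper for illustration, not part of bio_corex; the example matrix is made up, and `keep` lets you map factors back to the original variable indices afterwards.

```python
import numpy as np

def drop_duplicate_columns(X):
    """Return X with exact duplicate columns removed, plus the kept indices."""
    seen = set()
    keep = []
    for j in range(X.shape[1]):
        key = X[:, j].tobytes()  # hashable fingerprint of the column
        if key not in seen:
            seen.add(key)
            keep.append(j)
    return X[:, keep], keep

# Columns 0 and 1 are identical, so only one copy survives.
X = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
X_dedup, keep = drop_duplicate_columns(X)
print(keep)  # indices of the retained columns
```

Fitting CorEx on `X_dedup` removes the artificially perfect mutual-information cluster up front, which is exactly the manual "layer 0" described above.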