Open smilesun opened 5 years ago
As an approximation we can
Not sure, but could we also compute the KL divergence (or any other divergence) between two datasites? Then we could check whether the datasites produced by the EM-clustering split have a larger divergence than those from the stratified split.
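To make the KL idea concrete, here is a minimal sketch for a single 1-D feature, assuming we estimate each datasite's distribution with a histogram over shared bins. The helper names (`binned_probs`, `kl_divergence`, `datasite_kl`), the bin count, and the smoothing constant `eps` are my own choices for illustration, not anything from this repo, and the toy data is synthetic.

```python
import math


def binned_probs(values, lo, hi, n_bins, eps=1e-9):
    """Histogram `values` into n_bins equal-width bins on [lo, hi].

    eps is a small smoothing constant (an assumption) so that empty bins
    do not produce log(0) or division by zero in the KL sum.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    return [(c + eps) / (total + eps * n_bins) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def datasite_kl(site_a, site_b, n_bins=10):
    """Histogram-based KL(site_a || site_b) using bins shared by both sites."""
    lo = min(min(site_a), min(site_b))
    hi = max(max(site_a), max(site_b))
    if hi == lo:
        return 0.0  # all values identical, nothing to compare
    p = binned_probs(site_a, lo, hi, n_bins)
    q = binned_probs(site_b, lo, hi, n_bins)
    return kl_divergence(p, q)


# Toy data: an interleaved ("stratified-like") split versus a
# contiguous ("cluster-like") split of the same values.
data = [0.1 * i for i in range(100)]
stratified = (data[0::2], data[1::2])
clustered = (data[:50], data[50:])

print("stratified split KL:", datasite_kl(*stratified))
print("clustered split KL:", datasite_kl(*clustered))
```

Note that KL is not symmetric, so `datasite_kl(a, b)` and `datasite_kl(b, a)` can differ, and the value depends on the binning. If SciPy is available, `scipy.stats.entropy(p, q)` computes the same sum for two probability vectors.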
see #25
How about the Calinski-Harabasz score:
from sklearn import metrics
# renamed from calinski_harabaz_score in newer scikit-learn releases
metrics.calinski_harabasz_score(data_pca, x_pred)
Suppose we have an algorithm that splits an arbitrary dataset into several parts, which we call datasites. How can we evaluate how unevenly the data is distributed across the datasites? Does the Mann-Whitney test only work for small data?
We need to think seriously about how to evaluate or profile the split of a dataset into partitions (datasites): how unevenly distributed they are, etc. I know there are large-sample versions of the Mann-Whitney test for comparing the distributions of datasets, but I am not sure whether there are other methods.
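On the sample-size question: for large samples the Mann-Whitney U statistic is usually compared against a normal approximation, so size by itself is not the obstacle. Below is a rough pure-Python sketch of that approximation for one feature of two datasites. The function names are hypothetical, and the variance term has no tie correction, which is a simplification I am assuming is acceptable for a first look.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(site_a, site_b):
    """Return (U, z, two-sided p) using the large-sample normal approximation.

    Simplification: the variance has no tie correction.
    """
    n1, n2 = len(site_a), len(site_b)
    ranks = average_ranks(list(site_a) + list(site_b))
    rank_sum_a = sum(ranks[:n1])
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, z, p


# Toy data: an interleaved split versus a contiguous split of 0..1999.
values = list(range(2000))
print("interleaved:", mann_whitney_u(values[0::2], values[1::2]))
print("contiguous: ", mann_whitney_u(values[:1000], values[1000:]))
```

A small p-value suggests the two datasites differ in location for that feature. With more than two datasites this would have to be run pairwise, or replaced by a multi-sample test. If SciPy is an option, `scipy.stats.mannwhitneyu` is the library version of this test.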