How to compare the distributions of different parts of a split dataset (Mann-Whitney test or something else?)

slds-lmu / paper_2019_multiobjective_rfms

High Dimensional Restrictive Federated Model Selection with multi-objective Bayesian Optimization over shifted distributions

https://github.com/smilesun/tex_2018_intellisys2019_fmoboms


How to compare the distributions of different parts of a split dataset (Mann-Whitney test or something else?) #8

Open smilesun opened 5 years ago

smilesun commented 5 years ago

Suppose we have an algorithm that splits an arbitrary dataset into several parts, which we call datasites. How can we evaluate how unevenly the data is distributed across the datasites? Does the Mann-Whitney test only work for small data?

We need to think seriously about how to evaluate or profile the split of a dataset into partitions (datasites), e.g. how unevenly distributed they are. I know there are Mann-Whitney tests for large datasets that compare the distributions of several datasets, but I am not sure whether there are other methods.
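One possible starting point, as a minimal sketch rather than a settled method: run a two-sample test per feature between each pair of datasites. `scipy.stats.mannwhitneyu` is not restricted to small samples, and the Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) is an alternative that reacts to any difference in the distribution, not only a location shift. The helper name `compare_datasites` and the synthetic datasites below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats


def compare_datasites(site_a, site_b):
    """Run per-feature two-sample tests between two datasites.

    site_a, site_b: 2-D arrays of shape (n_samples, n_features).
    Returns a list of (feature_index, mannwhitney_p, ks_statistic, ks_p).
    """
    results = []
    for j in range(site_a.shape[1]):
        # Mann-Whitney U: mainly sensitive to location shifts.
        _, mw_p = stats.mannwhitneyu(site_a[:, j], site_b[:, j],
                                     alternative="two-sided")
        # Kolmogorov-Smirnov: sensitive to any difference between the CDFs.
        ks_stat, ks_p = stats.ks_2samp(site_a[:, j], site_b[:, j])
        results.append((j, mw_p, ks_stat, ks_p))
    return results


rng = np.random.default_rng(0)
# Hypothetical datasites: feature 0 is shifted in site_b, feature 1 is not.
site_a = rng.normal(loc=0.0, size=(5000, 2))
site_b = rng.normal(loc=[0.5, 0.0], size=(5000, 2))
results = compare_datasites(site_a, site_b)
for j, mw_p, ks_stat, ks_p in results:
    print(f"feature {j}: MW p={mw_p:.3g}, KS stat={ks_stat:.3f}, KS p={ks_p:.3g}")
```

With large datasites even tiny differences become statistically significant, so an effect size such as the KS statistic is probably more useful for ranking splits than the p-value alone. Per-feature tests also ignore dependencies between features.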

pfistfl commented 5 years ago

As an approximation, we can look at the differences between the first n principal components.
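A rough sketch of one way to read this suggestion: fit the principal components on the pooled data, project every datasite onto the same first n components, and compare the per-site mean scores. The function name `pca_site_differences` and the synthetic datasites are assumptions, and comparing means is only one of several possible summaries.

```python
import numpy as np


def pca_site_differences(sites, n_components=2):
    """Project all datasites onto shared principal components and
    return each site's mean score per component (one row per site)."""
    pooled = np.vstack(sites)
    center = pooled.mean(axis=0)
    # Rows of vt are the principal directions of the pooled, centered data.
    _, _, vt = np.linalg.svd(pooled - center, full_matrices=False)
    components = vt[:n_components]
    return np.array([((s - center) @ components.T).mean(axis=0) for s in sites])


rng = np.random.default_rng(0)
# Hypothetical datasites: the second one is shifted along feature 0.
site_a = rng.normal(size=(1000, 5))
site_b = rng.normal(size=(1000, 5))
site_b[:, 0] += 3.0
means = pca_site_differences([site_a, site_b], n_components=2)
gap = np.abs(means[0] - means[1])
print("mean scores per site:\n", means)
print("absolute gap per component:", gap)
```

Fitting the components on the pooled data keeps the projections comparable across datasites. Fitting a separate PCA per site would give axes that can differ in sign and order.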

smilesun commented 5 years ago

Not sure whether we could also compute the KL divergence (or any other divergence) between two datasites. Then we could check whether the datasites produced by the EM-clustering split have a larger divergence than those from the stratified split.
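As a hedged illustration of the idea, the sketch below estimates the KL divergence for a single feature from histograms on shared bins and compares a random split with a split that follows the group structure. The function `histogram_kl`, the bin count, the smoothing constant `eps`, and the synthetic data are all assumptions. The by-component split only stands in for a clustering-based split and is not the EM procedure itself.

```python
import numpy as np


def histogram_kl(x, y, bins=30, eps=1e-9):
    """Estimate KL(P_x || P_y) for one feature from histograms on shared bins."""
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    # Add eps so empty bins do not produce log(0) or division by zero.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
# Hypothetical one-dimensional data drawn from two overlapping groups.
group_0 = rng.normal(-1.0, 1.0, size=2000)
group_1 = rng.normal(1.0, 1.0, size=2000)
values = np.concatenate([group_0, group_1])

# Random split: shuffle, then cut in half.
shuffled = rng.permutation(values)
kl_random = histogram_kl(shuffled[:2000], shuffled[2000:])

# Structure-following split: one datasite per group.
kl_clustered = histogram_kl(group_0, group_1)

print(f"random split KL:    {kl_random:.4f}")
print(f"clustered split KL: {kl_clustered:.4f}")
```

KL divergence is asymmetric, and histogram estimates depend strongly on the binning and on how empty bins are smoothed. They also scale poorly to many dimensions, so in a high-dimensional setting this would probably have to be applied per feature or on a low-dimensional projection.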

smilesun commented 5 years ago

see #25

smilesun commented 5 years ago

how about

from sklearn import metrics
# calinski_harabaz_score was renamed to calinski_harabasz_score in scikit-learn 0.20
metrics.calinski_harabasz_score(data_pca, x_pred)
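To make the suggestion concrete, here is a small sketch that uses the datasite assignment as the label vector, so that a higher Calinski-Harabasz score indicates better separated datasites. It assumes `data_pca` holds PCA-transformed features and `x_pred` holds the datasite label of each row. The synthetic data and the names `clustered_split` and `random_split` are illustrative only, and the call uses the current scikit-learn spelling `calinski_harabasz_score`.

```python
import numpy as np
from sklearn import metrics
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: two groups separated along the first feature.
data = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 5)),
    rng.normal(0.0, 1.0, size=(500, 5)) + [4, 0, 0, 0, 0],
])
data_pca = PCA(n_components=2).fit_transform(data)

# Datasite labels from a split that follows the group structure ...
clustered_split = np.repeat([0, 1], 500)
# ... and from a random split that ignores it.
random_split = rng.permutation(clustered_split)

score_clustered = metrics.calinski_harabasz_score(data_pca, clustered_split)
score_random = metrics.calinski_harabasz_score(data_pca, random_split)
print("clustered split:", score_clustered)
print("random split:   ", score_random)
```

The score is a ratio of between-group to within-group dispersion, so it mostly captures differences in datasite means and may miss splits that differ only in spread or shape.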