Quantify difference between datasets

miguelgfierro / ai_projects

AI projects

https://miguelgfierro.com/

Quantify difference between datasets #51

Closed miguelgfierro closed 6 years ago

miguelgfierro commented 6 years ago

Two dimensions, as stated in this transfer learning tutorial: size and similarity to the original dataset. A) Size can be measured as the total number of examples and the number of examples per class; here maybe we can use a weighted quadratic difference. B) Similarity: there is literature on image similarity (maybe KL divergence for colour and texture).
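
One possible reading of the two measures above, as a minimal sketch. The comment does not define the weighted quadratic difference, so the formula used here (squared relative difference of the totals plus the mean squared relative difference per class, combined with the hypothetical weights `w_total` and `w_class`) is an assumption. The function names, the per-channel colour histograms and the add-one smoothing for the KL term are also assumptions, and texture is not covered.

```python
import numpy as np


def size_difference(counts_a, counts_b, w_total=0.5, w_class=0.5):
    """Weighted quadratic difference between two dicts of per-class counts.

    Returns 0.0 when both datasets have identical class counts and at most
    w_total + w_class when they share nothing.
    """
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a + total_b == 0:
        return 0.0
    # Squared relative difference of the total sizes, in [0, 1].
    total_term = ((total_a - total_b) / (total_a + total_b)) ** 2

    # Mean squared relative difference per class. A class missing from one
    # dataset counts as 0 examples there.
    per_class = []
    for label in set(counts_a) | set(counts_b):
        a, b = counts_a.get(label, 0), counts_b.get(label, 0)
        per_class.append(((a - b) / (a + b)) ** 2 if a + b else 0.0)
    class_term = float(np.mean(per_class))
    return w_total * total_term + w_class * class_term


def colour_histogram(images, bins=32, smoothing=1.0):
    """Normalized per-channel histogram of uint8 images shaped (N, H, W, C).

    The channel histograms are concatenated into one vector that sums to 1.
    Additive smoothing keeps every bin strictly positive so that the KL
    divergence stays finite.
    """
    images = np.asarray(images)
    hists = []
    for channel in range(images.shape[-1]):
        counts, _ = np.histogram(images[..., channel], bins=bins, range=(0, 256))
        hists.append(counts)
    hist = np.concatenate(hists).astype(np.float64) + smoothing
    return hist / hist.sum()


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log(p / q)))


# Same class balance, but dataset B is half the size of dataset A.
size_gap = size_difference({"cat": 100, "dog": 100}, {"cat": 50, "dog": 50})  # about 0.111

# Synthetic stand-ins for two image datasets: one dark, one bright.
rng = np.random.default_rng(0)
dark = rng.integers(0, 128, size=(8, 16, 16, 3), dtype=np.uint8)
bright = rng.integers(128, 256, size=(8, 16, 16, 3), dtype=np.uint8)
colour_gap = kl_divergence(colour_histogram(dark), colour_histogram(bright))
```

KL divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; averaging the two directions is one way to obtain a symmetric score.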

miguelgfierro commented 6 years ago

Structural similarity index (SSIM) for image similarity:

compares a distorted image with a reference image in terms of luminance, contrast and structure, instead of only quantifying the visibility of errors between them
code: http://scikit-image.org/docs/dev/api/skimage.measure.html?highlight=ssim#compare-ssim
slides: https://ece.uwaterloo.ca/~z70wang/publications/SPM09_figures.pdf
paper: https://ece.uwaterloo.ca/~z70wang/publications/ssim.pdf
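
To make the index concrete, here is a simplified sketch of the SSIM formula from the paper, computed once over whole-image statistics with NumPy. This is an assumption made for brevity: the paper and the scikit-image implementation evaluate the index over local windows and average the results, so the numbers will differ from the library's output. The function name `global_ssim` is hypothetical. In recent scikit-image releases the library function is exposed as `skimage.metrics.structural_similarity`.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two images of the same shape.

    Uses the means, variances and covariance of the whole image instead of
    sliding local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("images must have the same shape")

    # Stabilizing constants from the paper: C1 = (K1 * L)^2, C2 = (K2 * L)^2.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Synthetic reference image and a noisy copy of it.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
distorted = np.clip(reference + rng.normal(0, 25, size=reference.shape), 0, 255)

score_same = global_ssim(reference, reference)
score_noisy = global_ssim(reference, distorted)
```

To compare datasets rather than single images, one option is to average the score over pairs of images sampled from the two datasets; that aggregation step is not something the thread specifies.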

miguelgfierro commented 6 years ago

Ways to describe a dataset:

Center. Graphically, the center of a distribution is the point where about half of the observations are on either side.
Spread. The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.
Shape. The shape of a distribution is described by symmetry, skewness, number of peaks, etc.
Unusual features. Unusual features refer to gaps (areas of the distribution where there are no observations) and outliers.
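
A minimal sketch of how these four descriptors could be computed for a one-dimensional feature (for example, mean pixel intensity per image). The function name `describe`, the choice of median and interquartile range, and the 1.5 × IQR outlier rule are assumptions rather than anything fixed by the text above. Number of peaks and gaps are not covered.

```python
import numpy as np


def describe(values):
    """Summarize center, spread, shape and outliers of a 1-D sample."""
    x = np.asarray(values, dtype=np.float64)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    mean, std = x.mean(), x.std()

    # Skewness as the third standardized moment; 0 for a symmetric sample.
    skewness = float(np.mean(((x - mean) / std) ** 3)) if std > 0 else 0.0

    # Outliers by the common 1.5 * IQR rule.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < low) | (x > high)]

    return {
        "center": {"mean": float(mean), "median": float(median)},
        "spread": {"std": float(std), "iqr": float(iqr)},
        "shape": {"skewness": skewness},
        "unusual": {"outliers": outliers.tolist()},
    }


summary = describe([1, 2, 3, 4, 100])
```

Two datasets can then be compared descriptor by descriptor, for instance by looking at the difference between their medians relative to their interquartile ranges.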

miguelgfierro commented 6 years ago

Comparing Measures of Sparsity: https://arxiv.org/abs/0811.4706
Gini index (measures inequality or sparsity): https://github.com/oliviaguest/gini
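
For reference, a small sketch of the Gini index using the standard sorted-values formula. It is not taken from the linked repository, and the function name `gini` is just illustrative. The index is 0 when all values are equal and approaches 1 when a single entry holds all the mass.

```python
import numpy as np


def gini(values):
    """Gini index of a 1-D array of non-negative values."""
    x = np.sort(np.asarray(values, dtype=np.float64).ravel())
    if x.size == 0:
        raise ValueError("values must not be empty")
    if np.any(x < 0):
        raise ValueError("values must be non-negative")
    total = x.sum()
    if total == 0:
        return 0.0  # all zeros: treated as perfectly equal
    n = x.size
    rank = np.arange(1, n + 1)  # 1-based rank of each sorted value
    return float(np.sum((2 * rank - n - 1) * x) / (n * total))


equal = gini([1, 1, 1, 1])   # 0.0
sparse = gini([0, 0, 0, 1])  # 0.75
```

Applied to per-class example counts, a higher value would indicate a more imbalanced dataset, which gives one number to compare between two datasets.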

miguelgfierro commented 6 years ago

Clustering:
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
https://openproceedings.org/2014/conf/edbt/FriesWS14.pdf
performance of clustering algos (DBSCAN and HDBSCAN seem interesting): http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
explanation of different clustering algos: http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
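
The thread does not say how clustering would feed into the comparison, so the following is only one guess: cluster the feature vectors of each dataset with scikit-learn's `DBSCAN` and compare the number of clusters and the fraction of points labelled as noise. The helper name `cluster_summary`, the `eps` and `min_samples` values and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_summary(features, eps=0.5, min_samples=5):
    """Return (number of clusters, fraction of noise points) for one dataset."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    # DBSCAN marks noise points with the label -1.
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction


# Synthetic 2-D features: two tight blobs plus one far-away point.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(50, 2))
blob_b = rng.normal(5.0, 0.1, size=(50, 2))
features = np.vstack([blob_a, blob_b, [[100.0, 100.0]]])

n_clusters, noise_fraction = cluster_summary(features)
```

DBSCAN is sensitive to `eps`, so the features of both datasets should be scaled the same way before their summaries are compared.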