miguelgfierro / ai_projects

AI projects
https://miguelgfierro.com/

Quantify difference between datasets #51

Closed miguelgfierro closed 6 years ago

miguelgfierro commented 6 years ago

There are two dimensions, as stated in this transfer learning tutorial: size, and similarity to the original dataset. A) Size can be measured as the total number of examples and the number of examples per class; here maybe we can use a weighted quadratic difference. B) Similarity: there is literature on image similarity (maybe KL divergence for colour and texture).
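A rough sketch of both ideas, using only the standard library. The function names, the default weight of 1.0 per class, and the use of pre-computed histograms (for example, colour histograms with identical bins) are assumptions made for illustration, not something the tutorial prescribes.

```python
import math


def size_difference(counts_a, counts_b, weights=None):
    """Weighted quadratic difference between per-class example counts.

    counts_a, counts_b: dicts mapping class name -> number of examples.
    weights: optional dict mapping class name -> weight (assumed 1.0 if missing).
    """
    weights = weights or {}
    classes = set(counts_a) | set(counts_b)
    return sum(
        weights.get(c, 1.0) * (counts_a.get(c, 0) - counts_b.get(c, 0)) ** 2
        for c in classes
    )


def kl_divergence(hist_p, hist_q, eps=1e-12):
    """KL(P || Q) between two histograms that share the same bins.

    Histograms are normalised to sum to 1; eps avoids log(0) when a bin
    is empty in Q (an assumption, other smoothing schemes are possible).
    """
    total_p, total_q = sum(hist_p), sum(hist_q)
    p = [h / total_p for h in hist_p]
    q = [h / total_q for h in hist_q]
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


size_gap = size_difference({"cat": 100, "dog": 50}, {"cat": 90, "dog": 50})
colour_gap = kl_divergence([3, 1], [1, 3])
```

Note that KL divergence is not symmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ; averaging the two directions is one option if a symmetric score is needed.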

miguelgfierro commented 6 years ago

Structural similarity index (SSIM) for image similarity.
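For reference, a simplified sketch of the SSIM formula computed once over the whole image. The usual SSIM averages this quantity over local windows, so this global version is only an approximation; the function name and the flat-list image representation are assumptions for illustration. The constants use the commonly quoted defaults K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(img_x, img_y, data_range=255.0):
    """Single-window SSIM over two greyscale images given as flat lists.

    This is a simplification: standard SSIM is the mean of this value
    computed over many local windows.
    """
    if not img_x or len(img_x) != len(img_y):
        raise ValueError("images must be non-empty and the same size")
    n = len(img_x)
    mu_x = sum(img_x) / n
    mu_y = sum(img_y) / n
    var_x = sum((a - mu_x) ** 2 for a in img_x) / n
    var_y = sum((b - mu_y) ** 2 for b in img_y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(img_x, img_y)) / n
    # Stabilising constants for the luminance and contrast/structure terms.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


same = global_ssim([0, 50, 100, 150], [0, 50, 100, 150])
flipped = global_ssim([0, 50, 100, 150], [150, 100, 50, 0])
```

Identical images score 1, and the score drops as luminance, contrast, or structure diverge.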

miguelgfierro commented 6 years ago

Ways to describe a dataset:
Center. Graphically, the center of a distribution is the point where about half of the observations are on either side.
Spread. The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.
Shape. The shape of a distribution is described by symmetry, skewness, number of peaks, etc.
Unusual features. Unusual features refer to gaps (areas of the distribution where there are no observations) and outliers.
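These four descriptors could be computed per feature and then compared across datasets. A minimal sketch with the standard library follows; the choice of median for center, IQR and standard deviation for spread, moment-based skewness for shape, and the 1.5 × IQR rule for outliers are all assumptions, and gap detection is left out.

```python
import statistics


def describe(values):
    """Summarise center, spread, shape and outliers of a 1-D sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    iqr = q3 - q1
    # Moment-based (population) skewness; 0.0 if all values are equal.
    if stdev > 0:
        skewness = sum((v - mean) ** 3 for v in values) / len(values) / stdev ** 3
    else:
        skewness = 0.0
    # Values outside the 1.5 * IQR fences are flagged as outliers.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "median": statistics.median(values),
        "stdev": stdev,
        "iqr": iqr,
        "skewness": skewness,
        "outliers": [v for v in values if v < low or v > high],
    }


summary = describe([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```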

miguelgfierro commented 6 years ago

Comparing Measures of Sparsity: https://arxiv.org/abs/0811.4706
Gini index (measures inequality or sparsity): https://github.com/oliviaguest/gini
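As a quick illustration of the Gini index, here is a small standalone sketch based on the sorted-values formula. It is not taken from the linked repository, and the function name and the handling of an all-zero input are assumptions.

```python
def gini(values):
    """Gini index of non-negative values: 0 = perfectly equal, near 1 = very sparse."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumption: treat empty or all-zero input as "no inequality"
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n over sorted values.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


equal = gini([1, 1, 1, 1])
sparse = gini([0, 0, 0, 1])
```

With this formula a vector with a single non-zero entry out of n scores (n - 1) / n, so longer sparse vectors approach 1.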

miguelgfierro commented 6 years ago

Clustering: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html https://openproceedings.org/2014/conf/edbt/FriesWS14.pdf
Performance of clustering algos: http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
DBSCAN & HDBSCAN seem interesting.
Explanation of different clustering algos: http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
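To make the DBSCAN idea concrete, here is a brute-force sketch in plain Python. It is O(n²) and meant only to show the core/border/noise logic; for real data the scikit-learn and hdbscan implementations linked above are the better choice. The function name and parameter names here are assumptions.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point: cluster id 0, 1, ... or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The number of clusters found and the fraction of noise points could then serve as coarse descriptors when comparing two datasets.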