FiratIsmailoglu opened 3 years ago
Hi @FiratIsmailoglu, thanks for using metric-learn!
I am not familiar with gene microarray classification, so I couldn't say which algorithm exactly would be best, but indeed, since your problem is classification, you should use the supervised ones as you said (http://contrib.scikit-learn.org/metric-learn/metric_learn.html#supervised-learning-algorithms); maybe you could try them all (except MLKR, which is for regression) and see which works best ;)
If they are too expensive to compute or give bad results, knowing that you have a high-dimensional dataset, you could play with the n_components argument and set a smaller number of components. For better results you can also play with the other arguments, in particular init (it is set to 'auto' by default but you can explicitly specify one particular initialization).
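For illustration, here is a minimal sketch of what this could look like, using NCA as an example (the values are placeholders, not recommendations):

from metric_learn import NCA

# n_components shrinks the output dimension of the learned
# transformation, which reduces the number of parameters to learn;
# init replaces the default 'auto' with an explicit initialization.
nca = NCA(n_components=100, init='pca')
# nca.fit(X, y)  # X: (n_samples, n_features), y: class labels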
Regarding the preprocessing, you could try to simply center and normalize your data, and I would advise against a PCA pre-processing: indeed, metric-learn algorithms already learn a matrix L to transform the data, just like PCA (considering the data is centered). So if you do metric learning after having done a PCA pre-processing, you will only be able to search for transformations of your space of the form L L_PCA (varying L, with L_PCA fixed by the previous pre-processing step), which can only do worse (in terms of the cost function of the metric-learning algorithm) than if you optimized over the whole space of possible matrices in the first place to get the best transformation for your task at hand.
(It could happen, however, that doing a PCA pre-processing gives better results on the test set, because it could prevent overfitting in its own way, but I would still advise doing no PCA pre-processing and regularizing the problem by reducing the number of components of the metric-learning algorithm if needed. Also, I think I remember empirically trying PCA pre-processing for dimensionality reduction before the metric-learning algorithm, and since it's unsupervised, the labels were quite mixed together in the lower-dimensional space, making it hard(er) for the subsequent metric-learning algorithm to learn a good transformation.)
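A minimal sketch of that preprocessing advice, with NCA as an arbitrary choice of supervised learner: center and scale the features (no PCA), then learn the metric on the standardized data.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from metric_learn import NCA

# Standardize (center + unit variance), then learn the metric;
# no PCA step, per the advice above.
pipe = make_pipeline(StandardScaler(), NCA(n_components=100))
# pipe.fit(X, y)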
Hi @FiratIsmailoglu, to complement @wdevazelhes's reply, you could try SCML (more precisely in your case, its supervised version SCML_Supervised), which we recently added to the library:
http://contrib.scikit-learn.org/metric-learn/weakly_supervised.html#scml
http://contrib.scikit-learn.org/metric-learn/generated/metric_learn.SCML_Supervised.html
It allows you to learn a metric as a weighted combination of simple rank-one matrices, which can either be constructed automatically from the data (basis='lda') or given by you as input (e.g., based on knowledge of the problem). The advantage for high-dimensional data is that the number of parameters to learn depends on the number of said bases rather than on the dimensionality, so if you pick a reasonable number of bases the algorithm should scale more easily than other algorithms. You can check the above documentation as well as the original paper for more details:
http://researchers.lille.inria.fr/abellet/papers/aaai14.pdf
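As a minimal sketch of how this looks in code (n_basis=200 is an arbitrary illustration, not a recommendation):

from metric_learn import SCML_Supervised

# With basis='lda' the rank-one bases are generated from the data;
# n_basis caps how many, which bounds the number of learned weights
# independently of the input dimensionality.
scml = SCML_Supervised(basis='lda', n_basis=200)
# scml.fit(X, y)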
With other supervised metric learning algorithms, as advised by @wdevazelhes, you should not use PCA pre-processing but rather set n_components to something much smaller than the data dimension.
Hope this helps!
Aurélien
@wdevazelhes Hi William, thank you so much for your support. Following your post, I immediately gave up on PCA and tried reducing n_components. Sadly, I could not reduce the running time significantly, considering LMNN, NCA and LFDA in the supervised setting. Specifically, I reduced n_components from around 10K to 100, but did not observe a remarkable difference in terms of time. But thank you anyway :)
@bellet Aurélien, thank you for your suggestion and for your time. Honestly, in my first implementations I tried only the metric learning methods in the given example (plot_metric_learning_examples.py), so I had no idea about SCML. However, after your post I did try SCML and had a look at your paper. It really works for high-dimensional datasets, and it is really fast. My observation is that examples from the same class are close to one another, while those from different classes are far apart in the transformed space created by SCML; so I was expecting to see some improvements for distance-based classifiers such as k-NN and LVQ in the transformed space, but interestingly I did not witness such improvements. Maybe there is a need to play with the number of bases. Thank you once again.
@FiratIsmailoglu I'm sorry reducing the number of components didn't help, but I'm glad that SCML worked well when observing the new distances! Yes, maybe playing around with the different parameters will help, even the parameters of k-NN and LVQ too; sometimes metric-learning algorithms work better for certain values of the number of neighbors in k-NN (though I don't know what would be the best way to improve the performance of a downstream classifier with SCML, @bellet what do you think?)
For tweaking the parameters, you may have already used this, but metric-learn respects the scikit-learn API, so you can build your classification predictor (e.g. SCML_Supervised + KNN) as a sklearn.pipeline.Pipeline and then use all the scikit-learn utilities for cross-validation and grid-searching out of the box, as well as other packages specialized in hyperparameter optimization like scikit-optimize: https://scikit-optimize.github.io/stable/
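A minimal sketch of that Pipeline + grid-search idea (the parameter grid below is an arbitrary illustration):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from metric_learn import SCML_Supervised

# Chain the metric learner and the classifier, then search jointly
# over both steps' hyperparameters with cross-validation.
pipe = Pipeline([('scml', SCML_Supervised()),
                 ('knn', KNeighborsClassifier())])
search = GridSearchCV(pipe,
                      param_grid={'scml__n_basis': [100, 200],
                                  'knn__n_neighbors': [3, 5, 7]},
                      cv=3)
# search.fit(X, y); search.best_params_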
I hope this helps!
Yes, tuning some of the hyperparameters of SCML_Supervised can probably help! In addition to n_basis, you may also want to adjust k_genuine and k_impostor, which define how many triplet constraints are generated from the labeled dataset.
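A minimal sketch of adjusting those triplet-generation parameters (the values are arbitrary illustrations; the defaults are k_genuine=3 and k_impostor=10):

from metric_learn import SCML_Supervised

# k_genuine: same-class neighbors used to build triplets;
# k_impostor: other-class neighbors used as negatives.
scml = SCML_Supervised(k_genuine=5, k_impostor=15)
# scml.fit(X, y)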
Thanks all for the comments and the related discussion. I am trying to better understand the basis and n_basis arguments and their role inside the SCML_Supervised algorithm. Is this a way of fixing the points that we want to use for computing the differences? Could you please provide some more specific information or a related example with a toy dataset (e.g., the wine dataset)? Furthermore, it is somewhat strange that the documentation of the SCML_Supervised algorithm contains an example of the weakly supervised SCML variant; you would probably like to fix it. Thanks a lot again.
Thanks @terry07 for your interest in metric-learn. SCML learns a Mahalanobis distance where the learned matrix takes the form of a weighted sum of a fixed set of rank-1 "bases", see
http://contrib.scikit-learn.org/metric-learn/weakly_supervised.html#scml
(You can also check the paper linked above for more details about SCML.)
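To make the form of the learned matrix concrete, here is a small numpy illustration (not metric-learn code): M is a nonnegative weighted sum of rank-one bases, and the squared distance is (x - x')^T M (x - x').

import numpy as np

rng = np.random.default_rng(0)
d, n_basis = 4, 3
B = rng.normal(size=(n_basis, d))      # rows are the bases b_i
w = np.abs(rng.normal(size=n_basis))   # nonnegative weights w_i
# M = sum_i w_i * b_i b_i^T, symmetric PSD by construction
M = sum(w_i * np.outer(b_i, b_i) for w_i, b_i in zip(w, B))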
In the supervised setting, in the absence of specific knowledge about your task that could suggest the use of particular bases, we recommend using discriminative bases generated by LDA (the default behavior, basis='lda'). You may additionally tune the number of bases by changing the parameter n_basis if you like.
Thanks for pointing out the fact that we do not have a proper example with SCML_Supervised in the doc. Here's a simple one with the wine dataset, which shows much better KNN accuracy with the distance learned by SCML than with the Euclidean distance.
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from metric_learn import SCML_Supervised
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Baseline: KNN with the default Euclidean distance
knn_euc = KNeighborsClassifier()
# KNN on the representation learned by SCML_Supervised
knn_scml = make_pipeline(SCML_Supervised(), KNeighborsClassifier())

X, y = load_wine(return_X_y=True)
print("CV accuracy of KNN with Euclidean distance:", cross_val_score(knn_euc, X, y).mean())
print("CV accuracy of KNN with distance learned with SCML_Supervised:", cross_val_score(knn_scml, X, y).mean())
Hi, thank you very much for making such a great library accessible. I'd like to get your advice regarding large datasets. That is, I have some supervised gene microarray datasets with around 10K features, and my goal is classification. So, which supervised metric learning algorithm would you recommend in this case, and what kind of preprocessing should I do prior to applying the metric learning library? Many thanks.
Note: having applied PCA, the algorithms perform quite badly.