scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License

Approximating precomputed metric #158

Open adamhaber opened 5 years ago

adamhaber commented 5 years ago

I'm trying to use metric-learn to learn a Mahalanobis metric approximation for a precomputed distance metric (that is computed in a different way). Is this functionality supported by the library? I tried to find it in the documentation but couldn't find anything relevant. If anyone has any experience with something similar, I'd be happy to hear about it.

perimosocordiae commented 5 years ago

I'm not sure if there's a standard approach in the literature for this sort of thing, but one thing you could try right now is using your existing distance to generate constraints (e.g., dist(a, b) < dist(c, b)), which can then be used with several of the methods in this library.

adamhaber commented 5 years ago

Interesting, I'll try that, thanks! The problem is that my current precomputed distance matrix is 1000x1000, and I'm not sure there's a principled way to choose which constraints are the "better" ones to use (using all of them would be too many, and probably infeasible since my distance estimates can be noisy). I guess I can start with a random subset and keep adding.

My current approach is using SGD to find the best PSD matrix, in the sense that the distances it predicts are the closest (say, in MSE) to the precomputed matrix. Any thoughts?

perimosocordiae commented 5 years ago

You could try starting with a kNN subset of your precomputed distance matrix. So each row i would have k similar points and 1000 - k dissimilar points, from which you could produce constraints like dist(X[i], X[i_neighbor]) < dist(X[i], X[i_non_neighbor]).
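For concreteness, a minimal sketch of that idea (names like `knn_triplets`, `D`, and `k` are illustrative, and the exact way to feed the resulting constraints to a learner depends on the estimator's API):

```python
import numpy as np

def knn_triplets(D, k=5, n_per_anchor=10, seed=0):
    """Build (anchor, neighbor, non-neighbor) index triplets from a
    precomputed distance matrix D, treating the k nearest points as
    'similar' and sampling the rest as 'dissimilar'."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    triplets = []
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = order[1:k + 1]        # skip the point itself
        non_neighbors = order[k + 1:]
        for _ in range(n_per_anchor):
            pos = rng.choice(neighbors)
            neg = rng.choice(non_neighbors)
            # intended constraint: dist(X[i], X[pos]) < dist(X[i], X[neg])
            triplets.append((i, pos, neg))
    return np.array(triplets)
```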

Some of the methods here assume that you're doing a multi-class learning problem, but the ones that take "constraints" rather than "labels" should work well for you.

Your approach seems reasonable, though given a noisy target matrix you may be better off using a more coarse-grained loss function.

bellet commented 5 years ago

This is an interesting problem, which I think can have various applications. @adamhaber Do you mind explaining the context and use case a bit, and why you want to approximate a precomputed distance? Is it because it is too costly to compute? Or because you only know the values for some particular points and want to generalize to new points? I have seen a similar use case before, which was to approximate a perceptual distance between colors: https://www.researchgate.net/profile/Amaury_Habrard/publication/267337425_Modeling_Perceptual_Color_Differences_by_Local_Metric_Learning/links/544d4a470cf2bcc9b1d8e96f.pdf?disableCoverPage=true

I agree with @perimosocordiae that the best way to go with what we currently have in metric-learn is to generate pairwise/triplet constraints from your precomputed distance matrix. This will learn a distance which does not necessarily match the values of the target distance but will tend to preserve the ordering of distances.

If matching the values themselves matters, then the approach you described based on minimizing MSE seems best. I am thinking that including such an approach in the package may be a good idea, as this "approximate an existing distance" use case may be of general interest. It may even be possible to implement this by calling existing least-squares solvers.
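For reference, since the squared Mahalanobis distance (x - y)^T M (x - y) is linear in the entries of M, matching squared distances is an ordinary least-squares problem in vec(M) (ignoring the PSD constraint). A minimal sketch of that idea, assuming `X` holds the features and `D_target` the precomputed distances; this only stays feasible for a modest number of pairs and features, otherwise one would subsample pairs or switch to SGD:

```python
import numpy as np

def fit_mahalanobis_lstsq(X, D_target):
    """Least-squares fit of M so that (x_i - x_j)^T M (x_i - x_j)
    approximates D_target[i, j] ** 2; only symmetrizes the result,
    a PSD projection can be applied afterwards."""
    n, d = X.shape
    rows, cols = np.triu_indices(n, k=1)
    diffs = X[rows] - X[cols]                              # (n_pairs, d)
    # each squared distance is <vec(M), vec(diff diff^T)>, i.e. linear in M
    A = np.einsum('pi,pj->pij', diffs, diffs).reshape(len(rows), d * d)
    b = D_target[rows, cols] ** 2
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    M = m.reshape(d, d)
    return (M + M.T) / 2                                   # symmetrize
```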

bellet commented 5 years ago

Of course if you do not care about being able to generalize to new points (and are willing to ignore the feature representation of your points), you could use MDS: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
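For reference, a minimal example of that route, assuming `D` is the precomputed distance matrix and using an illustrative embedding dimension:

```python
from sklearn.manifold import MDS

# metric MDS on a precomputed dissimilarity matrix; note the embedding
# only covers the original points and does not generalize to new ones
mds = MDS(n_components=10, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(D)   # shape (n_points, 10)
```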

adamhaber commented 5 years ago

Thanks for the link @bellet , seems super interesting!

As for the reason for learning a precomputed metric - exactly the two reasons you've mentioned. :-) I am indeed interested in learning the values themselves (and generalising to unseen pairs). At the moment I've mostly seen various eigenvalue-based and convex-optimisation metric learning methods, but I'm not sure these scale very well (still reading about this). I'll update if I find anything interesting.

Anyway, if you're planning on adding this as a feature to the package, I'd be happy to take part.

bellet commented 5 years ago

SGD applied to the MSE loss you described should scale nicely with the number of points. Maybe you mean scaling with respect to the number of features?

Do keep us posted if you end up with clean, working code for this, so that you (if you have time) or someone else could try to integrate it into metric-learn. We would need to think about the best option in terms of API (we would probably need a new class for a pairwise metric learner where the labels are real numbers instead of binary).

adamhaber commented 5 years ago

Indeed, I mean scaling with the number of features.

SGD scales well with the number of features and the number of points, but it's not clear how to impose the PSD constraint. I tried running SGD on the Cholesky factors, but results weren't very good. I also tried using different convex optimisation solvers (CVXPY, MOSEK, etc.), but these don't scale well with the number of features and the number of constraints I have (which is quadratic in the number of points, so easily tens of thousands of constraints).

bellet commented 5 years ago

How many features do you have? Enough to make the projection onto the PSD cone too costly? If it is bearable, a common trick is to do the projection only every few iterations of SGD rather than at every iteration.
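For concreteness, the projection itself is just an eigendecomposition with the negative eigenvalues clipped to zero; a minimal sketch:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing out
    its negative eigenvalues."""
    M = (M + M.T) / 2                 # ensure symmetry
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T
```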

adamhaber commented 5 years ago

For a medium-sized problem, I have roughly 100 features (meaning the Mahalanobis matrix is 100x100) and around 500,000 distances (whose MSE is my objective). Of course, I'd be more than happy to scale this to larger problems. :-) Just not sure how, at the moment.

bellet commented 5 years ago

Projection onto the PSD cone (based on the eigendecomposition of M) is quite scalable when the feature space is in the hundreds, even a few thousand. For higher dimensions, you can avoid the projection with some specific parameterization and/or algorithms, see for instance http://researchers.lille.inria.fr/abellet/papers/aaai14.pdf and https://arxiv.org/pdf/1807.07789.pdf

adamhaber commented 5 years ago

Thanks! Are you aware of any efficient software implementation of this? Running SGD on the matrix to optimise the MSE is pretty straightforward (and scalable) with PyTorch/TF (though I've had problems with vanishing gradients), but I'm not sure how to do the projection...

bellet commented 5 years ago

These are algorithms we should probably integrate into metric-learn in the future. I have some old Matlab code for the first one, which should scale well as some key parts are implemented in C: https://github.com/bellet/SCML

For running SGD on MSE loss with projection, you could try to write a simple Python implementation, which should scale well except for the SGD for loop (you could always write this part in Cython if needed). And then you can contribute it to metric-learn ;-)
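A minimal sketch of what such a projected-SGD loop could look like in plain NumPy, assuming `X` (n x d features) and `D_target` (n x n precomputed distances); the names and hyperparameters are illustrative, not a definitive implementation:

```python
import numpy as np

def sgd_mse_metric(X, D_target, n_iter=10000, batch_size=64,
                   lr=1e-3, project_every=50, seed=0):
    """Projected SGD on the MSE between learned squared Mahalanobis
    distances and the squared precomputed distances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = np.eye(d)
    for it in range(n_iter):
        i = rng.integers(0, n, size=batch_size)
        j = rng.integers(0, n, size=batch_size)
        diffs = X[i] - X[j]                               # (b, d)
        pred = np.einsum('bi,ij,bj->b', diffs, M, diffs)  # squared distances
        resid = pred - D_target[i, j] ** 2
        # d/dM of mean((pred - target)^2) = mean(2 * resid * diff diff^T)
        grad = 2 * np.einsum('b,bi,bj->ij', resid, diffs, diffs) / batch_size
        M -= lr * grad
        if it % project_every == 0:
            # eigenvalue-clipping PSD projection, applied every few steps
            M = (M + M.T) / 2
            w, V = np.linalg.eigh(M)
            M = V @ np.diag(np.clip(w, 0, None)) @ V.T
    return M
```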