scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License
1.4k stars · 233 forks

[MRG] Bilinear similarity #329

Open mvargas33 opened 3 years ago

mvargas33 commented 3 years ago

Hi!

With the aim of implementing the OASIS algorithm, I'm opening this PR to discuss the implementation of the BilinearMixin class, which is a prerequisite for the algorithm.

I've written some tests, but in a file outside the test folder, as I am not yet familiar with the correct way to set up tests here. These are basic checks of the bilinear similarity with random and handmade toy examples.

As the matrix W in the bilinear similarity is not guaranteed to be symmetric or PSD, I did not extend the MetricTransformer class and did not implement the transform() method.

Also, for score_pairs() I propose two implementations, and I am not sure which one performs better.
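For context, here is a minimal sketch of two interchangeable ways such a bilinear score_pairs() could be written; the function names are hypothetical, not the PR's actual code:

```python
import numpy as np

def score_pairs_loop(W, pairs):
    """Bilinear scores u^T W v, computed one pair at a time."""
    return np.array([u.dot(W).dot(v) for u, v in pairs])

def score_pairs_einsum(W, pairs):
    """The same scores in a single vectorized einsum call."""
    u, v = pairs[:, 0], pairs[:, 1]
    return np.einsum('ij,jk,ik->i', u, W, v)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))          # not symmetric, not PSD in general
pairs = rng.normal(size=(5, 2, 3))   # 5 pairs of 3-dimensional vectors
assert np.allclose(score_pairs_loop(W, pairs), score_pairs_einsum(W, pairs))
```

The einsum version avoids the Python-level loop, so it is usually the faster of the two for large batches of pairs.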

Note: along the way this fixes #160

bellet commented 3 years ago

You should also ensure compatibility with the pairs, triplets and quadruplets classifiers, to make sure we can then build actual bilinear similarity learning algorithms. Probably many existing tests for these will naturally pass with bilinear similarity, while others may need to be adapted.

mvargas33 commented 3 years ago

@bellet @perimosocordiae @wdevazelhes @terrytangyuan

There needs to be a broader discussion regarding adding BilinearSimilarity and Bilinear Similarity Learners such as OASIS.

By definition, a bilinear similarity works in the opposite direction from a metric such as the Mahalanobis distance.

Given two vectors u, v in R^n, if their bilinear similarity S(u,v) is large and positive, the vectors are close. Conversely, if S(u,v) is negative, the vectors point in different directions and are far apart. This contrasts with a metric, where a large value means the vectors are far apart.
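A tiny numeric illustration of this sign convention (toy values, not taken from the PR):

```python
import numpy as np

W = np.eye(2)  # identity W makes S(u, v) the plain dot product
u = np.array([1.0, 0.0])

v_close = np.array([0.9, 0.1])   # nearly aligned with u
v_far = np.array([-1.0, 0.0])    # opposed to u

S = lambda a, b: a @ W @ b       # bilinear similarity S(u, v) = u^T W v

print(S(u, v_close))             # 0.9  -> similar pair, high score
print(S(u, v_far))               # -1.0 -> dissimilar pair, negative score
print(np.linalg.norm(u - v_close))  # ~0.14 -> similar pair, *low* distance
```

So a classifier consuming raw similarities must treat high values as "close", the exact opposite of what it does with distances.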

Regarding the API, this means that score_pairs() has opposite semantics for MahalanobisMixin and BilinearMixin, even though both implement it and both inherit from BaseMetricLearner.

There are several paths that can be followed to deal with this:

  1. Multiply the result of score_pairs() by -1 in BilinearMixin, so that the global semantics of the function remain the same across the whole package, and add a warning in the documentation.
  2. Leave score_pairs() unchanged and return the raw bilinear similarity. This may cause trouble when comparing different models, as the user needs to take care to special-case bilinear learners, for which large values of score_pairs() mean closer points.
  3. Don't inherit BilinearMixin from BaseMetricLearner, since a bilinear similarity is not a metric and inheriting would be semantically incorrect; create a separate class/part of the code for similarity learners.
  4. Other options I don't see right now.

This is also a concern for the classifiers, as they rely on score_pairs() in decision_function() (and consequently in predict()) for verifying triplets/quadruplets, and for measuring distances for pairs.

For example, if we choose 2) and leave score_pairs() unchanged for bilinear similarity, we will have to use isinstance() in the classifiers to distinguish MahalanobisMixin from BilinearMixin. Another option is to build nearly identical classifiers exclusively for BilinearMixin, duplicating a lot of code.

The decision made now may have many implications for the future, especially if more bilinear learner implementations arrive, or if other similarities besides the bilinear one (and learners for them) are added.

My humble opinion: I believe it's more important to keep the global semantics of score_pairs(), even when the "metric" learned is in fact a similarity. I vote for the -1 multiplier in BilinearMixin, trusting that the final user will see a big, clear warning on the bilinear learners. Maybe also implement an extra function that returns the original similarities (without the -1 multiplier), although that idea may be overkill.

PS: By "semantics" I mean the meaning of what the function is intended to do. For instance, the current semantics of score_pairs() are that the lower the output value, the closer the pair.

wdevazelhes commented 3 years ago

> [@mvargas33's comment above, quoted in full]

@mvargas33 I agree with you; IMHO it would be good to reuse the existing code and classes/methods as much as possible:

(Note: also regarding the name pairs_remoteness, I guess we could either have a generic name like this and say for Mahalanobis learners that it is equal to the true distance, OR have a precise name like pairs_distance and add a warning saying that for Bilinear learners it is not a true distance (in the mathematical sense, since (- bilinear similarity) can be negative); I guess either way works.)

NB: Probably the get_metric function would need to be adapted too, in the same way as score_pairs?

mvargas33 commented 3 years ago

@wdevazelhes

I like the idea of pairs_remoteness and pairs_similarity, because it adds a layer of abstraction, making the code more generic and able to support metrics and similarities without complicating things too much. No change would be needed in the classifiers that way.

Even if this looks like overkill right now, it guarantees that the code will be scalable and easily extensible in the future. For instance, imagine there are 5 bilinear similarity learners at some point: how does that fit with the -1 multiplier + warning solution? To me that sounds more like a patch.

Or what if there's support for data-structure similarities in the future, like tree similarities or string similarities? These additional similarity types could be handled easily with pairs_remoteness and pairs_similarity, no matter what the algorithm actually computes in the end: closeness or remoteness.

I also like the idea of renaming BaseMetricLearner to BaseLearner, again because the latter is more abstract and semantically correct for both metrics and similarities.
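A rough sketch of what such an abstraction could look like; all class and method names here are hypothetical, not the final API:

```python
import numpy as np

class BaseLearner:
    """Hypothetical base class: every learner exposes a score where
    higher always means more similar (names are illustrative only)."""
    def pair_score(self, pairs):
        raise NotImplementedError

class MahalanobisLike(BaseLearner):
    def __init__(self, M):
        self.M = M
    def pair_distance(self, pairs):
        # d(u, v) = sqrt((u - v)^T M (u - v))
        diff = pairs[:, 0] - pairs[:, 1]
        return np.sqrt(np.einsum('ij,jk,ik->i', diff, self.M, diff))
    def pair_score(self, pairs):
        # Negate distances so that higher score = more similar.
        return -self.pair_distance(pairs)

class BilinearLike(BaseLearner):
    def __init__(self, W):
        self.W = W
    def pair_score(self, pairs):
        # S(u, v) = u^T W v already follows the "higher = closer" convention.
        u, v = pairs[:, 0], pairs[:, 1]
        return np.einsum('ij,jk,ik->i', u, self.W, v)
    # Deliberately no pair_distance: a bilinear similarity is not a
    # (pseudo) distance in general.
```

With this layout, classifiers can call pair_score uniformly, while distance-specific code stays available only where it is mathematically meaningful.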

wdevazelhes commented 3 years ago

@mvargas33 That sounds good! Yes, that's right, having two general functions like this would make the code quite generic and extensible.

bellet commented 3 years ago

Hi, thanks @mvargas33 and @wdevazelhes for kicking off the discussion. Here are some comments:

Would be great to hear @perimosocordiae and @terrytangyuan's take on this!

wdevazelhes commented 3 years ago

@bellet Yes, I agree the deprecation sounds good, to force users to change their code. I also agree that one general method that every learner would implement (like similarity, which better respects the meaning of a "score") may be enough for generic code (for us, and for our users). And then a method like pairwise_distances for (pseudo-)distance metric learners is indeed nice to have (scikit-learn even has a function with that name: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)

NB: this is also another scikit-learn doc that may be useful: https://scikit-learn.org/stable/modules/metrics.html#metrics

bellet commented 3 years ago

@wdevazelhes the thing is that a bilinear similarity is not guaranteed to yield a data transformation (and thus to be a valid kernel). This is only true when the matrix is PSD, which is often not enforced by similarity learning algorithms, to avoid the computational cost of projecting onto the PSD cone. So basically, a bilinear metric learner will not always be a transformer or a valid kernel (and in practice may often not be). We could still implement transform and raise an error when the matrix is not PSD, but this might be confusing for users?
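For illustration, a minimal sketch of such a PSD check (a hypothetical helper, not code from this PR):

```python
import numpy as np

def check_psd(W, tol=1e-10):
    """Return True if W is symmetric positive semidefinite (up to tol).

    Hypothetical helper: a bilinear learner could call this before
    exposing transform(), and raise a clear error otherwise.
    """
    if not np.allclose(W, W.T, atol=tol):
        return False
    # eigvalsh assumes a symmetric input; symmetrize to be safe.
    return np.linalg.eigvalsh((W + W.T) / 2).min() >= -tol

psd_matrix = np.array([[2.0, 0.0], [0.0, 1.0]])    # eigenvalues 2, 1
skew_matrix = np.array([[0.0, 1.0], [-1.0, 0.0]])  # not even symmetric
```

When the check passes, a Cholesky-style factorization of W would give the usual linear transformation; when it fails, transform() simply has no valid definition.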

@mvargas33 at this point of the discussion I think we all agree that it is a good idea to:

  • deprecate the score_pairs method
  • add instead a pairwise_similarities method which follows the convention (the higher, the more similar)
  • add a pairwise_distances method which is only implemented for metric learners that learn an actual (pseudo) distance, i.e. implemented for Mahalanobis but not bilinear similarity.

So perhaps you can go ahead with the above changes, and we can continue the discussion a bit regarding what to do for bilinear similarity (do we want specific behavior for when the matrix is actually PSD?). @terrytangyuan @perimosocordiae your opinion is most welcome!

wdevazelhes commented 3 years ago

> @wdevazelhes the thing is that a bilinear similarity is not guaranteed to yield a data transformation (and thus be a valid kernel). [...]
>
> @mvargas33 at this point of the discussion I think we all agree that it is a good idea to:
>
> • deprecate the score_pairs method
> • add instead a pairwise_similarities method which follows the convention (the higher, the more similar)
> • add a pairwise_distances method which is only implemented for metric learners that learn an actual (pseudo) distance, i.e. implemented for Mahalanobis but not bilinear similarity.
>
> So perhaps you can go ahead with the above changes [...]

@bellet that's right, sorry! I didn't realize the PSD condition is not necessarily verified here. I agree with the points above. For the specific case where the matrix is actually PSD, yes, I agree that may confuse users a bit; it may feel weird to have methods that sometimes work and sometimes (often) don't.
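The agreed deprecation could look roughly like the following sketch; the class, the method bodies, and the message text are illustrative only:

```python
import warnings

class LearnerSketch:
    """Minimal sketch of the deprecation pattern (illustrative only)."""

    def pair_score(self, pairs):
        # New convention: the higher the score, the more similar the pair.
        return [-abs(a - b) for a, b in pairs]

    def score_pairs(self, pairs):
        # Old entry point kept alive, but loudly deprecated.
        warnings.warn("score_pairs is deprecated; use pair_score (or "
                      "pair_distance for distance learners) instead.",
                      DeprecationWarning)
        return self.pair_score(pairs)
```

Users calling the old method keep working code for a release cycle while being nudged toward the new, unambiguous names.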

bellet commented 3 years ago

This is always a feature we can add later if we feel it is needed. @mvargas33 I think we have some kind of consensus now so you may go ahead and start implementing these changes

mvargas33 commented 3 years ago

> This is always a feature we can add later if we feel it is needed. @mvargas33 I think we have some kind of consensus now so you may go ahead and start implementing these changes

Yeah, if needed in the future, this particular feature should not take too long to implement.

Btw, I already implemented these changes in #333. All tests pass. The deprecation warning is shown when the user directly calls score_pairs. Classifiers use pair_distance now, and in the docs all practical usage of score_pairs was replaced with pair_distance.

mvargas33 commented 2 years ago

@bellet @perimosocordiae @wdevazelhes @terrytangyuan

This branch is ready for a code review. It would be cool if you could spare some time to look at it.

Here's a summary of the changes:

And the summary for the tests:

And after digging deep into the tests these days: the parametrization list mentioned in #136 is already live for most tests, but parametrization for metric_learn_test, test_fit_transform and test_transformer_metric_conversion is missing.

I think that parametrizing fit_transform is not that hard with the lists as they are, and with build_dataset already in test_utils.py. But I'll prioritize #330 for now, the PR just after this one.

Best! 😁

bellet commented 2 years ago
> * There are also some tests that are identical for Mahalanobis and Bilinear learners, but they are in their respective test files. Should we have a test file for tests regarding all kinds of learners? Right now there is none, but all learners are tested in the classifier tests. Take as examples: `test_pair_score_dim` and `test_deprecated_score_pairs_same_result`.

Yes, tests that are common to Mahalanobis and Bilinear learners (and thus most likely to any metric learner) should not be duplicated; they should be placed in test_base_metric.py instead.

mvargas33 commented 2 years ago

> Here's a review for the documentation part.
>
> Generally we need to change the vocabulary a bit to accommodate both Mahalanobis distances and bilinear similarities. As is common in the literature, I propose to use the term "metric" as a generic term (encompassing both distances and similarities).
>
> In addition to the changes mentioned below, some changes are needed at the beginning of "What is metric learning?":
>
> Sec 1
>
> * Many approaches in machine learning require a measure of distance (or similarity) between data points.
> * choose a standard distance metric --> choose a standard metric
> * Distance metric learning (or simply, metric learning) --> Metric learning
> * task-specific distance metrics --> task-specific metrics
> * The learned distance metric --> the learned metric
>
> Sec 1.1:
>
> * the goal in this setting is to learn a distance metric --> a metric
> * the goal is to learn a distance metric --> a metric
> * find the parameters of a distance function --> of a metric function
>
> Please do spellcheck to avoid typos!

All observations were resolved for introduction.rst, supervised.rst and weakly_supervised.rst.

Will run the spellchecker soon as well as moving the common tests to test_base_metric.py

bellet commented 2 years ago

@wdevazelhes @perimosocordiae @terrytangyuan it would be great if one of you can review this PR as well in the next couple of weeks (@mvargas33 will be off for ~2 weeks - so he can finalize the PR once he is back)

perimosocordiae commented 2 years ago

I resolved merge conflicts, but there are still a bunch of review items to address here.