scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License
1.4k stars · 233 forks

[MRG] Bilinear similarity #329

Open mvargas33 opened 3 years ago

mvargas33 commented 3 years ago

Hi!

With the aim of implementing the OASIS algorithm, I'm opening this PR to discuss the implementation of the BilinearMixin class, which is a prerequisite for the algorithm.

I've written some tests, but in a file outside the test folder, as I am not yet familiar with the correct way to set up tests here. These are basic checks of the bilinear similarity with random and handmade toy examples.

As the matrix W in the bilinear similarity is not guaranteed to be symmetric or PSD, I did not extend the MetricTransformer class and did not implement the transform() method.

Also, for score_pairs() I propose two implementations, and I am not sure which one performs better.
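For context, here is a minimal sketch of two interchangeable ways such a bilinear score_pairs() could be written; the function names are hypothetical, not the PR's actual code:

```python
import numpy as np

def score_pairs_loop(W, pairs):
    """Bilinear scores u^T W v, computed one pair at a time."""
    return np.array([u.dot(W).dot(v) for u, v in pairs])

def score_pairs_einsum(W, pairs):
    """The same scores in a single vectorized einsum call."""
    u, v = pairs[:, 0], pairs[:, 1]
    return np.einsum('ij,jk,ik->i', u, W, v)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))          # not symmetric, not PSD in general
pairs = rng.normal(size=(5, 2, 3))   # 5 pairs of 3-dimensional vectors
assert np.allclose(score_pairs_loop(W, pairs), score_pairs_einsum(W, pairs))
```

The einsum version avoids the Python-level loop, so it is usually the faster of the two for large batches of pairs.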

Note: along the way this fixes #160

bellet commented 3 years ago

You should also ensure compatibility with the pairs, triplets and quadruplets classifiers, to make sure we can then build actual bilinear similarity learning algorithms. Probably many existing tests for these will naturally pass with bilinear similarity, while others may need to be adapted.

mvargas33 commented 3 years ago

@bellet @perimosocordiae @wdevazelhes @terrytangyuan

There needs to be a broader discussion regarding adding BilinearSimilarity and Bilinear Similarity Learners such as OASIS.

By definition, a bilinear similarity works in the opposite direction from a metric such as the Mahalanobis distance.

Given two vectors u, v in R^n, if their bilinear similarity S(u,v) is large and positive, the vectors are close. Conversely, if S(u,v) is negative, the vectors point in different directions and are far apart. This contrasts with a metric, where a large value means the vectors are far apart.
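A tiny numeric illustration of this sign convention (toy values, not taken from the PR):

```python
import numpy as np

W = np.eye(2)  # identity W makes S(u, v) the plain dot product
u = np.array([1.0, 0.0])

v_close = np.array([0.9, 0.1])   # nearly aligned with u
v_far = np.array([-1.0, 0.0])    # opposed to u

S = lambda a, b: a @ W @ b       # bilinear similarity S(u, v) = u^T W v

print(S(u, v_close))             # 0.9  -> similar pair, high score
print(S(u, v_far))               # -1.0 -> dissimilar pair, negative score
print(np.linalg.norm(u - v_close))  # ~0.14 -> similar pair, *low* distance
```

So a classifier consuming raw similarities must treat high values as "close", the exact opposite of what it does with distances.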

Regarding the API, this means that score_pairs() has opposite semantics for MahalanobisMixin and BilinearMixin, even though both implement it and both inherit from BaseMetricLearner.

There are several paths that can be followed to deal with this:

  1. Multiply the result of score_pairs() by -1 in BilinearMixin, so that the global semantics of the function remain the same across the whole package, and add a warning in the documentation.
  2. Leave score_pairs() unchanged and return the raw bilinear similarity. This may cause trouble when comparing different models, as the user needs to take care to special-case bilinear learners, for which large values of score_pairs() mean closer points.
  3. Don't inherit BilinearMixin from BaseMetricLearner, since a bilinear similarity is not a metric and inheriting would be semantically incorrect; create a separate class/part of the code for similarity learners.
  4. Other options I don't see right now.

This is also a concern for the classifiers, as they rely on score_pairs() in decision_function() (and consequently in predict()) for verifying triplets/quadruplets, and for measuring distances for pairs.

For example, if we choose 2) and leave score_pairs() unchanged for bilinear similarity, we will have to use isinstance() in the classifiers to distinguish MahalanobisMixin from BilinearMixin. Another option is to build nearly identical classifiers exclusively for BilinearMixin, duplicating a lot of code.

The decision made now may have many implications for the future, especially if more bilinear learner implementations arrive, or if other similarities besides the bilinear one (and learners for them) are added.

My humble opinion: I believe it's more important to keep the global semantics of score_pairs(), even when the "metric" learned is in fact a similarity. I vote for the -1 multiplier in BilinearMixin, trusting that the final user will see a big, clear warning on the bilinear learners. Maybe also implement an extra function that returns the original similarities (without the -1 multiplier), although that idea may be overkill.

PS: By "semantics" I mean the meaning of what the function is intended to do. For instance, the current semantics of score_pairs() are that the lower the output value, the closer the pair.

wdevazelhes commented 3 years ago

> [@mvargas33's comment above, quoted in full]

@mvargas33 I agree with you; IMHO it would be good to reuse the existing code and classes/methods as much as possible:

(Note: also regarding the name pairs_remoteness, I guess we could either have a generic name like this and say for Mahalanobis learners that it is equal to the true distance, OR have a precise name like pairs_distance and add a warning saying that for Bilinear learners it is not a true distance (in the mathematical sense, since (- bilinear similarity) can be negative); I guess either way works.)

NB: Probably the get_metric function would need to be adapted too, in the same way as score_pairs?

mvargas33 commented 3 years ago

@wdevazelhes

I like the idea of pairs_remoteness and pairs_similarity, because it adds a layer of abstraction, making the code more generic and able to support metrics and similarities without complicating things too much. No change would be needed in the classifiers that way.

Even if this looks like overkill right now, it guarantees that the code will be scalable and easily extensible in the future. For instance, imagine there are 5 bilinear similarity learners at some point: how does that fit with the -1 multiplier + warning solution? To me that sounds more like a patch.

Or what if there's support for data-structure similarities in the future, like tree similarities or string similarities? These additional similarity types could be handled easily with pairs_remoteness and pairs_similarity, no matter what the algorithm actually computes in the end: closeness or remoteness.

I also like the idea of renaming BaseMetricLearner to BaseLearner, again because the latter is more abstract and semantically correct for both metrics and similarities.
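A rough sketch of what such an abstraction could look like; all class and method names here are hypothetical, not the final API:

```python
import numpy as np

class BaseLearner:
    """Hypothetical base class: every learner exposes a score where
    higher always means more similar (names are illustrative only)."""
    def pair_score(self, pairs):
        raise NotImplementedError

class MahalanobisLike(BaseLearner):
    def __init__(self, M):
        self.M = M
    def pair_distance(self, pairs):
        # d(u, v) = sqrt((u - v)^T M (u - v))
        diff = pairs[:, 0] - pairs[:, 1]
        return np.sqrt(np.einsum('ij,jk,ik->i', diff, self.M, diff))
    def pair_score(self, pairs):
        # Negate distances so that higher score = more similar.
        return -self.pair_distance(pairs)

class BilinearLike(BaseLearner):
    def __init__(self, W):
        self.W = W
    def pair_score(self, pairs):
        # S(u, v) = u^T W v already follows the "higher = closer" convention.
        u, v = pairs[:, 0], pairs[:, 1]
        return np.einsum('ij,jk,ik->i', u, self.W, v)
    # Deliberately no pair_distance: a bilinear similarity is not a
    # (pseudo) distance in general.
```

With this layout, classifiers can call pair_score uniformly, while distance-specific code stays available only where it is mathematically meaningful.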

wdevazelhes commented 3 years ago

@mvargas33 That sounds good! Yes, that's right, having two general functions like this would make the code quite generic and extensible.

bellet commented 3 years ago

Hi, thanks @mvargas33 and @wdevazelhes for kicking off the discussion. Here are some comments:

Would be great to hear @perimosocordiae and @terrytangyuan's take on this!

wdevazelhes commented 3 years ago

@bellet Yes, I agree the deprecation sounds good, to force users to change their code. I also agree that one general method that every learner would implement (like similarity, which better respects the meaning of a "score") may be enough for generic code (for us, and for our users). And then a method like pairwise_distances for (pseudo-)distance metric learners is indeed nice to have (scikit-learn even has a function with that name: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)

NB: this is also another scikit-learn doc that may be useful: https://scikit-learn.org/stable/modules/metrics.html#metrics

bellet commented 3 years ago

@wdevazelhes the thing is that a bilinear similarity is not guaranteed to yield a data transformation (and thus to be a valid kernel). This is only true when the matrix is PSD, which is often not enforced by similarity learning algorithms, to avoid the computational cost of projecting onto the PSD cone. So basically, a bilinear metric learner will not always be a transformer or a valid kernel (and in practice may often not be). We could still implement transform and raise an error when the matrix is not PSD, but this might be confusing for users?
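For illustration, a minimal sketch of such a PSD check (a hypothetical helper, not code from this PR):

```python
import numpy as np

def check_psd(W, tol=1e-10):
    """Return True if W is symmetric positive semidefinite (up to tol).

    Hypothetical helper: a bilinear learner could call this before
    exposing transform(), and raise a clear error otherwise.
    """
    if not np.allclose(W, W.T, atol=tol):
        return False
    # eigvalsh assumes a symmetric input; symmetrize to be safe.
    return np.linalg.eigvalsh((W + W.T) / 2).min() >= -tol

psd_matrix = np.array([[2.0, 0.0], [0.0, 1.0]])    # eigenvalues 2, 1
skew_matrix = np.array([[0.0, 1.0], [-1.0, 0.0]])  # not even symmetric
```

When the check passes, a Cholesky-style factorization of W would give the usual linear transformation; when it fails, transform() simply has no valid definition.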

@mvargas33 at this point of the discussion I think we all agree that it is a good idea to:

  • deprecate the score_pairs method
  • add instead a pairwise_similarities method which follows the convention (the higher, the more similar)
  • add a pairwise_distances method which is only implemented for metric learners that learn an actual (pseudo) distance, i.e. implemented for Mahalanobis but not bilinear similarity.

So perhaps you can go ahead with the above changes, and we can continue the discussion a bit regarding what to do for bilinear similarity (do we want specific behavior for when the matrix is actually PSD?). @terrytangyuan @perimosocordiae your opinion is most welcome!

wdevazelhes commented 3 years ago

> @wdevazelhes the thing is that a bilinear similarity is not guaranteed to yield a data transformation (and thus be a valid kernel). [...]
>
> @mvargas33 at this point of the discussion I think we all agree that it is a good idea to:
>
> • deprecate the score_pairs method
> • add instead a pairwise_similarities method which follows the convention (the higher, the more similar)
> • add a pairwise_distances method which is only implemented for metric learners that learn an actual (pseudo) distance, i.e. implemented for Mahalanobis but not bilinear similarity.
>
> So perhaps you can go ahead with the above changes [...]

@bellet that's right, sorry! I didn't realize the PSD condition is not necessarily verified here. I agree with the points above. For the specific case where the matrix is actually PSD, yes, I agree that may confuse users a bit; it may feel weird to have methods that sometimes work and sometimes (often) don't.
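The agreed deprecation could look roughly like the following sketch; the class, the method bodies, and the message text are illustrative only:

```python
import warnings

class LearnerSketch:
    """Minimal sketch of the deprecation pattern (illustrative only)."""

    def pair_score(self, pairs):
        # New convention: the higher the score, the more similar the pair.
        return [-abs(a - b) for a, b in pairs]

    def score_pairs(self, pairs):
        # Old entry point kept alive, but loudly deprecated.
        warnings.warn("score_pairs is deprecated; use pair_score (or "
                      "pair_distance for distance learners) instead.",
                      DeprecationWarning)
        return self.pair_score(pairs)
```

Users calling the old method keep working code for a release cycle while being nudged toward the new, unambiguous names.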

bellet commented 3 years ago

This is always a feature we can add later if we feel it is needed. @mvargas33 I think we have some kind of consensus now so you may go ahead and start implementing these changes

mvargas33 commented 3 years ago

> This is always a feature we can add later if we feel it is needed. @mvargas33 I think we have some kind of consensus now so you may go ahead and start implementing these changes

Yeah, if needed in the future, this particular feature should not take too long to implement.

Btw, I already implemented these changes in #333. All tests pass. The deprecation warning is shown when the user directly calls score_pairs. Classifiers use pair_distance now, and in the docs all practical usage of score_pairs was replaced with pair_distance.

mvargas33 commented 2 years ago

@bellet @perimosocordiae @wdevazelhes @terrytangyuan

This branch is ready for a code review. It would be cool if you could spare some time to look at it.

Here's a summary of the changes:

And the summary for the tests:

And after digging deep into the tests these days: the parametrization list mentioned in #136 is already live for most tests, but parametrization for metric_learn_test, test_fit_transform and test_transformer_metric_conversion is missing.

I think that parametrizing fit_transform is not that hard with the lists as they are, and with build_dataset already in test_utils.py. But I'll prioritize #330 for now, the PR just after this one.

Best! 😁

bellet commented 2 years ago
> * There are also some tests that are identical for Mahalanobis and Bilinear learners, but they are in their respective test files. Should we have a test file for tests regarding all kinds of learners? Right now there is none, but all learners are tested in the classifier tests. Take as examples: `test_pair_score_dim` and `test_deprecated_score_pairs_same_result`.

Yes, tests that are common to Mahalanobis and Bilinear learners (and thus most likely to any metric learner) should not be duplicated; they should be placed in test_base_metric.py instead.

mvargas33 commented 2 years ago

> Here's a review for the documentation part.
>
> Generally we need to change the vocabulary a bit to accommodate both Mahalanobis distances and bilinear similarities. As is common in the literature, I propose to use the term "metric" as a generic term (encompassing both distances and similarities).
>
> In addition to the changes mentioned below, some changes are needed at the beginning of "What is metric learning?":
>
> Sec 1
>
> * Many approaches in machine learning require a measure of distance (or similarity) between data points.
> * choose a standard distance metric --> choose a standard metric
> * Distance metric learning (or simply, metric learning) --> Metric learning
> * task-specific distance metrics --> task-specific metrics
> * The learned distance metric --> the learned metric
>
> Sec 1.1:
>
> * the goal in this setting is to learn a distance metric --> a metric
> * the goal is to learn a distance metric --> a metric
> * find the parameters of a distance function --> of a metric function
>
> Please do spellcheck to avoid typos!

All observations were resolved for introduction.rst, supervised.rst and weakly_supervised.rst.

Will run the spellchecker soon as well as moving the common tests to test_base_metric.py

bellet commented 2 years ago

@wdevazelhes @perimosocordiae @terrytangyuan it would be great if one of you can review this PR as well in the next couple of weeks (@mvargas33 will be off for ~2 weeks - so he can finalize the PR once he is back)

perimosocordiae commented 2 years ago

I resolved merge conflicts, but there are still a bunch of review items to address here.