scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License

New LMNN version #309

Open johny-c opened 3 years ago

johny-c commented 3 years ago

Hi all,

this is a PR to replace LMNN. I submitted this version to scikit-learn some time ago (see https://github.com/scikit-learn/scikit-learn/pull/8602) and @bellet suggested submitting it here too. I removed some of the bells and whistles from the scikit-learn submission to make the code more readable and easier to maintain, while following the conventions of metric-learn.

There are currently 9 tests failing, all on the same test function, namely the components initialization via PCA or LDA. I'm not sure what the appropriate way to handle these failures in metric-learn is. I'm guessing other algorithms have faced the same failures, so I'm starting this PR as is and awaiting comments.

FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-3-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-5-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-7-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-11-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-5-7-7] - ValueError: n_components=5 must be between 0 and min(n_samples, n_features)=4 with svd_solver='full'
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-5-11-7] - ValueError: n_components=5 must be between 0 and min(n_samples, n_features)=4 with svd_solver='full'
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-5-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-7-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-11-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
wdevazelhes commented 3 years ago

Thanks for the PR @johny-c ! 👍

The failures seem to happen because test_auto_init_transformation creates toy examples with very few samples, some of which contain singleton classes. These are removed by the method _validate_params from this PR, so the data no longer has the right format (e.g. number of classes) for initializations like LDA.

Putting the line X, y, classes = self._validate_params(X, y) just after self.components_ = _initialize_components... instead of after X, y = self._prepare_inputs(X.... seems to solve the problem. That is, initializing components_ with the original data, and pruning it only for the processing done by LMNN itself. This would slightly change the behaviour of the initialization (which, after this modification, would take the singleton classes into account). But I think it makes sense: if we are able to fit such an init with singleton classes, why not do so?
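Concretely, the reordering I have in mind would look something like this (a rough sketch, with abbreviated argument lists, not the exact code of this PR):

def fit(self, X, y):
    X, y = self._prepare_inputs(X, y)

    # 1) Initialize components_ with the ORIGINAL data, so that an
    #    init such as 'lda' still sees the singleton classes ...
    self.components_ = _initialize_components(self.n_components, X, y,
                                              init=self.init)

    # 2) ... and only then prune singleton classes, for the processing
    #    done by LMNN itself (target neighbors, impostors, etc.)
    X, y, classes = self._validate_params(X, y)
    # ... rest of fit unchanged ...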

That said, I'm wondering if we shouldn't also take singleton classes into account in LMNN itself? We can't form similar pairs from those samples (since there's just one), but we could maybe use them in dissimilar pairs, no? (If we don't take them into account, nothing prevents those singletons from sitting inside a cluster of points from a different class.)

PS: the CI tests are failing because we need to update the repo to match the new scikit-learn API (see #311); we'll fix that soon.

johny-c commented 3 years ago

Hi @wdevazelhes,

thanks for spotting the issue. I think the first solution would be best, namely initializing the components with all samples and then removing the singleton-class samples. This can only affect the algorithm negatively if the initialization uses class labels, which so far means only LDA.

Regarding taking singleton classes into account in LMNN as negative samples: I guess this could be done, but I think it would make the algorithm and code more complex, and I'm uncertain about any potential gains. Arguably, if one intends to use an algorithm that works with triplets, they would make sure that each class has at least 2 samples. I would even argue that throwing an exception when this is not satisfied would make sense. In the scikit-learn PR, the solution suggested and implemented was to simply remove the singleton samples from the start.

wdevazelhes commented 3 years ago

@johny-c, thanks for updating the PR. I should have a bit more time by the middle of next week to fix #311, so that we can see the checks here go green.

About taking singleton classes into account as negative examples: I'm thinking of use cases like person re-identification, for which many users rely on metric-learn (https://github.com/scikit-learn-contrib/metric-learn/network/dependents?package_id=UGFja2FnZS01MjI5NjIxOA%3D%3D). In those cases, I think it would still be useful to take singleton classes into account, because there can be a lot of them: for instance, the "Labeled Faces in the Wild" dataset contains 5749 people in total but only 1680 people with two or more images (see http://vis-www.cs.umass.edu/lfw/#information). Dropping singleton classes would therefore mean dropping 4069 samples... I fear that in this case it could cause a significant drop in performance...

johny-c commented 3 years ago

@wdevazelhes, that's a good point. In practice, this can be done by letting classes be only the non-singleton classes when iterating over them in find_impostors, and considering impostor candidates with y != class_id instead of y > class_id. The issue with this change is efficiency: I tried this in an early version of the scikit-learn PR and it makes the code about half as fast (which makes sense). Maybe this should be controlled by another user parameter (e.g. use_singletons)? I'm not sure.
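Roughly, the difference is the following (a simplified sketch, not the actual blockwise code of this PR):

import numpy as np

def iter_impostor_candidates(y, nonsingleton_classes):
    for class_id in nonsingleton_classes:
        ind_in = np.where(y == class_id)[0]
        # Current version: each pair of classes is visited once.
        # ind_out = np.where(y > class_id)[0]
        # Singleton-aware version: singleton samples appear as
        # candidates too, but each pair of non-singleton classes is
        # now visited twice, hence roughly half the speed.
        ind_out = np.where(y != class_id)[0]
        yield ind_in, ind_out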

wdevazelhes commented 3 years ago

@johny-c good question, I guess it would indeed be problematic if LMNN becomes too slow to run.

But I think that dropping only singleton classes would modify the training data in a biased way (i.e. the modified training data would no longer be a true i.i.d. "subsample" of the original data), so maybe we would rather want to train on a random subsample of the original data? If so, we could leave LMNN as-is (the version that keeps singleton classes) and let users subsample their data themselves.

However, this is maybe not ideal either: if there were a lot of classes with just two elements, they could become singleton classes after subsampling, and we might lose many points for which we can form target neighbors...

In the end, in order to keep the code simple and the data as close to the original as possible, I'm thinking maybe we could raise a warning when there are a lot of singleton classes, something like: The dataset contains many singleton classes (XX, or YY% of the dataset), which will only be used when computing impostors. Training may be longer because of them; you may want to drop some of them to make training faster. Then it's up to the users to apply their own strategy for dropping singleton classes?
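As a sketch of what I mean (hypothetical function and variable names):

import warnings
import numpy as np

def warn_if_many_singletons(y_inds):
    """y_inds: labels encoded as integers 0..n_classes-1 (hypothetical)."""
    class_sizes = np.bincount(y_inds)
    n_singletons = int(np.sum(class_sizes == 1))
    if n_singletons:
        pct = 100.0 * n_singletons / len(y_inds)
        warnings.warn(
            "The dataset contains many singleton classes ({}, or {:.1f}% "
            "of the dataset), which will only be used when computing "
            "impostors. Training may be longer because of them; you may "
            "want to drop some of them to make training faster."
            .format(n_singletons, pct))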

But I'm not really sure...

@bellet, @perimosocordiae , @terrytangyuan , @nvauquie , what do you think about this ?

terrytangyuan commented 3 years ago

Yes, I think a warning is fine in this case to keep the code less complex. We should probably not modify the original data, and should leave it up to the user to clean the data and decide which strategy to use.

angelotc commented 3 years ago

Hello! When trying your PR, I get the following error. Ran from Google Colab:

Code:

from sklearn.datasets import make_friedman1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install git+https://github.com/johny-c/metric-learn.git

def friedman_np_to_df(X,y):
  return pd.DataFrame(X,columns=['x0','x1', 'x2', 'x3', 'x4']), pd.Series(y)

# Make training set. We don't care about Y so call it NA.
X_train, NA = make_friedman1(n_samples=1000, n_features=5, random_state = 1) 
X_train, NA = friedman_np_to_df(X_train,NA)

# Categorize the training set based on x0
domain_list = []
for i in range(len(X_train)):
  if X_train.iloc[i]['x0'] < 0.6 :
    domain_list.append(1)
  else:
    domain_list.append(0)

X_train['domain'] = domain_list

# Training set: samples where domain == 1 (x0 < 0.6), plus ~60 out-of-domain samples (X_train['domain'] == 0)

out_of_domain = X_train[X_train['domain'] == 0][:60]
X_train =  X_train[X_train['domain']==1]
X_train = pd.concat([out_of_domain, X_train])
y_train = X_train.copy()
X_train = X_train.drop(columns = ['domain'])
y_train = y_train['domain']

# Make testing set with a different random_state
X_test, NA2 = make_friedman1(n_samples=1000, n_features=5, random_state = 3)
X_test, NA2 = friedman_np_to_df(X_test,NA2)

# Categorize the testing set based on x0
domain_list = []
for i in range(len(X_test)):
  if X_test.iloc[i]['x0'] < 0.6:
    domain_list.append(1)
  else:
    domain_list.append(0)
X_test['domain'] = domain_list

y_test = X_test['domain'].copy()
X_test = X_test.drop(columns = ['domain'])

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from metric_learn import LMNN
lmnn_knn = Pipeline(steps=[('lmnn', LMNN()), ('knn', KNeighborsClassifier())])
parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
              'lmnn__k':[2,3],
              'knn__n_neighbors':[2,3],
              'knn__weights': ['uniform','distance'],
              'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'knn__leaf_size': [28,30,32],
              'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
grid_lmnn_knn.score(np.array(X_test), np.array(y_test))

Output:

Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 17.6min
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed: 27.5min
[Parallel(n_jobs=-1)]: Done 1796 tasks      | elapsed: 39.5min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 50.6min finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-0d2c55d1bedc> in <module>()
     63               'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
     64 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 65 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
     66 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     67 

6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    734             # of the params are estimators as well.
    735             self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736                 **self.best_params_))
    737             refit_start_time = time.time()
    738             if y is not None:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     69     new_object_params = estimator.get_params(deep=False)
     70     for name, param in new_object_params.items():
---> 71         new_object_params[name] = clone(param, safe=False)
     72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     70     for name, param in new_object_params.items():
     71         new_object_params[name] = clone(param, safe=False)
---> 72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)
     74 

TypeError: __init__() got an unexpected keyword argument 'init'
johny-c commented 3 years ago

Hi @angelotc , thanks for trying this out. One issue that I can see is that the number-of-neighbors parameter is now called n_neighbors instead of k, to be consistent with scikit-learn naming (lmnn__k --> lmnn__n_neighbors).
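For instance:

from metric_learn import LMNN

# Old metric-learn name:                  LMNN(k=3)
# Name in this PR (scikit-learn style):
lmnn = LMNN(n_neighbors=3)

# and in a Pipeline grid, 'lmnn__k' becomes:
parameters = {'lmnn__n_neighbors': [2, 3]}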

angelotc commented 3 years ago

I did not realize that the parameters were renamed in this PR! Now I am getting the following, though init should be a parameter, no?

New param grid:

parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
              'lmnn__n_neighbors':[2,3],
              'knn__n_neighbors':[2,3],
              'knn__weights': ['uniform','distance'],
              'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'knn__leaf_size': [28,30,32],
              'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}

Output:


Collecting git+https://github.com/johny-c/metric-learn.git
  Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-fitppg1y
  Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-fitppg1y
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
  Building wheel for metric-learn (setup.py) ... done
  Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=5f4304f1854d9baf2addf29205be74f8606b616a37bb4654e9785382741e3806
  Stored in directory: /tmp/pip-ephem-wheel-cache-l_uxezae/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 18.8min
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed: 29.4min
[Parallel(n_jobs=-1)]: Done 1796 tasks      | elapsed: 42.3min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 54.2min finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6e7010734268> in <module>()
     65               'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
     66 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 67 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
     68 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     69 

6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    734             # of the params are estimators as well.
    735             self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736                 **self.best_params_))
    737             refit_start_time = time.time()
    738             if y is not None:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     69     new_object_params = estimator.get_params(deep=False)
     70     for name, param in new_object_params.items():
---> 71         new_object_params[name] = clone(param, safe=False)
     72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     70     for name, param in new_object_params.items():
     71         new_object_params[name] = clone(param, safe=False)
---> 72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)
     74 

TypeError: __init__() got an unexpected keyword argument 'init'
johny-c commented 3 years ago

Yes, it should. I'm not sure what's going wrong here. Have you tried isolating the problem by removing all the other parts, like GridSearchCV and Pipeline?

angelotc commented 3 years ago

Tried that!


!pip install git+https://github.com/johny-c/metric-learn.git
from metric_learn import LMNN
inits = ['pca', 'lda', 'identity', 'random']
for init_param in inits:
  lmnn = LMNN(init=init_param)

Same error.

Collecting git+https://github.com/johny-c/metric-learn.git
  Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-g4w95cib
  Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-g4w95cib
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
  Building wheel for metric-learn (setup.py) ... done
  Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=3a82cbd80fba81b5798d874527b3586c34e09488a2d11d28cafe649cb036c416
  Stored in directory: /tmp/pip-ephem-wheel-cache-bgggttl6/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-b9501fab5074> in <module>()
      4 inits = ['pca', 'lda', 'identity', 'random']
      5 for init_param in inits:
----> 6   lmnn = LMNN(init=init_param)
      7 
      8 

TypeError: __init__() got an unexpected keyword argument 'init'

Also, not sure if it's worth noting, but before that I was getting errors from the SDML import in __init__.py, so I just commented out the SDML import.

wdevazelhes commented 3 years ago

@johny-c #313 was merged, so if you merge with master the tests should go green now. Actually, I'm thinking we could maybe merge the PR as is (if everyone agrees) and deal with the singleton classes and the warning message mentioned above later, since the current LMNN also doesn't support singleton classes (see comment https://github.com/scikit-learn-contrib/metric-learn/pull/313#issuecomment-814762734):

the current LMNN is even stricter than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually handle more cases, and we could deal with singleton classes in another PR.

@angelotc thanks for investigating. I tried to run your snippet just above and realized the pip install actually didn't install this PR, but the master branch from @johny-c, which is an old metric-learn version (hence the bugs). After replacing your first line with !pip install git+https://github.com/johny-c/metric-learn.git@lmnn, it worked; maybe you can try it again on your full example?

johny-c commented 3 years ago

@wdevazelhes , that's great! I need to make a few changes before pushing - the max_impostors parameter is crucial for memory and should be included. I'll push within the next 2-3 days.

angelotc commented 3 years ago

@wdevazelhes Wow, thank you so much! Got around .93 AUC on my synthetic dataset example. Someone in my group also created a Mahalanobis-distance-based classifier with similar performance; I wanted to compare this to it, and it looks promising. Hopefully this will work well on our materials datasets.

angelotc commented 3 years ago

the current LMNN is even stricter than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually handle more cases, and we could deal with singleton classes in another PR.

You make a good point here about the minimum number of samples needed in each class. In my binary classification example, I initially tried training with 600 samples of class 1 and testing on 600 samples of class 1 and 400 samples of class 0. This doesn't work due to the nature of the algorithm, so I received an error. I then tried training on 600 samples of class 0 and ran additional simulations where I kept adding more samples of the other class; these were my results:

image

So for x = 0.01, this means that if there were 600 samples of class 0 in the training set, 1% of that many samples from the other class were also added to the training set (6 samples of class 1 in this case).
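For reference, the training sets were built roughly like this (a sketch with hypothetical variable names; the fractions shown are just examples):

import numpy as np
import pandas as pd

class0 = X_train_full[y_train_full == 0]     # 600 in-domain samples
class1 = X_train_full[y_train_full == 1]     # the other class

for x in [0.01, 0.05, 0.10, 0.25, 0.50]:     # fraction of class 0's size
    n_extra = int(x * len(class0))           # e.g. x = 0.01 -> 6 samples
    extra = class1.sample(n=n_extra, random_state=1)
    X_tr = pd.concat([class0, extra])
    y_tr = np.array([0] * len(class0) + [1] * n_extra)
    # ... fit the LMNN + KNN pipeline on (X_tr, y_tr) and record AUC ...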

johny-c commented 3 years ago

If you want to review again, it's ready from my side.

wdevazelhes commented 3 years ago

@angelotc I'm glad it worked quite well. Indeed, I guess negative pairs are very important for learning the decision boundary; interesting curve! It's nice to see you can get a good AUC with that few samples from the other class, though I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the train set; is that due to experimental constraints, for instance?)

wdevazelhes commented 3 years ago

If you want to review again, it's ready from my side.

Thanks a lot @johny-c ! I'll try to do a review this week

angelotc commented 3 years ago

@angelotc I'm glad it worked quite well. Indeed, I guess negative pairs are very important for learning the decision boundary; interesting curve! It's nice to see you can get a good AUC with that few samples from the other class, though I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the train set; is that due to experimental constraints, for instance?)

Hello, my academic research group is interested in uncertainty quantification in materials science, but the materials field has small sample sizes; the dataset that I am working with only has around ~400 in-domain samples and <10 out-of-domain samples. My predecessors have actually treated this as a regression problem, and that has been published here. We sort of treat it as an unsupervised learning problem, but my PI told me it would be fun to think of it as a supervised classification problem (which I am exploring with metric-learn). We actually had more success with just using the raw Mahalanobis distance of each test-set sample to the centroid of the training set's in-domain samples, but still, this was a fun journey.

image

angelotc commented 3 years ago

Coincidentally, this paper basically models what we are attempting to explore, except that our setting is the materials field rather than dialog systems (though I do work with dialog systems). Pretty funny that it was written by Huawei people as well, seeing as that is where you work.

wdevazelhes commented 3 years ago

@angelotc I see, thanks for the reference; this out-of-domain detection use case of metric learning sounds interesting indeed! By the way, we just opened the GitHub Discussions section of metric-learn, https://github.com/scikit-learn-contrib/metric-learn/discussions, so don't hesitate to use it if you have any remarks to discuss regarding metric-learn ;)

angelotc commented 3 years ago

Hello, I am trying this out on another dataset, for cuprate classification of superconductors. I'm running into an issue where the ind_a and ind_b arrays are getting corrupted. I believe this is happening at lines 525-526 of lmnn.py.

Here is the output from the error. I can share my notebook and dataset file via email, since I don't know if the prof wants me to share either publicly.

Fitting 2 folds for each of 288 candidates, totalling 576 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 368 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed:   25.8s finished
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-6c1595135885> in <module>()
     16 
     17 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='roc_auc')
---> 18 grid_lmnn_knn = grid_lmnn_knn.fit(np.array(X_train), np.array(y_train))
     19 roc_auc_score = grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     20 results.append(roc_auc_score)

14 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    737             refit_start_time = time.time()
    738             if y is not None:
--> 739                 self.best_estimator_.fit(X, y, **fit_params)
    740             else:
    741                 self.best_estimator_.fit(X, **fit_params)

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    348             This estimator
    349         """
--> 350         Xt, fit_params = self._fit(X, y, **fit_params)
    351         with _print_elapsed_time('Pipeline',
    352                                  self._log_message(len(self.steps) - 1)):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    313                 message_clsname='Pipeline',
    314                 message=self._log_message(step_idx),
--> 315                 **fit_params_steps[name])
    316             # Replace the transformer of the step with the fitted
    317             # transformer. This is necessary when loading the transformer

/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in fit(self, X, y)
    220     # Call the optimizer
    221     self.n_iter_ = 0
--> 222     opt_result = minimize(**optimizer_params)
    223 
    224     # Reshape the solution found by the optimizer

/usr/local/lib/python3.7/dist-packages/scipy/optimize/_minimize.py in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
    608     elif meth == 'l-bfgs-b':
    609         return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 610                                 callback=callback, **options)
    611     elif meth == 'tnc':
    612         return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,

/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
    343             # until the completion of the current minimization iteration.
    344             # Overwrite f and g:
--> 345             f, g = func_and_grad(x)
    346         elif task_str.startswith(b'NEW_X'):
    347             # new iteration

/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in func_and_grad(x)
    293     else:
    294         def func_and_grad(x):
--> 295             f = fun(x, *args)
    296             g = jac(x, *args)
    297             return f, g

/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
    325     def function_wrapper(*wrapper_args):
    326         ncalls[0] += 1
--> 327         return function(*(wrapper_args + args))
    328 
    329     return ncalls, function_wrapper

/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in __call__(self, x, *args)
     63     def __call__(self, x, *args):
     64         self.x = numpy.asarray(x).copy()
---> 65         fg = self.fun(x, *args)
     66         self.jac = fg[1]
     67         return fg[0]

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _loss_grad_lbfgs(self, L, X, y, classes, target_neighbors, pull_loss_grad_m)
    340     # Find impostors
    341     impostors_graph = self._find_impostors(X_embedded, y, classes,
--> 342                                            dist_tn[:, -1])
    343 
    344     # Compute the push loss and its gradient

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors(self, X_embedded, y, classes, margin_radii)
    428 
    429       ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430                                                 ind_b, ind_a)
    431 
    432       n_impostors = len(ind_impostors)

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors_blockwise(X, radii, ind_a, ind_b, return_distance, block_size)
    520   for chunk in gen_batches(n_samples_a, block_n_rows):
    521 
--> 522     dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
    523                                Y_norm_squared=X_b_norm_squared)
    524 

IndexError: index 947380 is out of bounds for axis 0 with size 5314

The array looks fine when output from _find_impostors, but not inside _find_impostors_blockwise, leading me to think it is being corrupted around lines 525-526:

> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(522)_find_impostors_blockwise()
    520   for chunk in gen_batches(n_samples_a, block_n_rows):
    521 
--> 522     dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
    523                                Y_norm_squared=X_b_norm_squared)
    524 

ipdb> ind_a
array([   2669,    3386,    5540,    8190,   21013,   22068,   23031,
         23576,   27355,   28453,   28894,   32398,   34464,   34579,
         35427,   35972,   41765,   43127,   43962,   46115,   46596,
         47108,   47729,   48406,   59039,   61196,   64114,   77970,
         78425,   82708,   86698,   90830,   92560,   93573,   94290,
         99473,  100018,  109465,  109803,  110348,  112846,  115199,
        117224,  117679,  121861,  121941,  130463,  131008,  135818,
        136273,  145117,  151123,  151668,  157679,  158160,  161074,
        165218,  166808,  167263,  169382,  171404,  171975,  172669,
        173073,  173514,  179810,  184376,  190675,  191392,  195356,
        197899,  201930,  202385,  202536,  206062,  206517,  207223,
        207934,  208636,  209163,  209289,  209857,  210000,  210702,
        214834,  217141,  217818,  224280,  224685,  227186,  231318,
        233428,  233955,  234081,  234649,  234792,  235053,  235494,
        236187,  236668,  238213,  238924,  239626,  240040,  241327,
        243714,  247382,  247837,  250483,  251581,  251759,  254088,
        255400,  257336,  257741,  261945,  263951,  265051,  265768,
        268042,  268497,  268885,  269430,  270951,  271496,  279215,
        279760,  288326,  292458,  295364,  297857,  298127,  299540,
        305110,  305225,  310476,  312271,  312816,  314079,  315132,
        316166,  317250,  318609,  319971,  328420,  330600,  332367,
        336220,  338794,  340958,  343003,  344056,  345625,  346342,
        348496,  358511,  358738,  361476,  361934,  361995,  363221,
        363357,  365144,  365599,  368371,  369784,  377785,  382148,
        386268,  395103,  395797,  396201,  402938,  404022,  405599,
        406080,  406592,  409956,  411974,  417829,  418927,  419466,
        422992,  423447,  427632,  429654,  430880,  431384,  433786,
        435388,  435843,  442050,  445759,  455408,  458619,  459275,
        460688,  463089,  463634,  469479,  470577,  471018,  475248,
        481304,  491050,  491165,  493744,  494079,  494624,  497832,
        497974,  504409,  504954,  507859,  507970,  508541,  509086,
        512673,  513218,  519229,  519710,  525786,  526012,  528238,
        528353,  529519,  530932,  531625,  532106,  534556,  535011,
        535539,  536901,  538211,  538253,  538480,  538967,  545350,
        546952,  547407,  549482,  549861,  550406,  552774,  553179,
        553756,  556002,  561414,  563143,  564641,  565209,  565352,
        566054,  567236,  568773,  569341,  569484,  570186,  571333,
        572267,  574701,  574971,  575539,  576384,  578785,  579330,
        581169,  581737,  581880,  582582,  582680,  585301,  586714,
        588736,  589163,  590218,  592284,  595505,  597379,  597924,
        601829,  603242,  605835,  606919,  606933,  607374,  609854,
        610823,  611141,  614225,  614793,  614936,  615638,  616820,
        618231,  618357,  618925,  619068,  619329,  619770,  622529,
        623010,  626303,  626848,  628111,  638364,  638839,  639445,
        640201,  645195,  645912,  649347,  649915,  650760,  651883,
        652110,  660462,  665178,  674924,  675039,  675887,  676432,
        679684,  680317,  681034,  684963,  685897,  686357,  688601,
        689169,  689312,  690014,  690112,  693262,  694481,  695026,
        697770,  698225,  702487,  703540,  705003,  705697,  706101,
        711327,  711895,  712038,  712740,  714762,  718430,  718885,
        723026,  723597,  723723,  724291,  724434,  724695,  725136,
        725226,  726002,  726283,  727217,  727835,  730826,  731281,
        732516,  735993,  736119,  736687,  736830,  737091,  737532,
        738165,  738655,  738882,  739218,  743222,  743677,  748495,
        749212,  752687,  753168,  754693,  755410,  757564,  757679,
        760255,  762851,  762977,  763545,  763688,  763949,  764390,
        764725,  765270,  768067,  772146,  772601,  777957,  779235,
        780248,  780290,  781253,  781798,  783367,  783637,  785050,
        787499,  788554,  789517,  790062,  791662,  791723,  792949,
        793085,  797183,  799847,  800392,  802211,  802928,  809214,
        809329,  813974,  814449,  815055,  815324,  816375,  816920,
        820249,  821771,  822122,  824304,  824779,  826141,  826705,
        827250,  828819,  829832,  829874,  830885,  831940,  834590,
        834969,  835514,  837175,  838537,  840832,  841359,  842443,
        842457,  845439,  846801,  847683,  849096,  850492,  850534,
        851497,  852042,  854410,  857174,  861448,  864639,  864681,
        864908,  866494,  867310,  867648,  869248,  869703,  872349,
        872475,  873043,  873186,  873447,  873888,  875446,  875901,
        878355,  878900,  882535,  882805,  883373,  884218,  886252,
        890153,  897247,  897964,  898300,  900233,  900725,  903339,
        904437,  910062,  910288,  914060,  915841,  916078,  917609,
        918154,  920487,  921421,  922270,  932957,  937264,  937306,
        944765,  945002,  945818,  947380,  949875,  952779,  953049,
        954462,  959227,  959464,  960280,  961187,  961313,  961881,
        962024,  962285,  962726,  963253,  963379,  963947,  964090,
        964351,  964792,  966188,  966230,  968411,  970990,  973012,
        973689,  974406,  975199,  983386,  984939,  990111,  990237,
        990805,  990948,  991650,  993208,  993663,  996309,  997407,
        997585,  998995,  999406,  999861,  999929, 1005604, 1006059,
       1006745, 1007462, 1011426, 1011831, 1012785, 1017995, 1026144,
       1026259, 1030020, 1034474, 1039124, 1039245, 1043714, 1045001,
       1046762, 1046804])
ipdb> ind_b
array([ 80584,  80713,  80727,  80830,  80989,  81022,  81155,  81156,
        81386,  81417,  81542,  81544,  81589,  81654,  81655,  81717,
        81731,  81740,  81744,  81770,  81875,  81893,  81935,  81995,
        82005,  82009,  82034,  82073,  82122,  82237,  82256,  82260,
        82290,  82307,  82310,  82318,  82429,  82563, 261040, 261945,
       447398, 448136, 455408, 506794, 525096, 525106, 525786, 525858,
       526012, 526196, 526639, 526686, 698398, 698430, 698545, 698596,
       698608, 698667, 698687, 698771, 698851, 698884, 699040, 699103,
       699122, 699132, 699153, 699199, 699350, 699399, 699437, 699715,
       699737, 699976, 699981, 700117, 700150, 700313, 822122, 909129,
       909443, 909907, 910062, 910127, 910288, 910806, 911031, 966188])
ipdb> up
> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(430)_find_impostors()
    428 
    429       ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430                                                 ind_b, ind_a)
    431 
    432       n_impostors = len(ind_impostors)

ipdb> ind_a
array([   1,    3,    5, ..., 5298, 5306, 5313])
ipdb> ind_b
array([   0,    2,    4, ..., 5310, 5311, 5312])
ipdb> 
johny-c commented 3 years ago

Hi @angelotc , thanks for catching this. I think I know what's wrong. I changed something in the last commit, but it's easy to fix.

wdevazelhes commented 3 years ago

@angelotc, did you try the new commit from @johny-c on your example again ? It would be interesting to know if it works now

angelotc commented 3 years ago

@angelotc, did you try the new commit from @johny-c on your example again ? It would be interesting to know if it works now

Yes, the algorithm performs well, with .98 AUC at distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it detect the risk of out-of-domain samples? Out-of-domain samples can be of different classes. In my experiments, Isolation Forest does not work well, since it is highly correlated with the Euclidean distance from the centroid. So we have started revisiting the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset, but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.

wdevazelhes commented 2 years ago

Yes, the algorithm performs well, with .98 AUC at distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it detect the risk of out-of-domain samples? Out-of-domain samples can be of different classes. In my experiments, Isolation Forest does not work well, since it is highly correlated with the Euclidean distance from the centroid. So we have started revisiting the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset, but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.

@angelotc Sorry for the very late reply. Indeed, the Covariance method can be interesting, since it returns a raw Mahalanobis metric computed from the training samples. So if it works well as a metric to the centroid in some cases, maybe it would even work well as a general metric (e.g. for computing distances of test samples to the closest sample from the training set, not necessarily the centroid, to see whether a test sample is an outlier).

Also, Covariance is currently the only truly unsupervised metric learning algorithm in metric-learn.
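For instance, something like this (a sketch, not tested; using distance-to-closest-training-sample as an out-of-domain score):

import numpy as np
from sklearn.metrics import pairwise_distances
from metric_learn import Covariance

# Unsupervised: fit the raw Mahalanobis metric on in-domain samples only.
cov = Covariance()
cov.fit(X_train)

# Under a Mahalanobis metric, distances are Euclidean in the embedding
# space, so we can transform once and use plain Euclidean distances.
E_train = cov.transform(X_train)
E_test = cov.transform(X_test)

# Outlier score: distance to the closest training sample
# (instead of the distance to the centroid).
scores = pairwise_distances(E_test, E_train).min(axis=1)
# Larger score = farther from every in-domain sample = more likely OOD.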

wdevazelhes commented 2 years ago

@johny-c thanks a lot for the commits! Sorry for my very late reply. Everything looks good to me; there's just this tiny point about the weights in the cost function (see comment https://github.com/scikit-learn-contrib/metric-learn/pull/309#discussion_r681524257), let me know what you think, as well as the extra test (https://github.com/scikit-learn-contrib/metric-learn/pull/309#discussion_r681592065), and it should be good!

johny-c commented 2 years ago

Hi @wdevazelhes , I added back the test with more samples and removed the callbacks part. Regarding the push loss weight, see my reply above; I guess I can state more clearly in the docstring what happens. Other than that, is it good to go?

bellet commented 2 years ago

Dear @johny-c, thanks a lot for the great work on this. Sorry I haven't had time yet to take a close look at this, but I will try to do so soon.

I don't remember if you provided somewhere an empirical comparison to the current implementation, showing for instance that it is faster in most cases while achieving similar accuracy? Of course, it is quite clear that your implementation is cleaner and easier to maintain :-)

bellet commented 2 years ago

Also, if I'm not mistaken, this implementation solves a slightly different optimization problem (the hinge loss is squared); can you confirm, @johny-c? If so, we should probably add a note about this in the documentation.

johny-c commented 2 years ago

Hi @bellet , thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons with the previous version. Do you have any specific datasets in mind? I can do this, but I'm not sure when I will have free time.

bellet commented 2 years ago

Hi @bellet , thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons with the previous version. Do you have any specific datasets in mind? I can do this, but I'm not sure when I will have free time.

Hi @johny-c, I was referring to the use of the squared hinge loss (that is, a squared max instead of a max in the constraint/loss), which is needed to make the loss differentiable.
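To make the distinction explicit (a reference sketch in the notation of the paper linked above: [z]_+ = max(z, 0), y_il = 1 iff x_i and x_l share a label, and j ~> i denoting target neighbors):

% Standard LMNN push loss (plain hinge):
\varepsilon_{\mathrm{push}}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l}
  (1 - y_{il}) \, \big[ 1 + d_L^2(x_i, x_j) - d_L^2(x_i, x_l) \big]_+

% Squared-hinge variant (smooth, hence friendlier to L-BFGS):
\varepsilon_{\mathrm{push}}^{\mathrm{sq}}(L) = \sum_{i,\, j \rightsquigarrow i} \sum_{l}
  (1 - y_{il}) \, \big[ 1 + d_L^2(x_i, x_j) - d_L^2(x_i, x_l) \big]_+^2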

bellet commented 2 years ago

@johny-c for a quick benchmark, I think a few standard classification datasets (like the ones in sklearn) would be sufficient. Just to show that the solutions found by this implementation lead to roughly the same KNN accuracy.