Open · johny-c opened this 3 years ago
Thanks for the PR @johny-c! 👍
The failures seem to happen because test_auto_init_transformation creates toy examples with very few samples, and it happens that they contain singleton classes, which are removed by the method _validate_params from this PR; the data then no longer has the right format (e.g. number of classes) for initializations like LDA.
Putting the line X, y, classes = self._validate_params(X, y) just after self.components_ = _initialize_components(...) instead of after X, y = self._prepare_inputs(X, ...) seems to solve the problem. That is, the components_ are initialized with the original data, and the data is pruned only for the processing done by LMNN itself. However, this would change the behaviour of the initialization a bit (after this modification, it would take the singleton classes into account). But I think it makes sense: if we are able to fit such an init with singleton classes, why not do so?
That said, I'm wondering whether we shouldn't also take the singleton classes into account in LMNN itself. We can't form similar pairs from those samples (since there's just one), but maybe we could use them in dissimilar pairs? (If we don't take them into account, nothing will prevent those singletons from sitting inside a cluster of points from a different class.)
PS: the CI tests are failing because we need to update the repo to match the new scikit-learn API (see #311); we'll fix that soon.
Hi @wdevazelhes,
thanks for spotting the issue. I think the first solution would be best, namely initializing the components with all samples and then removing the singleton-class samples. This can only negatively affect the algorithm if the initialization uses class labels; so far, only LDA does.
Regarding taking singleton classes into account in LMNN as negative samples: I guess this could be done, but I think it would make the algorithm and code more complex, and I'm uncertain about any potential gains. Arguably, if one intends to use an algorithm that works with triplets, they would make sure that each class has at least 2 samples. I would even argue that throwing an exception when this is not satisfied would make sense. In the scikit-learn PR, the solution suggested and implemented was to just remove the singleton samples from the beginning.
@johny-c, thanks for updating the PR. I should have a bit more time by the middle of next week to fix #311, so that we can see the checks here go green.
About taking singleton classes into account as negative examples: I'm bothered, for instance, by use cases like person re-identification, for which many users are using metric-learn (https://github.com/scikit-learn-contrib/metric-learn/network/dependents?package_id=UGFja2FnZS01MjI5NjIxOA%3D%3D). In those cases, I think it would still be useful to take singleton classes into account, because there can be a lot of them: for instance, the "Labeled Faces in the Wild" dataset contains 5749 people in total but only 1680 people with two or more images (see http://vis-www.cs.umass.edu/lfw/#information). Dropping the singleton classes there would mean dropping 4069 samples... I fear that in this case it could cause a significant drop in performance...
@wdevazelhes, that's a good point. In practice, this can be done by considering classes to be the non-singleton classes when iterating over them in find_impostors, and considering impostor candidates with y != class_id instead of y > class_id. The issue with this change is efficiency: I had tried this in an early version of the scikit-learn PR and it makes the code about half as fast (which makes sense). Maybe this should be controlled by another user parameter (e.g. use_singletons)? I'm not sure.
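To make the trade-off concrete, here is a hypothetical sketch (not the PR's code) of the two candidate-selection schemes. With y != class_id every pair of classes is visited twice, which roughly explains the factor-of-two slowdown:

import numpy as np

def impostor_candidates(y, class_id, include_singletons):
    if include_singletons:
        # Samples of *any* other class can be impostors, including
        # singleton classes; each pair of classes is visited twice.
        return np.where(y != class_id)[0]
    # Current scheme: only classes with a larger label are considered,
    # so each pair of classes is visited once (about twice as fast).
    return np.where(y > class_id)[0]

y = np.array([0, 0, 1, 2, 2])
print(impostor_candidates(y, 1, include_singletons=True))   # [0 1 3 4]
print(impostor_candidates(y, 1, include_singletons=False))  # [3 4]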
@johny-c good question, I guess it would indeed be problematic if LMNN takes too long to run.
But I think that dropping only the singleton classes would modify the training data in a biased way (i.e. the modified training data would no longer be a true i.i.d. subsample of the original data), so maybe we would instead want to train on a random subsample of the original data? If so, we could leave LMNN as is (the version that keeps singleton classes) and let users subsample their data themselves.
However, this is maybe not ideal either: if there were a lot of classes with just two elements, they might become singleton classes, and we could lose many points for which we can form target neighbors...
In the end, in order to keep the code simple and keep the data as much as possible as is, I'm thinking maybe we could emit a warning if there are a lot of singleton classes, something like: The dataset contains many singleton classes (XX, or YY% of the dataset), which will only be used when computing impostors. Training may be longer because of them; you may want to drop some of them to make the training faster. (See the sketch below.)
Then it's up to the users to use their own strategy for dropping singleton classes?
But I'm not really sure...
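Concretely, a hypothetical helper along those lines (the names and the 10% threshold are made up, just to make the suggestion tangible):

import warnings
import numpy as np

def maybe_warn_singleton_classes(y, max_fraction=0.1):
    # Emit the suggested warning when singleton classes make up a large
    # fraction of the dataset.
    classes, counts = np.unique(y, return_counts=True)
    n_singletons = int(np.sum(counts == 1))
    fraction = n_singletons / len(y)
    if fraction > max_fraction:
        warnings.warn(
            "The dataset contains many singleton classes ({}, or {:.0f}% of "
            "the dataset), which will only be used when computing impostors. "
            "Training may be longer because of them; you may want to drop "
            "some of them to make the training faster.".format(
                n_singletons, 100 * fraction))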
@bellet, @perimosocordiae , @terrytangyuan , @nvauquie , what do you think about this ?
Yes, I think a warning is fine in this case, to keep the code less complex. We should probably not modify the original data, and leave it up to the user to decide what cleaning strategy to use.
Hello! When trying your PR, I get the following error. Run from Google Colab:
Code:
from sklearn.datasets import make_friedman1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install git+https://github.com/johny-c/metric-learn.git
def friedman_np_to_df(X, y):
    return pd.DataFrame(X, columns=['x0', 'x1', 'x2', 'x3', 'x4']), pd.Series(y)
# Make training set. We don't care about Y so call it NA.
X_train, NA = make_friedman1(n_samples=1000, n_features=5, random_state = 1)
X_train, NA = friedman_np_to_df(X_train,NA)
#categorize training set based off of x0
domain_list = []
for i in range(len(X_train)):
    if X_train.iloc[i]['x0'] < 0.6:
        domain_list.append(1)
    else:
        domain_list.append(0)
X_train['domain'] = domain_list
# Set training set to where domain == 1 (x0 < 0.6), but also add ~60 out-of-domain samples (X_train['domain'] == 0)
out_of_domain = X_train[X_train['domain'] == 0][:60]
X_train = X_train[X_train['domain']==1]
X_train = pd.concat([out_of_domain, X_train])
y_train = X_train.copy()
X_train = X_train.drop(columns = ['domain'])
y_train = y_train['domain']
# Make testing set with a different random_state
X_test, NA2 = make_friedman1(n_samples=1000, n_features=5, random_state = 3)
X_test, NA2 = friedman_np_to_df(X_test,NA2)
#categorize testing set based off of x0
domain_list = []
for i in range(len(X_test)):
    if X_test.iloc[i]['x0'] < 0.6:
        domain_list.append(1)
    else:
        domain_list.append(0)
X_test['domain'] = domain_list
y_test = X_test['domain'].copy()
X_test = X_test.drop(columns = ['domain'])
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from metric_learn import LMNN
lmnn_knn = Pipeline(steps=[('lmnn', LMNN()), ('knn', KNeighborsClassifier())])
parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
'lmnn__k':[2,3],
'knn__n_neighbors':[2,3],
'knn__weights': ['uniform','distance'],
'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': [28,30,32],
'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
Output:
Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 196 tasks | elapsed: 4.3min
[Parallel(n_jobs=-1)]: Done 446 tasks | elapsed: 9.9min
[Parallel(n_jobs=-1)]: Done 796 tasks | elapsed: 17.6min
[Parallel(n_jobs=-1)]: Done 1246 tasks | elapsed: 27.5min
[Parallel(n_jobs=-1)]: Done 1796 tasks | elapsed: 39.5min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 50.6min finished
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-0d2c55d1bedc> in <module>()
63 'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
64 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 65 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
66 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
67
6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
734 # of the params are estimators as well.
735 self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736 **self.best_params_))
737 refit_start_time = time.time()
738 if y is not None:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
69 new_object_params = estimator.get_params(deep=False)
70 for name, param in new_object_params.items():
---> 71 new_object_params[name] = clone(param, safe=False)
72 new_object = klass(**new_object_params)
73 params_set = new_object.get_params(deep=False)
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
70 for name, param in new_object_params.items():
71 new_object_params[name] = clone(param, safe=False)
---> 72 new_object = klass(**new_object_params)
73 params_set = new_object.get_params(deep=False)
74
TypeError: __init__() got an unexpected keyword argument 'init'
Hi @angelotc, thanks for trying this out. One issue that I can see is that the number-of-neighbors parameter is n_neighbors instead of k, to be consistent with scikit-learn naming (lmnn__k --> lmnn__n_neighbors).
Did not realize that the parameters were renamed in this PR! Now I am getting the following, though init should be a parameter, no?
New param grid:
parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
'lmnn__n_neighbors':[2,3],
'knn__n_neighbors':[2,3],
'knn__weights': ['uniform','distance'],
'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': [28,30,32],
'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
Output:
Collecting git+https://github.com/johny-c/metric-learn.git
Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-fitppg1y
Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-fitppg1y
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
Building wheel for metric-learn (setup.py) ... done
Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=5f4304f1854d9baf2addf29205be74f8606b616a37bb4654e9785382741e3806
Stored in directory: /tmp/pip-ephem-wheel-cache-l_uxezae/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 196 tasks | elapsed: 4.6min
[Parallel(n_jobs=-1)]: Done 446 tasks | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 796 tasks | elapsed: 18.8min
[Parallel(n_jobs=-1)]: Done 1246 tasks | elapsed: 29.4min
[Parallel(n_jobs=-1)]: Done 1796 tasks | elapsed: 42.3min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 54.2min finished
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-6e7010734268> in <module>()
65 'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
66 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 67 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
68 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
69
6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
734 # of the params are estimators as well.
735 self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736 **self.best_params_))
737 refit_start_time = time.time()
738 if y is not None:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
69 new_object_params = estimator.get_params(deep=False)
70 for name, param in new_object_params.items():
---> 71 new_object_params[name] = clone(param, safe=False)
72 new_object = klass(**new_object_params)
73 params_set = new_object.get_params(deep=False)
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
57 # XXX: not handling dictionaries
58 if estimator_type in (list, tuple, set, frozenset):
---> 59 return estimator_type([clone(e, safe=safe) for e in estimator])
60 elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
61 if not safe:
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
70 for name, param in new_object_params.items():
71 new_object_params[name] = clone(param, safe=False)
---> 72 new_object = klass(**new_object_params)
73 params_set = new_object.get_params(deep=False)
74
TypeError: __init__() got an unexpected keyword argument 'init'
Yes, it should. I'm not sure what's going wrong here. Have you tried isolating the problem by removing all other parts, like GridSearchCV and Pipeline?
Tried that!
!pip install git+https://github.com/johny-c/metric-learn.git
from metric_learn import LMNN
inits = ['pca', 'lda', 'identity', 'random']
for init_param in inits:
    lmnn = LMNN(init=init_param)
Same error.
Collecting git+https://github.com/johny-c/metric-learn.git
Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-g4w95cib
Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-g4w95cib
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
Building wheel for metric-learn (setup.py) ... done
Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=3a82cbd80fba81b5798d874527b3586c34e09488a2d11d28cafe649cb036c416
Stored in directory: /tmp/pip-ephem-wheel-cache-bgggttl6/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-b9501fab5074> in <module>()
4 inits = ['pca', 'lda', 'identity', 'random']
5 for init_param in inits:
----> 6 lmnn = LMNN(init=init_param)
7
8
TypeError: __init__() got an unexpected keyword argument 'init'
Also, not sure if it's worth noting, but before that I was getting errors from the SDML import in __init__.py. So I just commented out the SDML import.
@johny-c #313 was merged, so if you merge with master the tests should go green now. Actually, I'm thinking we could maybe merge the PR as is, in the end (if everyone agrees), and deal with the singleton classes and the warning message mentioned above later, since the current LMNN also doesn't support singleton classes either (see comment https://github.com/scikit-learn-contrib/metric-learn/pull/313#issuecomment-814762734):
the current LMNN is even more strict than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually handle more cases, and we could deal with singleton classes in another PR.
@angelotc thanks for investigating. I tried to run your snippet just above, and realized the pip install actually didn't install this PR but the master from @johny-c, which is an old metric-learn version (hence the bugs). Replacing your first line with !pip install git+https://github.com/johny-c/metric-learn.git@lmnn, it worked; maybe you can try it again on your full example?
@wdevazelhes, that's great! I need to make a few changes before pushing - the max_impostors param is crucial for memory and should be included. I'll push within the next 2-3 days.
@wdevazelhes Wow, thank you so much! Got around .93 AUC on my synthetic dataset example. Someone in my group also created a Mahalanobis-distance-based classifier with similar performance; I wanted to compare this to it, and it looks promising. Hopefully this will work well on our materials datasets.
the current LMNN is even more strict than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually allow to deal with more cases, and we could deal with singleton classes in another PR))
You make a good point here that there is a minimum number of samples needed in each class. In my binary classification example, I initially tried training with 600 samples of class 1 and testing on 600 samples of class 1 and 400 samples of class 0. This doesn't work due to the nature of the algorithm, so I received an error. I then tried training on 600 samples of class 0 and ran additional simulations where I kept adding more samples of the other class; these were my results:
So, for x = 0.01, this means that if there were 600 samples of class 0 in the training set, 1% of the samples from the other class were also added to the training set (6 samples of class 1, in this case).
If you want to review again, it's ready from my side.
@angelotc I'm glad it worked quite well, and indeed, I guess negative pairs are very important to learn the decision frontier; interesting curve! It's nice to see you can get a good AUC with so few samples from the other class, although I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the train set; is that due to experimental constraints, for instance?)
If you want to review again, it's ready from my side.
Thanks a lot @johny-c! I'll try to do a review this week.
@angelotc I'm glad it worked quite well, and indeed, I guess negative pairs are very important to learn the decision frontier; interesting curve! It's nice to see you can get a good AUC with so few samples from the other class, although I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the train set; is that due to experimental constraints, for instance?)
Hello, my academic research group is interested in the field of uncertainty quantification in materials sciences, but the materials field has small sample sizes; the dataset that I am working with only has around ~400 in-domain samples and <10 out-of-domain samples. My predecessors have actually treated this as a regression problem, and that has been published here. We sort of treat it as an unsupervised learning problem, but my PI told me that it would be fun to think of it as a supervised classification problem (which I am exploring with metric-learn). We actually had more success with just using the raw Mahalanobis distance score of each testing set sample to the training set's (in-domain samples) centroid, but still this was a fun journey.
Coincidentally, this paper basically models what we are attempting to explore, with the exception that ours is in the materials field, not dialog systems (though I do work with dialog systems). Pretty funny that it was written by Huawei people as well, seeing that is where you work.
@angelotc I see, thanks for the reference; this out-of-domain detection use case of metric learning sounds interesting indeed!
Btw, we just opened the GitHub discussions section of metric-learn: https://github.com/scikit-learn-contrib/metric-learn/discussions. Don't hesitate to use it if you have any remark to discuss regarding metric-learn ;)
Hello, I am trying this out on another dataset now, for cuprate classification of superconductors. I'm running into an issue where the ind_a and ind_b arrays are getting corrupted. I believe this is happening in lines 525-526 of the lmnn.py file.
Here is the output from the error. I can share my notebook and dataset file via email, since I don't know if the prof wants me to share either publicly.
Fitting 2 folds for each of 288 candidates, totalling 576 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 68 tasks | elapsed: 4.0s
[Parallel(n_jobs=-1)]: Done 368 tasks | elapsed: 16.9s
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed: 25.8s finished
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-21-6c1595135885> in <module>()
16
17 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='roc_auc')
---> 18 grid_lmnn_knn = grid_lmnn_knn.fit(np.array(X_train), np.array(y_train))
19 roc_auc_score = grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
20 results.append(roc_auc_score)
14 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
737 refit_start_time = time.time()
738 if y is not None:
--> 739 self.best_estimator_.fit(X, y, **fit_params)
740 else:
741 self.best_estimator_.fit(X, **fit_params)
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
348 This estimator
349 """
--> 350 Xt, fit_params = self._fit(X, y, **fit_params)
351 with _print_elapsed_time('Pipeline',
352 self._log_message(len(self.steps) - 1)):
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
313 message_clsname='Pipeline',
314 message=self._log_message(step_idx),
--> 315 **fit_params_steps[name])
316 # Replace the transformer of the step with the fitted
317 # transformer. This is necessary when loading the transformer
/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
572 else:
573 # fit method of arity 2 (supervised transformation)
--> 574 return self.fit(X, y, **fit_params).transform(X)
575
576
/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in fit(self, X, y)
220 # Call the optimizer
221 self.n_iter_ = 0
--> 222 opt_result = minimize(**optimizer_params)
223
224 # Reshape the solution found by the optimizer
/usr/local/lib/python3.7/dist-packages/scipy/optimize/_minimize.py in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
608 elif meth == 'l-bfgs-b':
609 return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 610 callback=callback, **options)
611 elif meth == 'tnc':
612 return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,
/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
343 # until the completion of the current minimization iteration.
344 # Overwrite f and g:
--> 345 f, g = func_and_grad(x)
346 elif task_str.startswith(b'NEW_X'):
347 # new iteration
/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in func_and_grad(x)
293 else:
294 def func_and_grad(x):
--> 295 f = fun(x, *args)
296 g = jac(x, *args)
297 return f, g
/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
325 def function_wrapper(*wrapper_args):
326 ncalls[0] += 1
--> 327 return function(*(wrapper_args + args))
328
329 return ncalls, function_wrapper
/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in __call__(self, x, *args)
63 def __call__(self, x, *args):
64 self.x = numpy.asarray(x).copy()
---> 65 fg = self.fun(x, *args)
66 self.jac = fg[1]
67 return fg[0]
/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _loss_grad_lbfgs(self, L, X, y, classes, target_neighbors, pull_loss_grad_m)
340 # Find impostors
341 impostors_graph = self._find_impostors(X_embedded, y, classes,
--> 342 dist_tn[:, -1])
343
344 # Compute the push loss and its gradient
/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors(self, X_embedded, y, classes, margin_radii)
428
429 ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430 ind_b, ind_a)
431
432 n_impostors = len(ind_impostors)
/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors_blockwise(X, radii, ind_a, ind_b, return_distance, block_size)
520 for chunk in gen_batches(n_samples_a, block_n_rows):
521
--> 522 dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
523 Y_norm_squared=X_b_norm_squared)
524
IndexError: index 947380 is out of bounds for axis 0 with size 5314
The array looks fine when output from _find_impostors, but not within _find_impostors_blockwise, leading me to think it is being corrupted at lines 525-526:
> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(522)_find_impostors_blockwise()
520 for chunk in gen_batches(n_samples_a, block_n_rows):
521
--> 522 dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
523 Y_norm_squared=X_b_norm_squared)
524
ipdb> ind_a
array([ 2669, 3386, 5540, 8190, 21013, 22068, 23031,
23576, 27355, 28453, 28894, 32398, 34464, 34579,
35427, 35972, 41765, 43127, 43962, 46115, 46596,
47108, 47729, 48406, 59039, 61196, 64114, 77970,
78425, 82708, 86698, 90830, 92560, 93573, 94290,
99473, 100018, 109465, 109803, 110348, 112846, 115199,
117224, 117679, 121861, 121941, 130463, 131008, 135818,
136273, 145117, 151123, 151668, 157679, 158160, 161074,
165218, 166808, 167263, 169382, 171404, 171975, 172669,
173073, 173514, 179810, 184376, 190675, 191392, 195356,
197899, 201930, 202385, 202536, 206062, 206517, 207223,
207934, 208636, 209163, 209289, 209857, 210000, 210702,
214834, 217141, 217818, 224280, 224685, 227186, 231318,
233428, 233955, 234081, 234649, 234792, 235053, 235494,
236187, 236668, 238213, 238924, 239626, 240040, 241327,
243714, 247382, 247837, 250483, 251581, 251759, 254088,
255400, 257336, 257741, 261945, 263951, 265051, 265768,
268042, 268497, 268885, 269430, 270951, 271496, 279215,
279760, 288326, 292458, 295364, 297857, 298127, 299540,
305110, 305225, 310476, 312271, 312816, 314079, 315132,
316166, 317250, 318609, 319971, 328420, 330600, 332367,
336220, 338794, 340958, 343003, 344056, 345625, 346342,
348496, 358511, 358738, 361476, 361934, 361995, 363221,
363357, 365144, 365599, 368371, 369784, 377785, 382148,
386268, 395103, 395797, 396201, 402938, 404022, 405599,
406080, 406592, 409956, 411974, 417829, 418927, 419466,
422992, 423447, 427632, 429654, 430880, 431384, 433786,
435388, 435843, 442050, 445759, 455408, 458619, 459275,
460688, 463089, 463634, 469479, 470577, 471018, 475248,
481304, 491050, 491165, 493744, 494079, 494624, 497832,
497974, 504409, 504954, 507859, 507970, 508541, 509086,
512673, 513218, 519229, 519710, 525786, 526012, 528238,
528353, 529519, 530932, 531625, 532106, 534556, 535011,
535539, 536901, 538211, 538253, 538480, 538967, 545350,
546952, 547407, 549482, 549861, 550406, 552774, 553179,
553756, 556002, 561414, 563143, 564641, 565209, 565352,
566054, 567236, 568773, 569341, 569484, 570186, 571333,
572267, 574701, 574971, 575539, 576384, 578785, 579330,
581169, 581737, 581880, 582582, 582680, 585301, 586714,
588736, 589163, 590218, 592284, 595505, 597379, 597924,
601829, 603242, 605835, 606919, 606933, 607374, 609854,
610823, 611141, 614225, 614793, 614936, 615638, 616820,
618231, 618357, 618925, 619068, 619329, 619770, 622529,
623010, 626303, 626848, 628111, 638364, 638839, 639445,
640201, 645195, 645912, 649347, 649915, 650760, 651883,
652110, 660462, 665178, 674924, 675039, 675887, 676432,
679684, 680317, 681034, 684963, 685897, 686357, 688601,
689169, 689312, 690014, 690112, 693262, 694481, 695026,
697770, 698225, 702487, 703540, 705003, 705697, 706101,
711327, 711895, 712038, 712740, 714762, 718430, 718885,
723026, 723597, 723723, 724291, 724434, 724695, 725136,
725226, 726002, 726283, 727217, 727835, 730826, 731281,
732516, 735993, 736119, 736687, 736830, 737091, 737532,
738165, 738655, 738882, 739218, 743222, 743677, 748495,
749212, 752687, 753168, 754693, 755410, 757564, 757679,
760255, 762851, 762977, 763545, 763688, 763949, 764390,
764725, 765270, 768067, 772146, 772601, 777957, 779235,
780248, 780290, 781253, 781798, 783367, 783637, 785050,
787499, 788554, 789517, 790062, 791662, 791723, 792949,
793085, 797183, 799847, 800392, 802211, 802928, 809214,
809329, 813974, 814449, 815055, 815324, 816375, 816920,
820249, 821771, 822122, 824304, 824779, 826141, 826705,
827250, 828819, 829832, 829874, 830885, 831940, 834590,
834969, 835514, 837175, 838537, 840832, 841359, 842443,
842457, 845439, 846801, 847683, 849096, 850492, 850534,
851497, 852042, 854410, 857174, 861448, 864639, 864681,
864908, 866494, 867310, 867648, 869248, 869703, 872349,
872475, 873043, 873186, 873447, 873888, 875446, 875901,
878355, 878900, 882535, 882805, 883373, 884218, 886252,
890153, 897247, 897964, 898300, 900233, 900725, 903339,
904437, 910062, 910288, 914060, 915841, 916078, 917609,
918154, 920487, 921421, 922270, 932957, 937264, 937306,
944765, 945002, 945818, 947380, 949875, 952779, 953049,
954462, 959227, 959464, 960280, 961187, 961313, 961881,
962024, 962285, 962726, 963253, 963379, 963947, 964090,
964351, 964792, 966188, 966230, 968411, 970990, 973012,
973689, 974406, 975199, 983386, 984939, 990111, 990237,
990805, 990948, 991650, 993208, 993663, 996309, 997407,
997585, 998995, 999406, 999861, 999929, 1005604, 1006059,
1006745, 1007462, 1011426, 1011831, 1012785, 1017995, 1026144,
1026259, 1030020, 1034474, 1039124, 1039245, 1043714, 1045001,
1046762, 1046804])
ipdb> ind_b
array([ 80584, 80713, 80727, 80830, 80989, 81022, 81155, 81156,
81386, 81417, 81542, 81544, 81589, 81654, 81655, 81717,
81731, 81740, 81744, 81770, 81875, 81893, 81935, 81995,
82005, 82009, 82034, 82073, 82122, 82237, 82256, 82260,
82290, 82307, 82310, 82318, 82429, 82563, 261040, 261945,
447398, 448136, 455408, 506794, 525096, 525106, 525786, 525858,
526012, 526196, 526639, 526686, 698398, 698430, 698545, 698596,
698608, 698667, 698687, 698771, 698851, 698884, 699040, 699103,
699122, 699132, 699153, 699199, 699350, 699399, 699437, 699715,
699737, 699976, 699981, 700117, 700150, 700313, 822122, 909129,
909443, 909907, 910062, 910127, 910288, 910806, 911031, 966188])
ipdb> up
> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(430)_find_impostors()
428
429 ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430 ind_b, ind_a)
431
432 n_impostors = len(ind_impostors)
ipdb> ind_a
array([ 1, 3, 5, ..., 5298, 5306, 5313])
ipdb> ind_b
array([ 0, 2, 4, ..., 5310, 5311, 5312])
ipdb>
Hi @angelotc, thanks for catching this. I think I know what's wrong. I changed something in the last commit, but it's easy to fix.
@angelotc, did you try the new commit from @johny-c on your example again? It would be interesting to know if it works now.
@angelotc, did you try the new commit from @johny-c on your example again? It would be interesting to know if it works now.
Yes, the algorithm performs well, with .98 AUC in distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it perform at detecting the risk of out-of-domain samples? Out-of-domain samples can be of different classes. Through my research, Isolation Forest does not work well, since it's highly correlated with the Euclidean distance from the centroid. So we have started looking at the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.
Yes, the algorithm performs well, with .98 AUC in distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it perform at detecting the risk of out-of-domain samples? Out-of-domain samples can be of different classes. Through my research, Isolation Forest does not work well, since it's highly correlated with the Euclidean distance from the centroid. So we have started looking at the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.
@angelotc Sorry for the very late reply. Indeed, the Covariance method can be interesting, since it returns a raw Mahalanobis metric computed from the training samples; so if it works well as a metric to the centroid in some cases, maybe it would even work well as a general metric (e.g. for computing distances of test samples to the closest sample from the training set, not necessarily the centroid, to see whether a test sample is an outlier or not).
Also, Covariance is currently the only truly unsupervised metric learning algorithm in metric-learn.
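For reference, a minimal sketch of that idea with metric-learn's Covariance (the nearest-sample scoring helper below is my own illustration, not part of the library, and the random data is a stand-in for real in-domain samples):

import numpy as np
from metric_learn import Covariance

rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)  # stand-in for the in-domain training samples

cov = Covariance()
cov.fit(X_train)  # unsupervised: no labels needed
M = cov.get_mahalanobis_matrix()

def mahalanobis_to_nearest(x, X, M):
    # Outlier score: Mahalanobis distance of x to its nearest training
    # sample, instead of to the centroid.
    diffs = X - x
    d2 = np.einsum('ij,jk,ik->i', diffs, M, diffs)
    return np.sqrt(d2.min())

x_test = rng.rand(5)
print(mahalanobis_to_nearest(x_test, X_train, M))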
@johny-c thanks a lot for the commits! Sorry for my very late reply. Everything looks good to me; there's just this tiny point about the weights in the cost function (see comment https://github.com/scikit-learn-contrib/metric-learn/pull/309#discussion_r681524257), let me know what you think, as well as the extra test (https://github.com/scikit-learn-contrib/metric-learn/pull/309#discussion_r681592065), and it should be good!
Hi @wdevazelhes, I added back the test with more samples and removed the callbacks part. Regarding the push loss weight, see my reply above; I guess I can state more clearly in the docstring what happens. Other than that, is it good to go?
Dear @johny-c, thanks a lot for the great work on this. Sorry I haven't had time yet to take a close look at this but I will try to do this soon.
I don't remember if you provided somewhere an empirical comparison to the current implementation, showing for instance that it is faster in most cases while achieving similar accuracy? Of course, it is quite clear that your implementation is cleaner and easier to maintain :-)
Also, if I'm not mistaken, this implementation solves a slightly different optimization problem (the hinge loss is squared); can you confirm, @johny-c? If so, we should probably add a note about this in the documentation.
Hi @bellet, thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons to the previous version. Do you have any specific data sets in mind? I can do this, but I'm not sure when I will have free time.
Hi @bellet, thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons to the previous version. Do you have any specific data sets in mind? I can do this, but I'm not sure when I will have free time.
Hi @johny-c, I was referring to the use of the squared hinge loss (that is, a max squared instead of a max in the constraint/loss), which is needed to make the loss differentiable.
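For readers following along, the difference in the push (impostor) term looks roughly like this, in the notation of the Weinberger & Saul LMNN paper (target neighbors $j \rightsquigarrow i$, impostors $l$ with $y_l \neq y_i$); the squared variant is what makes the loss differentiable at the margin:

$$
\varepsilon_{\text{push}}^{\text{hinge}}(L) = \sum_{i,\,j \rightsquigarrow i}\ \sum_{l:\,y_l \neq y_i} \Big[\, 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \,\Big]_+
$$

$$
\varepsilon_{\text{push}}^{\text{squared}}(L) = \sum_{i,\,j \rightsquigarrow i}\ \sum_{l:\,y_l \neq y_i} \Big[\, 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \,\Big]_+^2, \qquad [z]_+ = \max(z, 0).
$$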
@johny-c for a quick benchmark, I think a few standard classification datasets (like the ones in sklearn) would be sufficient, just to show that the solutions found by this implementation lead to roughly the same KNN accuracy.
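A possible quick benchmark along those lines (a sketch only: it assumes this PR's n_neighbors parameter, and the dataset choice and CV settings are arbitrary):

from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from metric_learn import LMNN

# Compare KNN accuracy with and without the learned LMNN transformation.
for loader in (load_iris, load_wine):
    X, y = loader(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=3)
    acc_plain = cross_val_score(knn, X, y, cv=5).mean()
    pipe = make_pipeline(LMNN(n_neighbors=3), KNeighborsClassifier(n_neighbors=3))
    acc_lmnn = cross_val_score(pipe, X, y, cv=5).mean()
    print('%s: knn %.3f, lmnn+knn %.3f' % (loader.__name__, acc_plain, acc_lmnn))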
Hi all,
this is a PR to replace LMNN. I submitted this version to scikit-learn some time ago (see https://github.com/scikit-learn/scikit-learn/pull/8602) and @bellet suggested submitting it here too. I removed some of the bells and whistles submitted to scikit-learn to make the code more readable and more easily maintainable, while following the conventions of metric-learn.
There are currently 9 tests failing, all on the same test function, namely the components initialization via PCA or LDA. I'm not sure what the appropriate way to handle these failures in metric-learn is. I'm guessing other algorithms have faced the same failures, so I'm starting this PR as is and awaiting comments.