scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License
1.38k stars 231 forks source link

Error when taking sample data for SCML #348

Closed nikosmichas closed 2 years ago

nikosmichas commented 2 years ago

Summary

Use replace=True when taking sample data for SCML https://github.com/scikit-learn-contrib/metric-learn/blob/master/metric_learn/scml.py#L268

Use Cases

When would you use this?

When having data with large number of features, the following error occurs:

  File "venv/lib/python3.9/site-packages/metric_learn/scml.py", line 268, in _generate_bases_dist_diff
    select_triplet = rng.choice(n_triplets, size=n_features, replace=False)
  File "mtrand.pyx", line 959, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

By changing the replace option to True we will be able to use repeated data instances without raising an error. Is it correct to use such an approach?


Message from the maintainers:

Want to see this feature happen? Give it a 👍. We prioritise the issues with the most 👍.

grudloff commented 2 years ago

I don't think this is a good option, since this would just add redundant triplets. Furthermore, this scenario seems ill-posed for this basis generation scheme since on every iteration the eigenvectors are the same, in consequence, the added bases are the same.

SCML constructs a Mahalanobis matrix from a sparse composition of bases, so I would speculate that since the actual number of unique bases is small it would just select that subset.

I would propose two solutions, use a dimensionality reduction preprocessing or add a new basis generation scheme that essentially does the same as one iteration in the current scheme but over the full set of triplets. The current implementation with basis="triplet_diffs" is just one way to generate the basis set and the parameter basis is intended to select among options to generate this basis set.

nikosmichas commented 2 years ago

Thanks for the suggestion @grudloff I tried dimensionality reduction and results look ok. Regarding the second solution, is there an example of something like this?

Finally, maybe these suggestions could be presented to the user instead of the exception that I posted above?

grudloff commented 2 years ago

Hello @nikosmichas! Glad it helped. You just have to copy the current _generate_bases_dist_diff(triplets, X), remove the while-loop and replace all d_posfor diff_pos and d_negfor diff_neg and then the basis set is just all the positive-eigenvalued eigenvectors basis = v[pos_eig_mask]. Then with this custom basis generation function you pass the basis set to the SCML instance through the basis attribute. Something like this:

basis = custom_basis_func(triplets, X)
scml = SCML(basis=basis)
scml.fit(triplets)

Remember though, that I speculate the behavior being the same as setting replace=True but in a more efficient manner.

Yes! a more informative exception would be great. Would you be willing to contribute to it? Are you knowledgeable in test suites?

nikosmichas commented 2 years ago

Thanks again @grudloff Tried this approach but doesn't seem to work for my case. It is much slower than using a PCA as a preprocessing, and the final results are worse. Opened a PR #350 for the exception message

bellet commented 2 years ago

We now return a proper error message (#350) and @grudloff provided alternative strategies so I think we can close this one. Thanks @nikosmichas