Closed nikosmichas closed 2 years ago
I don't think this is a good option, since this would just add redundant triplets. Furthermore, this scenario seems ill-posed for this basis generation scheme since on every iteration the eigenvectors are the same, in consequence, the added bases are the same.
SCML constructs a Mahalanobis matrix from a sparse composition of bases, so I would speculate that since the actual number of unique bases is small it would just select that subset.
I would propose two solutions, use a dimensionality reduction preprocessing or add a new basis generation scheme that essentially does the same as one iteration in the current scheme but over the full set of triplets. The current implementation with basis="triplet_diffs"
is just one way to generate the basis set and the parameter basis
is intended to select among options to generate this basis set.
Thanks for the suggestion @grudloff I tried dimensionality reduction and results look ok. Regarding the second solution, is there an example of something like this?
Finally, maybe these suggestions could be presented to the user instead of the exception that I posted above?
Hello @nikosmichas! Glad it helped. You just have to copy the current _generate_bases_dist_diff(triplets, X)
, remove the while-loop and replace all d_pos
for diff_pos
and d_neg
for diff_neg
and then the basis set is just all the positive-eigenvalued eigenvectors basis = v[pos_eig_mask]
. Then with this custom basis generation function you pass the basis set to the SCML instance through the basis attribute. Something like this:
basis = custom_basis_func(triplets, X)
scml = SCML(basis=basis)
scml.fit(triplets)
Remember though, that I speculate the behavior being the same as setting replace=True
but in a more efficient manner.
Yes! a more informative exception would be great. Would you be willing to contribute to it? Are you knowledgeable in test suites?
Thanks again @grudloff Tried this approach but doesn't seem to work for my case. It is much slower than using a PCA as a preprocessing, and the final results are worse. Opened a PR #350 for the exception message
We now return a proper error message (#350) and @grudloff provided alternative strategies so I think we can close this one. Thanks @nikosmichas
Summary
Use replace=True when taking sample data for SCML https://github.com/scikit-learn-contrib/metric-learn/blob/master/metric_learn/scml.py#L268
Use Cases
When would you use this?
When having data with large number of features, the following error occurs:
By changing the
replace
option toTrue
we will be able to use repeated data instances without raising an error. Is it correct to use such an approach?Message from the maintainers:
Want to see this feature happen? Give it a 👍. We prioritise the issues with the most 👍.