scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities
https://scikit-matter.readthedocs.io/en/v0.2.0/
BSD 3-Clause "New" or "Revised" License

Ensure FPS/CUR selected idxs are unique in the case of zero-score #224

Open jwa7 opened 6 months ago

jwa7 commented 6 months ago

Attempting to fix #206

Hello!

I ran into the problem Alex reported in #206 while working with the equisolve wrapper for TensorMap-based sample/feature selection (i.e. https://github.com/lab-cosmo/equisolve/blob/main/src/equisolve/numpy/sample_selection.py and related modules).

I have adapted Alex's example into a few unit tests for both FPS and CUR sample/feature selection, and attempted a fix. There is still something I don't understand, though: the FPS tests now pass, but a couple of (different) CUR tests still fail.
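For readers unfamiliar with the zero-score case, here is a minimal NumPy sketch of greedy farthest point sampling. It is not the skmatter implementation and not the fix in this PR; the function name `fps_unique` and the toy data are made up for illustration. It shows one way to keep selections unique: mask already-chosen rows with `-inf` so that `argmax` cannot return them again once all remaining scores are zero.

```python
import numpy as np


def fps_unique(X, n_to_select, initial=0):
    """Greedy farthest point sampling over the rows of X, without repeats.

    Hypothetical helper for illustration only, not skmatter's API.
    """
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if n_to_select > n_samples:
        raise ValueError("cannot select more points than there are samples")

    selected = [initial]
    # Squared distance from every row to its nearest selected row so far.
    min_dist = np.sum((X - X[initial]) ** 2, axis=1)
    # Mask chosen rows with -inf so argmax can never return them again,
    # even when every remaining score is exactly zero.
    min_dist[initial] = -np.inf

    for _ in range(n_to_select - 1):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        new_dist = np.sum((X - X[idx]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, new_dist)  # -inf entries stay -inf
        min_dist[idx] = -np.inf
    return np.array(selected)


# Three distinct rows plus three exact copies: once the distinct rows are
# taken, every remaining score is zero.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
idx = fps_unique(X, n_to_select=6)
assert len(np.unique(idx)) == len(idx)
```

Without the two `-inf` assignments, the scores in this toy example are all zero after three selections and `argmax` keeps returning index 0, which is the kind of repetition the unit tests are meant to catch.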

With this PR I was hoping to get some feedback/help from the skmatter dev team on this. Thanks! :)


rosecers commented 3 months ago

Are you still looking for feedback on this? It's good practice to tag one of us when you want input.

jwa7 commented 3 months ago

Hey! This isn't something I'm actively working on at the moment, but I'll be sure to ping you for feedback when I / we get back round to it :)

mhellstr commented 2 weeks ago

I had the problem that CUR sample selection would return duplicate selected indices, for example X.selected_idx_ == [ 801 936 1308 253 1000 480 183 486 303 977 43 366 734 243 363 88 859 440 798 398 709 263 796 383 214 654 854 508 865 384 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643 413 643]

I cannot comment on the validity of the code, but this branch solves the problem and returns unique indices instead: X.selected_idx_ == [ 801 936 1308 253 1000 480 183 486 303 977 43 366 734 243 363 88 859 440 798 398 709 263 796 383 214 654 854 508 865 384 413 643 216 555 408 564 892 716 673 99 107 386 137 55 1 499 368 390 359 218 237 130 530 661 439 311 318 542 669 830 268 208 215 903 418 855 994 664 631 879 199 306 869 258 884 1418 123 700 266 1282 608 519 351 389 5 652 553 257 934 24 1455 270 825 744 1114 470 757 572 1259 15]
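A quick way to check a selection for repeats like the 413/643 pair above is to count occurrences with `np.unique`. This is a small sketch; the helper name `duplicated_indices` and the short toy list are made up for illustration, and in practice the input would be the selector's index array.

```python
import numpy as np


def duplicated_indices(selected_idx):
    """Return the sorted values that occur more than once in selected_idx."""
    values, counts = np.unique(np.asarray(selected_idx), return_counts=True)
    return values[counts > 1]


# Toy input, not real selector output: two indices repeat.
toy = [801, 936, 413, 643, 413, 643]
dups = duplicated_indices(toy)
print(dups)
```

An empty result means every selected index is unique, so `duplicated_indices(...).size == 0` works as an assertion in a unit test.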