theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0

Quick Start Notebook #186

Closed FarnazH closed 1 month ago

FarnazH commented 10 months ago

This PR adds quick_start.ipynb, which showcases the package's main functionality alongside a clear comparison of methods. While working on this notebook, I improved the package code and docstrings and fixed some bugs. These changes were pushed directly to the main branch so that we can move faster with our release. The method comparison figures (selecting from one cluster) have been added to the paper. Some work still remains.

@marco-2023, can you please:

  1. Complete the two TODO items comparing the sections through diversity measures.
  2. Add an example of the selection method.
  3. Review this notebook (feel free to go ahead and make changes and push to this branch).

@maximilianvz, can you please review this PR and share any comments you have on the notebook?

marco-2023 commented 10 months ago

@FarnazH @maximilianvz I went through the notebook and:

  1. Added a function to render the tables (list of lists) as markdown cells
  2. Completed the TODO items
  3. Added an example of selection using the n-similarity methods
  4. Used the n-similarity methods to compute diversity in one case (this is extra, and I would like your opinion on it).
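
For readers without the notebook open, the table helper from item 1 can be pictured roughly as follows. This is only a minimal sketch, not the notebook's actual code: the name `render_table_sketch`, its signature, and the sample rows are all hypothetical, and it assumes the first inner list is the header row.

```python
def render_table_sketch(data):
    """Turn a list of lists into a Markdown table string.

    Hypothetical helper for illustration; assumes data[0] is the header row.
    """
    header, *rows = data
    lines = [
        "| " + " | ".join(str(cell) for cell in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Placeholder rows, not results from the notebook.
table = render_table_sketch([["Method", "Value"], ["method A", 1], ["method B", 2]])
print(table)
```

In a notebook, the returned string could then be passed to `IPython.display.Markdown` to render it as a formatted cell.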

Can you tell me what you think about it after a quick look?

Several things to note are:

  1. I was only able to use two diversity measures from the diversity module. The rest need binary chains as elements.
  2. The documentation of the n-similarity methods is not being generated on the website.

maximilianvz commented 10 months ago

@FarnazH and @marco-2023, I have several comments:

  1. On 9 occasions, "medoid" is misspelled as "mediod".
  2. In the explanation under Example 1: MaxMin Selector, it is stated that "This can a user-defined function or a sklearn.metrics.pairwise_distances function". This needs to be changed to "This can BE a user-defined function or a sklearn.metrics.pairwise_distances function".
  3. In the documentation for the render_table function, the following is said: "The data to be rendered in a table, each inner list represents a row with the first row being the header. All". I believe the trailing "All" was written mistakenly.
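
As context for item 2, here is a rough sketch of what "a user-defined function" for distances might look like, together with a plain greedy MaxMin loop that consumes the resulting matrix. It is not the Selector implementation or API: `euclidean_distance_matrix` and `maxmin_select` are hypothetical names, the starting index is an arbitrary assumption, and the data are made up.

```python
import numpy as np


def euclidean_distance_matrix(features):
    """User-defined distance function: features (n, d) -> distances (n, n)."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def maxmin_select(dist, size, start=0):
    """Greedy MaxMin sketch: repeatedly add the point whose distance to its
    nearest already-selected point is largest."""
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = dist[start].astype(float).copy()
    while len(selected) < size:
        candidates = min_dist.copy()
        candidates[selected] = -np.inf  # never pick the same point twice
        nxt = int(np.argmax(candidates))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected


# Made-up points on a line, for illustration only.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
dist = euclidean_distance_matrix(X)
picked = maxmin_select(dist, size=3)
print(picked)
```

For the Euclidean metric, `sklearn.metrics.pairwise_distances(X)` should produce an equivalent matrix, which is the other option the notebook text mentions.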

FanwangM commented 3 months ago

When computing diversity, a distance matrix is currently used. I think we should use the feature matrix instead.
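
To illustrate why the choice of input matters, the sketch below uses a stand-in measure (mean pairwise Euclidean distance) that expects a feature matrix of shape (n_samples, n_features). It is not one of the package's diversity measures; `mean_pairwise_distance` and the data are hypothetical. Passing a distance matrix to such a function runs without error but treats each row of distances as a feature vector, so the value changes.

```python
import numpy as np


def mean_pairwise_distance(features):
    """Stand-in diversity measure that expects a feature matrix (n, d)."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(features), k=1)  # each pair counted once
    return dist[upper].mean()


# Made-up feature matrix and the distance matrix derived from it.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

from_features = mean_pairwise_distance(X)   # intended input
from_distances = mean_pairwise_distance(D)  # distance matrix misread as features
print(from_features, from_distances)
```

The two numbers differ even though both calls succeed, which is the kind of silent mismatch that settling on one input convention would avoid.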