Quick Start Notebook - Githubissues

FarnazH commented 10 months ago

This PR contains the quick_start.ipynb showcasing various functionalities of the package alongside a clear comparison of methods. While working on this notebook, I improved the package code/docstring and fixed some bugs. These changes are directly pushed to the main branch so that we can move faster with our release. The method comparison figures (selecting from one cluster) have been added to the paper. There is still some work that needs to be done.

@marco-2023, can you please:

Complete the two TODO items comparing the sections through diversity measures.
Add an example of the selection method.
Review this notebook (feel free to go ahead and make changes and push to this branch).

@maximilianvz, can you please review this PR and share any comments you have on the notebook?

marco-2023 commented 10 months ago

@FarnazH @maximilianvz I went through the notebook and:

Added a function to render the tables (list of lists) as markdown cells
Completed the TODO items
Added an example of selection using the n-similarity methods
Used the n-similarity methods to compute diversity in one case. (this is extra and would like your opinion on this).

Can you tell me what you think about it after a quick look?

Should we calculate 4. for all selections or not use it altogether?
All n-similarity methods selected the same points. According to Ramon, this is to be expected due to the (low) dimensionality of the problem at hand. Should we just use one then?

Several things to note are:

I was only able to use two diversity measures from the diversity module. The rest need binary chains as elements.
The documentation of the n-similarity methods is not being generated on the website.

maximilianvz commented 10 months ago

@FarnazH and @marco-2023, I have several comments:

There are some typos in the notebook (I may have missed some, so I encourage others to check for this, too):

On 9 occasions, "medoid" is misspelled as "mediod".
In the explanation under Example 1: MaxMin Selector, it is stated that "This can a user-defined function or a sklearn.metrics.pairwise_distances function". This needs to be changed to "This can BE a user-defined function or a sklearn.metrics.pairwise_distances function".
In the documentation for the render_table function, the following is said: "The data to be rendered in a table, each inner list represents a row with the first row being the header. All" I believe the trailing "All" was written mistakenly.

In this part of the notebook, the header is "N-Similarity based methods". However, the title of the plot in this section is "Comparing (N)Similarity-Based Selectors". I'd appreciate consistency in how "N-Similarity Based"/"(N)Similarity-Based" is written throughout the notebook:

There is a warning in this part of the notebook (which doesn't seem to affect performance) that would be nice to have removed:

This is a small thing, but for new users, it may be helpful to include a sentence at the end of the explanation here that clarifies that larger values for logdet and smaller values for wdudcorrespond to greater diversity.

When comparing multiple selection methods on data within a single cluster, comparisons of selection diversity are given for distance-based, partition-based, and N-similarity-based methods. However, in the latter half of the notebook, where selection is done for data with multiple clusters, there is no such diversity comparison performed. It would probably be best to add this.
@marco-2023, you would know more about this than me, but you point out that the selected sets are the same for all similarity indices, which is a consequence of the data being low-dimensional. I assume this is unavoidable if we're using 2-dimensional data (which was done so things could be easily visualized), but I'm not sure how effectively this conveys the usefulness of the various similarity indices of the N-similarity-based methods offered by the package. It may not be worth the effort (i.e., I'm not adamant that things need to be changed here), but perhaps we could use a higher-dimensional example and sacrifice visualization in instances where we want to showcase the N-similarity-based methods and how indices affect diversity. If this were done, we should include a note warning that the choice of similarity index won't affect selection diversity when working in low-dimensional space. Alternatively, we can just use one similarity index and provide this warning.

(this is unrelated to the notebook) @FarnazH, correct me if I'm wrong, but I believe we should update the documentation of OptiSim, which currently states that the medoid centre is chosen as the initial point. Like DISE, OptiSim has a ref-index argument with a default value of zero, so it isn't guaranteed that the medoid center is the initial point (in most cases, I'd imagine it won't be):

FanwangM commented 3 months ago

When computing diversity, a distance matrix is used. We should use the feature matrix instead I think.

theochem / Selector

Quick Start Notebook #186