Closed FarnazH closed 1 month ago
@FarnazH @maximilianvz I went through the notebook and:
Can you tell me what you think about it after a quick look?
Several things to note are:
`diversity` module. The rest need binary chains as elements.
@FarnazH and @marco-2023, I have several comments:
`sklearn.metrics.pairwise_distances` function". This needs to be changed to "This can BE a user-defined function or a `sklearn.metrics.pairwise_distances` function".
`render_table` function, the following is said: "The data to be rendered in a table, each inner list represents a row with the first row being the header. All" I believe the trailing "All" was written mistakenly.
`logdet` and smaller values for `wdud` correspond to greater diversity.
When comparing multiple selection methods on data within a single cluster, comparisons of selection diversity are given for distance-based, partition-based, and N-similarity-based methods. However, in the latter half of the notebook, where selection is done for data with multiple clusters, there is no such diversity comparison performed. It would probably be best to add this.
@marco-2023, you would know more about this than me, but you point out that the selected sets are the same for all similarity indices, which is a consequence of the data being low-dimensional. I assume this is unavoidable if we're using 2-dimensional data (which was done so things could be easily visualized), but I'm not sure how effectively this conveys the usefulness of the various similarity indices of the N-similarity-based methods offered by the package. It may not be worth the effort (i.e., I'm not adamant that things need to be changed here), but perhaps we could use a higher-dimensional example and sacrifice visualization in instances where we want to showcase the N-similarity-based methods and how indices affect diversity. If this were done, we should include a note warning that the choice of similarity index won't affect selection diversity when working in low-dimensional space. Alternatively, we can just use one similarity index and provide this warning.
`OptiSim`, which currently states that the medoid centre is chosen as the initial point. Like `DISE`, `OptiSim` has a `ref_index` argument with a default value of zero, so it isn't guaranteed that the medoid center is the initial point (in most cases, I'd imagine it won't be):
When computing diversity, a distance matrix is used. I think we should use the feature matrix instead.
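To illustrate the `OptiSim` point above, here is a minimal sketch showing that a reference index of 0 and the medoid generally refer to different samples. It does not call the package; `medoid_index` and the toy `points` are hypothetical names made up for this example, and the medoid is taken to be the sample with the smallest summed Euclidean distance to all samples.

```python
from math import dist


def medoid_index(points):
    """Return the index of the point with the smallest summed Euclidean distance to all points."""
    totals = [sum(dist(p, q) for q in points) for p in points]
    return totals.index(min(totals))


# Toy 2D data (an assumption for illustration); row 0 is a deliberate outlier.
points = [(10.0, 10.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]

default_ref_index = 0  # the default value discussed in the comment above
medoid = medoid_index(points)

print(f"default reference index = {default_ref_index}, medoid index = {medoid}")
```

Unless the data happen to be ordered so that the medoid comes first, the two indices differ, which is why the docstring wording matters.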
This PR contains the `quick_start.ipynb` notebook, which showcases various functionalities of the package alongside a clear comparison of methods. While working on this notebook, I improved the package code and docstrings and fixed some bugs. These changes were pushed directly to the `main` branch so that we can move faster with our release. The method comparison figures (selecting from one cluster) have been added to the paper. There is still some work that needs to be done.
@marco-2023, can you please:
`TODO` items comparing the sections through diversity measures.
@maximilianvz, can you please review this PR and share any comments you have on the notebook?
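For the diversity comparison discussed in this thread, the following is a rough sketch of a log-determinant score computed directly from a feature matrix. The form used here, log det(XᵀX + I), is an assumption and may not match the package's `logdet`; `logdet_score` and the two toy selections are hypothetical names for illustration only.

```python
import numpy as np


def logdet_score(features):
    """Compute log det(X^T X + I) for X of shape (n_samples, n_features)."""
    x = np.asarray(features, dtype=float)
    gram = x.T @ x + np.eye(x.shape[1])
    # slogdet returns (sign, log|det|); gram is positive definite, so sign is +1.
    _, value = np.linalg.slogdet(gram)
    return float(value)


# Two toy selections of four 2D points each (assumed data).
tight = [[1.0, 1.0], [1.01, 1.0], [1.0, 1.01], [1.01, 1.01]]
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

print(logdet_score(tight), logdet_score(spread))
```

Under this assumed definition the spread-out selection scores higher than the tightly packed one, consistent with larger log-determinant values indicating greater diversity.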