theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0
22 stars 20 forks source link

[Distance module] Add Formulas to Docstrings and Add Tests #125

Open FarnazH opened 1 year ago

FarnazH commented 1 year ago

Finalize the functions in distance module by:

  1. Unifying docstrings and adding formulas (that match the implementation)
  2. Add tests for each function (to cover all lines) and tricky cases
  3. Polish code to make sure it is readable (e.g., add comments) and pythonic
    ### Tasks
    - [x] Move `sklearn_supported_metrics` from `utils` module to `distance` module
    - [x] Move tests related to distance module into `test_distance.py`
    - [x] Remove support for `sklearn.pairwise_distances` function as it only adds overhead to the code.
    - [x] Add formulas
    - [x] Add tests
### Tasks - Post PR#129
- [x] Test coverges: missing lines 184 and 189.
- [x] Fix math formula: https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L149
- [x] Having both `pairwise_similarity_bit` and `compute_distance_matrix` seems redundant considering the fact that we have the #123 module.
- [x] Remove empty lines at the end of the modules, like https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/test/test_distance.py#L99
- [x] Add tests to explicitly test `tanimoto` and `modified_tanimoto`
- [x] Implementation needs to be improved and further clarified: https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L223
- [x] Needs more clarification in docstring and comments of https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L143
FarnazH commented 1 year ago

I leave a note here, as this PR #137 is already merged. Before this issue can be closed:

FanwangM commented 1 year ago

One more thing we need to have is adding explanations for MaxMin and OptiSim as discussed in #119.

PaulWAyers commented 1 year ago

Yes, this is a diversity measure. I agree that it should be moved.

  • Shouldn't the nearest_average_tanimoto be moved to the diversity.py module?