theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.
https://selector.qcdevs.org
GNU General Public License v3.0
22 stars 22 forks

Fix Shannon entropy #177

Closed · FanwangM closed this 1 year ago

FanwangM commented 1 year ago

The original implementation was not correct; this PR fixes the problem reported in #67.

codecov[bot] commented 1 year ago

Codecov Report

Merging #177 (91d1431) into main (9c74867) will decrease coverage by 0.10%. Report is 7 commits behind head on main. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #177      +/-   ##
==========================================
- Coverage   85.49%   85.39%   -0.10%     
==========================================
  Files          13       13              
  Lines        1020     1020              
==========================================
- Hits          872      871       -1     
- Misses        148      149       +1     
| Files | Coverage Δ |
| --- | --- |
| DiverseSelector/diversity.py | 98.51% <100.00%> (-0.77%) :arrow_down: |
| DiverseSelector/tests/test_diversity.py | 100.00% <100.00%> (ø) |

... and 2 files with indirect coverage changes

FanwangM commented 1 year ago

The original use of Shannon entropy for molecule selection is described in section 2.1 of https://pubs.acs.org/doi/abs/10.1021/ci900159f, which gives the complete form of the Shannon entropy,

$$SE = \sum_i^m SE_i$$

where

$$SE_i = -p_i \log_2{p_i} - (1 - p_i)\log_2{(1 - p_i)}$$

and

$$p_i = \sum_j^m x_{ij}$$
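
For reference, here is a minimal NumPy sketch of this complete form. The function name is illustrative (not this package's API), and it assumes `x` is an `(n_molecules, n_bits)` 0/1 fingerprint matrix with $p_i$ taken as the fraction of molecules that have bit $i$ on, so that each $p_i$ stays in $[0, 1]$:

```python
import numpy as np


def complete_shannon_entropy(x):
    """Sketch of the complete Shannon entropy of a binary fingerprint matrix.

    Assumptions (not from the library): `x` is an (n_molecules, n_bits) 0/1
    array, and p_i is the fraction of molecules with bit i on (the per-bit
    sum normalized by the number of molecules).
    """
    x = np.asarray(x, dtype=float)
    n_molecules = x.shape[0]
    p = x.sum(axis=0) / n_molecules            # p_i for every bit i
    # 0*log2(0) terms are defined as 0; mask them to avoid warnings.
    se_bits = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    se_bits[mask] = -(
        p[mask] * np.log2(p[mask])
        + (1 - p[mask]) * np.log2(1 - p[mask])
    )
    return se_bits.sum()
```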

In Equation 2 of https://onlinelibrary.wiley.com/doi/10.1002/jcc.24423 (2016), there is an additional scaling factor that keeps the value in the range of 0 to 1. This paper also introduced the use of the Gini coefficient.

$$SE = - \frac{ \sum_i \frac{y_i}{N} \ln{\frac{y_i}{N}} }{ L \frac{\ln 2}{2} }$$

where $y_i$ is the $i$-th entry of the vector of bit counts (the vector is sorted in increasing order), $L$ is the length of the fingerprint, and $N$ is the number of molecules in the set.
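
A hedged sketch of this scaled form, under the same assumptions about the input matrix as above; the `L * ln(2) / 2` denominator mirrors the scaling factor in the equation, and the function name is again illustrative:

```python
import numpy as np


def scaled_shannon_entropy(x):
    """Sketch of the scaled Shannon entropy (Eq. 2 of the 2016 JCC paper).

    Assumptions: `x` is an (n_molecules, n_bits) 0/1 fingerprint array;
    y_i is the count of molecules with bit i on, N is the number of
    molecules, and L is the fingerprint length.
    """
    x = np.asarray(x, dtype=float)
    n_molecules, n_bits = x.shape              # N and L
    y = x.sum(axis=0)                          # bit counts y_i
    frac = y / n_molecules                     # y_i / N
    frac = frac[frac > 0]                      # drop 0*ln(0) terms
    return -np.sum(frac * np.log(frac)) / (n_bits * np.log(2) / 2)
```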

In a 2021 J Cheminform paper, https://doi.org/10.1186/s13321-021-00554-8, the definition was further simplified to

$$SE = \sum_i^n -p_i \log{p_i}$$

which only counts the on bits (positions with a 1 in the bit array).
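
And a sketch of this simplified 2021 form, which drops the $(1 - p_i)$ term and keeps only bits that are on in at least one molecule; the name and the toy usage below are illustrative, not the package's API:

```python
import numpy as np


def simplified_shannon_entropy(x):
    """Sketch of the simplified Shannon entropy (2021 J Cheminform form).

    Assumptions: `x` is an (n_molecules, n_bits) 0/1 fingerprint array and
    p_i is the fraction of molecules with bit i on; only on bits
    (p_i > 0) contribute to the sum.
    """
    x = np.asarray(x, dtype=float)
    p = x.sum(axis=0) / x.shape[0]             # p_i for every bit
    p = p[p > 0]                               # keep only on bits
    return -np.sum(p * np.log(p))


# Toy usage with random fingerprints (illustration only).
fps = np.random.randint(0, 2, size=(100, 1024))
print(simplified_shannon_entropy(fps))
```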

FanwangM commented 1 year ago

> @FanwangM, since this PR extends the scope of the `shannon_entropy` function, shouldn't we remove the `entropy` function then?

Nice catch. I will do it soon.

Update: Fixed in 6e736ec.