Closed FanwangM closed 1 year ago
Merging #177 (91d1431) into main (9c74867) will decrease coverage by
0.10%
. Report is 7 commits behind head on main. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #177 +/- ##
==========================================
- Coverage 85.49% 85.39% -0.10%
==========================================
Files 13 13
Lines 1020 1020
==========================================
- Hits 872 871 -1
- Misses 148 149 +1
Files | Coverage Δ | |
---|---|---|
DiverseSelector/diversity.py | 98.51% <100.00%> (-0.77%) |
:arrow_down: |
DiverseSelector/tests/test_diversity.py | 100.00% <100.00%> (ø) |
The original usage of using Shannon entropy for molecule selection can be found in section 2.1 of https://pubs.acs.org/doi/abs/10.1021/ci900159f where we have a complete form of Shannon entropy,
SE = \sum_i^m SE_i
where
SE_i = - p_i \log_2{p_i } - (1 - p_i)\log_2(1 - p_i)
and
p_i = \sum_j^m x_{ij}
In Equation 2 of https://onlinelibrary.wiley.com/doi/10.1002/jcc.24423 in 2016, there is a scaling factor difference which ensures the value in the range of 0 and 1. This paper also introduced the Gini coefficient.
SE = - \frac{ \sum \frac{y_i}{N} \ln {\frac{y_i}{N} } } {L \frac {\ln 2} {2}}
where $y_i$ is the vector of bit counts (this vector is sorted in increasing order), $L$ is the length of the fingerprint, and $N$ is the number of molecules in a set.
In a J Cheminform paper in 2021, https://doi.org/10.1186/s13321-021-00554-8. The definition was further simplified into
SE = \sum_i^n -p_i \log p_i
which only counts the on-bits (only 1 in the bit array).
The original implementation was not correct and this PR fixes the problem, #67.