rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
28 stars 12 forks source link

Add `stats` submodule with `compute_completeness()` function #118

Closed JBGreisman closed 2 years ago

JBGreisman commented 2 years ago

This PR adds an initial pass at rs.stats.compute_completeness(). This function takes a DataSet as input, and returns the completeness by resolution bins. It is written to work for both merged and unmerged DataSet objects. It also takes an anomalous flag which can be used to compute the anomalous completeness.


docstring:

In [1]: rs.stats.compute_completeness?
Signature: rs.stats.compute_completeness(dataset, bins=10, anomalous=False, dmin=None)
Docstring:
Compute completeness of DataSet by resolution bin.

Parameters
----------
dataset : rs.DataSet
    DataSet object to be analyzed
bins : int
    Number of resolution bins to use
anomalous : bool
    Whether to compute the anomalous completeness
dmin : float
    Resolution cutoff to use. If no dmin is supplied, the maximum resolution
    reflection will be used

Returns
-------
rs.DataSet
    DataSet object summarizing the completeness by resolution bin
File:      ~/Documents/Hekstra_Lab/github/reciprocalspaceship/reciprocalspaceship/stats/completeness.py
Type:      function

Example:

In [2]: rs.stats.compute_completeness(dataset, dmin=1.5)
Out[2]: 
              completeness
49.47 - 3.31      0.978490
3.31 - 2.60       1.000000
2.60 - 2.26       1.000000
2.26 - 2.05       0.981612
2.05 - 1.90       0.996479
1.90 - 1.79       0.996871
1.79 - 1.69       0.994523
1.69 - 1.62       0.993740
1.62 - 1.55       0.989437
1.55 - 1.50       0.992178
overall           0.992333

To do:

codecov-commenter commented 2 years ago

Codecov Report

Merging #118 (fca6e52) into main (ae0e2f5) will decrease coverage by 0.19%. The diff coverage is 96.12%.

:exclamation: Current head fca6e52 differs from pull request most recent head 1f1b3fc. Consider uploading reports for the commit 1f1b3fc to get more accurate results

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
- Coverage   98.67%   98.48%   -0.20%     
==========================================
  Files          41       43       +2     
  Lines        1590     1712     +122     
==========================================
+ Hits         1569     1686     +117     
- Misses         21       26       +5     
Flag Coverage Δ
unittests 98.48% <96.12%> (-0.20%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
reciprocalspaceship/dtypes/base.py 95.49% <50.00%> (-0.84%) :arrow_down:
reciprocalspaceship/stats/__init__.py 75.00% <75.00%> (ø)
reciprocalspaceship/utils/binning.py 90.62% <87.50%> (-9.38%) :arrow_down:
reciprocalspaceship/__init__.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/anomalousdifference.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/batch.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/hklindex.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/intensity.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/m_isym.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/mtzint.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ae0e2f5...1f1b3fc. Read the comment docs.

JBGreisman commented 2 years ago

For unmerged data, this function currently works whether the reflections are provided in or out of the reciprocal ASU.

For merged reflections, computing anomalous completeness has the added difficulty of how to handle "filler" values for unobserved reflections. Although programs often use NaNs for fully unobserved reflections (so dropna() can be used to remove those extras), phenix and aimless use different conventions for cases where only one copy of a Friedel pair was observed. Specifically, phenix fills them as NaNs and Aimless just puts 0.0 for all values. I can't quite think of a general way to handle this discrepancy at the moment to make this function "just work" in all cases.

JBGreisman commented 2 years ago

Just to follow up on this, I think that the best option here to resolve the discrepancies in "filler values" across programs will be to add an argument to the call signature called filler_value=np.nan or unobserved_value=np.nan.

This can then be used to allow the user to achieve the desired behavior. I'll also update the docstring to be a bit more explicit about the behavior with: 1) unmerged data 2) merged data; 1-col anomalous and 3) merged data; 2-col anomalous.

I also think I will expand this method to support all forms of completeness that are reported by aimless. Namely:

JBGreisman commented 2 years ago

This is ready to go in my mind:

> rs.mtzdump tests/data/algorithms/HEWL_SSAD_24IDC.mtz --embed

In [1]: mtz.sample(10)
Out[1]: 
          FreeR_flag     IMEAN  SIGIMEAN      I(+)    SIGI(+)      I(-)   SIGI(-)  N(+)  N(-)
H  K  L                                                                                      
33 1  6            5 11.798488 0.6114557 13.792131 0.85277236  9.688977  0.877203    48    40
44 10 4           12 90.181526  5.098182 90.181526   5.098182       0.0       0.0     8     0
25 16 9            6  164.7444 2.9361315  154.9134   4.022536 175.95505  4.295529    32    28
31 12 1           11  16.75307 0.4626905 14.979586 0.59099203 19.561523 0.7437062    38    34
30 6  9           13  92.40724 1.5158632 103.99409  2.2297392  82.44996 2.0670066    48    52
10 9  12           5  313.1947  3.920187 323.86127  5.6489615  303.2858 5.4446454    60    64
4  2  16           6 580.68256 10.024454  551.0918  14.140535  610.5782 14.213181    32    32
21 14 8           10 201.05835 2.4446104 221.15997  3.5505462 182.93996 3.3708513    60    64
35 5  8           19 268.03336 5.2958283 259.67307   7.117649  278.4012  7.926307    29    24
37 2  3           16 31.599096 0.6746577 30.608599 0.94752496 32.617615 0.9608345    60    60

In [2]: rs.stats.compute_completeness(mtz, anomalous=False,  unobserved_value=0)
Out[2]: 
              completeness
             non-anomalous
56.10 - 3.91      0.997613
3.91 - 3.06       1.000000
3.06 - 2.66       0.997613
2.66 - 2.41       1.000000
2.41 - 2.23       1.000000
2.23 - 2.09       0.997615
2.09 - 1.98       0.991304
1.98 - 1.89       0.994444
1.89 - 1.82       0.994449
1.82 - 1.70       0.529089
overall           0.915936

In [3]: rs.stats.compute_completeness(mtz, anomalous=True,  unobserved_value=0)
Out[3]: 
             completeness                        
                      all non-anomalous anomalous
56.10 - 3.91     0.998574      0.997613  1.000000
3.91 - 3.06      1.000000      1.000000  1.000000
3.06 - 2.66      0.998252      0.997613  0.999030
2.66 - 2.41      0.999136      1.000000  0.999057
2.41 - 2.23      1.000000      1.000000  1.000000
2.23 - 2.09      0.996159      0.997615  0.996313
2.09 - 1.98      0.992803      0.991304  0.994531
1.98 - 1.89      0.994510      0.994444  0.995036
1.89 - 1.82      0.994503      0.994449  0.995471
1.82 - 1.70      0.492174      0.529089  0.500952
overall          0.907413      0.915936  0.906873