kmdalton commented 3 years ago

I often find myself using external tools to compute completeness, but this is something we could easily implement within rs. I imagine there are other stats we might want to have baked in as well. I propose we add a compute_completeness function with an optional bins argument. I would be in favor of adding an rs.stats namespace for it to live in.

How does this plan sit with you, @JBGreisman ?

JBGreisman commented 3 years ago

yes -- I think this is a good idea. I think adding a stats subdirectory seems like a good place for such functions to live.

DHekstra commented 3 years ago

I would mark this as low-priority, FWIW. I did do a simple implementation previously (https://github.com/Hekstra-Lab/double-wilson/blob/main/old/DoubleWIlsonExplorations.ipynb), but it's kind of slow.

kmdalton commented 3 years ago

@DHekstra , I think I need this for processing the GFP data.

wojdyr commented 3 years ago

Gemmi will have a function to calculate completeness at some point, but I don't know when.

JBGreisman commented 3 years ago

Sounds good. Multiplicity and completeness would be useful stats to be able to compute, and could live in rs.stats. Computing those statistics using the groupby functionality in pandas will also make for a rather convenient way to summarize over resolution bins.

kmdalton commented 3 years ago

I will go ahead and implement something to get us started.

kmdalton commented 3 years ago

60 introduces low level tools to `rs.utils` which can be used to address this issue. However, we still need to wrap the tools I added to become part of the top-level API or with a convenient function in the `rs.stats` module.

JBGreisman commented 2 years ago

Here is a snippet that uses the stats function introduced in #60 and packages the results as a DataSet indexed by resolution bin. I'm just putting this here as a placeholder snippet for when I get around to implementing this:

h, counts = rs.utils.compute_redundancy(mtz.get_hkls(), mtz.cell, mtz.spacegroup)
ds = rs.DataSet({"n": counts, "H": h[:, 0], "K": h[:, 1], "L": h[:, 2]}, spacegroup=mtz.spacegroup, cell=mtz.cell)
ds.set_index(["H", "K", "L"], inplace=True)
ds, labels = ds.assign_resolution_bins(10)
ds["observed"] = ds["n"] > 0
result = ds.groupby("bin")["observed"].agg(["sum", "count"])
result["completeness"] = result["sum"] / result["count"]
result.index = labels
result["completeness"]

The resulting DataSet looks like this (for an example dataset that is admittedly low-completeness):

99.53 - 3.70    0.736536
3.70 - 2.93     0.779202
2.93 - 2.55     0.797380
2.55 - 2.31     0.807513
2.31 - 2.15     0.820326
2.15 - 2.02     0.825619
2.02 - 1.92     0.835808
1.92 - 1.83     0.836003
1.83 - 1.76     0.838672
1.76 - 1.70     0.839348
Name: completeness, dtype: float64

rs-station / reciprocalspaceship

Add a function to compute completeness #59

60 introduces low level tools to `rs.utils` which can be used to address this issue. However, we still need to wrap the tools I added to become part of the top-level API or with a convenient function in the `rs.stats` module.

rs-station / reciprocalspaceship

Add a function to compute completeness #59

60 introduces low level tools to rs.utils which can be used to address this issue. However, we still need to wrap the tools I added to become part of the top-level API or with a convenient function in the rs.stats module.

60 introduces low level tools to `rs.utils` which can be used to address this issue. However, we still need to wrap the tools I added to become part of the top-level API or with a convenient function in the `rs.stats` module.