Closed kmdalton closed 2 years ago
yes -- I think this is a good idea. I think adding a stats
subdirectory seems like a good place for such functions to live.
I would mark this as low-priority, FWIW. I did do a simple implementation previously (https://github.com/Hekstra-Lab/double-wilson/blob/main/old/DoubleWIlsonExplorations.ipynb), but it's kind of slow.
@DHekstra , I think I need this for processing the GFP data.
Gemmi will have a function to calculate completeness at some point, but I don't know when.
Sounds good. Multiplicity and completeness would be useful stats to be able to compute, and could live in rs.stats
. Computing those statistics using the groupby
functionality in pandas
will also make for a rather convenient way to summarize over resolution bins.
I will go ahead and implement something to get us started.
rs.utils
which can be used to address this issue. However, we still need to wrap the tools I added to become part of the top-level API or with a convenient function in the rs.stats
module.Here is a snippet that uses the stats function introduced in #60 and packages the results as a DataSet indexed by resolution bin. I'm just putting this here as a placeholder snippet for when I get around to implementing this:
h, counts = rs.utils.compute_redundancy(mtz.get_hkls(), mtz.cell, mtz.spacegroup)
ds = rs.DataSet({"n": counts, "H": h[:, 0], "K": h[:, 1], "L": h[:, 2]}, spacegroup=mtz.spacegroup, cell=mtz.cell)
ds.set_index(["H", "K", "L"], inplace=True)
ds, labels = ds.assign_resolution_bins(10)
ds["observed"] = ds["n"] > 0
result = ds.groupby("bin")["observed"].agg(["sum", "count"])
result["completeness"] = result["sum"] / result["count"]
result.index = labels
result["completeness"]
The resulting DataSet looks like this (for an example dataset that is admittedly low-completeness):
99.53 - 3.70 0.736536
3.70 - 2.93 0.779202
2.93 - 2.55 0.797380
2.55 - 2.31 0.807513
2.31 - 2.15 0.820326
2.15 - 2.02 0.825619
2.02 - 1.92 0.835808
1.92 - 1.83 0.836003
1.83 - 1.76 0.838672
1.76 - 1.70 0.839348
Name: completeness, dtype: float64
I often find myself using external tools to compute completeness, but this is something we could easily implement within
rs
. I imagine there are other stats we might want to have baked in as well. I propose we add acompute_completeness
function with an optionalbins
argument. I would be in favor of adding anrs.stats
namespace for it to live in.How does this plan sit with you, @JBGreisman ?