Closed DHekstra closed 1 year ago
This would be helpful for CCanom calculation on Careless xval#.mtz files using rsbooster.stats.ccanom.analyze_ccanom_mtz() directly. It is obviously not essential, since I could just use the make_halves_ccanom function in the same location first, but it seems undesirable for the default behavior to be to randomly pair repeats.
i have a feeling this is better handled at the top level by the user with a ds.groupby("repeat").apply(lambda x: x.unstack_anomalous(...))
or some similar formula. perhaps share the code you're working on and we can recommend a strategy.
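To illustrate the groupby idea without depending on rs itself, here is a rough plain-Python sketch. The toy rows, `split_by_repeat`, `apply_per_repeat`, and `to_lookup` are all hypothetical names made up for this example; `to_lookup` is only a stand-in for a per-group operation (such as unstack_anomalous) that needs each Miller index to appear once.

```python
from collections import defaultdict

# Toy rows: (H, K, L, repeat, value). Both repeats contain the same Miller
# indices, so the concatenated table has redundant indices overall.
rows = [
    (1, 0, 0, 0, 10.0),
    (0, 1, 0, 0, 20.0),
    (1, 0, 0, 1, 11.0),
    (0, 1, 0, 1, 21.0),
]


def split_by_repeat(rows):
    """Group rows by their repeat label (the analogue of groupby("repeat"))."""
    groups = defaultdict(list)
    for h, k, l, repeat, value in rows:
        groups[repeat].append(((h, k, l), value))
    return dict(groups)


def to_lookup(group):
    """Hypothetical per-group operation that requires unique Miller indices."""
    lookup = dict(group)
    if len(lookup) != len(group):
        raise ValueError("redundant Miller indices within group")
    return lookup


def apply_per_repeat(rows, func):
    """Apply func to each repeat separately (the analogue of .apply(...))."""
    return {repeat: func(group) for repeat, group in split_by_repeat(rows).items()}


per_repeat = apply_per_repeat(rows, to_lookup)
```

Within each repeat the indices are unique, so the per-group operation succeeds even though the same call on all rows at once would see every index twice.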
the workaround in this case is quite simple--the problem only occurs if I stack and then unstack--then the repeats no longer match up. So the workaround is not to stack in the first place.
the stack / unstack_anomalous methods are not meant to be called on unmerged data. in this case, you may have several merged datasets which are concatenated inside one object. this means the miller indices have repeated values. hence, semantically they are the same as unmerged data. these methods should not be called in this situation, it may lead to a number of unpleasant side effects. the proper way to handle this is to scope each of the method calls within a groupby operator so that the miller indices are not redundant in each set.
for this reason, both methods raise a ValueError when called on unmerged data. the issue is that this is contingent on checking the dataset.merged attribute which is, in this case, set to True, which might have been a bad design choice in careless. on the rs end, we should probably directly test for redundant miller indices rather than rely on dataset.merged. I wonder what @JBGreisman thinks.
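A minimal sketch of what "directly test for redundant miller indices" could look like, written in plain Python so it does not assume anything about rs internals. The function names `find_redundant_hkls` and `require_unique_hkls` are hypothetical; on a pandas-backed object the equivalent check would presumably be something along the lines of `index.duplicated().any()`.

```python
from collections import Counter


def find_redundant_hkls(hkls):
    """Return the sorted Miller indices that occur more than once."""
    counts = Counter(tuple(hkl) for hkl in hkls)
    return sorted(hkl for hkl, n in counts.items() if n > 1)


def require_unique_hkls(hkls):
    """Raise ValueError if any Miller index is repeated, regardless of a 'merged' flag."""
    redundant = find_redundant_hkls(hkls)
    if redundant:
        raise ValueError(
            f"{len(redundant)} Miller indices occur more than once, e.g. {redundant[0]}"
        )


# A single merged dataset has unique indices ...
merged_like = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
# ... but concatenating two such datasets (e.g. two repeats) does not.
concatenated = merged_like + merged_like
```

The point of the design is that the guard looks at the data itself, so a concatenated set of merged repeats is rejected even when it is flagged as merged.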
Good point. I'm going the groupby route for now, going off what is implemented in rsbooster.stats.anom.py.
my initial take was that the "repeat" was a sufficiently common case (for me, anyways) that we might want to implement a provision for it.
yeah i think this use case is going to be pretty unusual outside the hekstra lab. i don't think any other program besides careless makes this sort of data structure. this is maybe an indicator that careless shouldn't make these sorts of files :thinking:
My personal feeling is that rs should be kept agnostic to careless decisions regarding column naming. More broadly, I try to avoid any hard-coded column names, because I think that's a recipe for problematic corner cases and "user surprise." There are certainly exceptions for H, K, and L and some internal column names, but generally I'd like to avoid hardcoding a case for repeat.
As @kmdalton said, in my mind, the xval mtz that gets output should be considered "unmerged" because it has repeat Miller indices. To my knowledge, CCanom can be computed from careless output without any use of {stack/unstack}_anomalous -- there is a commandline tool rs.ccanom provided in rsbooster that should be applicable to careless xval output without any modification needed. If you're running into cases of careless output that are unsupported by it, please let me know or file a ticket on rsbooster. rs.ccanom -h can be used to see the arguments it takes. Its internals can also be used as a framework for implementing your own function if that is more useful.
Let me know if there's anything else I can provide as far as useful snippets/templates
OK, after some consideration, I agree with all of these design decisions. I agree that the commandline application already supports this case well. Part of the issue came from me wanting to call a function in rsbooster.stats.ccanom directly from a notebook (which works fine as long as one is aware of the above). My goal is to provide some better support for Careless users who are trying to understand how to interpret the Careless output.
For future reference, this is a productive work-around:
half_repeats=[]
for repeat in out.repeat.unique():
    for half in range(2):
        half_repeat=tmp.loc[(tmp.repeat==repeat) & (tmp.half==half),["F","SigF","I","SigI","N","high$
        half_repeat=half_repeat.unstack_anomalous()
        half_repeat["half"]=half
        half_repeat["half"]=half_repeat["half"].astype('MTZInt')
        half_repeat["repeat"]=repeat
        half_repeat["repeat"]=half_repeat["repeat"].astype('MTZInt')
        half_repeats.append(half_repeat)
out2=rs.concat(half_repeats)
At https://github.com/Hekstra-Lab/reciprocalspaceship/blob/98409cc8552d22d03d6100c6c1dd01cc369e9647/reciprocalspaceship/dataset.py#L1144
I would like to propose replacing
with something like