Unstack anomalous taking into account Careless repeats

DHekstra commented 1 year ago

At https://github.com/Hekstra-Lab/reciprocalspaceship/blob/98409cc8552d22d03d6100c6c1dd01cc369e9647/reciprocalspaceship/dataset.py#L1144

I would like to propose replacing

result = dataset_plus.merge(
            dataset_minus, how="outer", on=["H", "K", "L"], suffixes=suffixes
        )

with something like

    if "repeat" in dataset_plus.columns:
        result = dataset_plus.merge(
                dataset_minus, how="outer", on=["H", "K", "L","repeat"], suffixes=suffixes
        )
    else:
        result = dataset_plus.merge(
                dataset_minus, how="outer", on=["H", "K", "L"], suffixes=suffixes
        )

DHekstra commented 1 year ago

Ihis would be helpful for CCanom calculation on Careless xval#.mtz files using rsbooster.stats.ccanom.analyze_ccanom_mtz() directly. It is obviously not essential, since I could just use the make_halves_ccanom function in the same location first, but it seems undesirable for the default behavior to be to randomly pair repeats.

kmdalton commented 1 year ago

i have a feeling this is better handled at the top level by the user with a ds.groupby("repeat").apply(lambda x: x.unstack_anomalous(...)) or some similar formula. perhaps share the code you're working on and we can recommend a strategy.

DHekstra commented 1 year ago

the workaround in this case is quite simple--the problem only occurs if I stack and then unstack--then the repeats no longer match up. So the workaround is not to stack in the first place.

kmdalton commented 1 year ago

the stack / unstack_anomalous methods are not meant to be called on unmerged data. in this case, you may have several merged datasets which are concatenated inside one object. this means the miller indices have repeated values. hence, semantically they are the same as unmerged data. these methods should not be called in this situation, it may lead to a number of unpleasant side effects. the proper way to handle this is to scope each of the method calls within a groupby operator so that the miller indices are not redundant in each set.

for this reason, both methods raise a ValueError when called on unmerged data. the issue is that this is contingent on checking the dataset.merged attribute which is, in this case set to True which might have been a bad design choice in careless. on the rs end, we should probably directly test for redundant miller indices rather than rely on dataset.merged. I wonder what @JBGreisman thinks.

DHekstra commented 1 year ago

Good point. I'm going the groupby route for now, going off what is implemented in rsbooster.stats.anom.py.

DHekstra commented 1 year ago

my initial take was that the "repeat" was a sufficiently common case (for me, anyways) that we might want to implement a provision for it.

kmdalton commented 1 year ago

yeah i think this use case is going to be pretty unusual outside the hekstra lab. i don't think any other program besides careless makes this sort of data structure. this is maybe an indicator that careless shouldn't make these sorts of files :thinking:

JBGreisman commented 1 year ago

My personal feeling is that rs should be kept agnostic to careless decisions regarding column naming. More broadly, I try to avoid any hard-coded column names, because I think that's a recipe for problematic corner cases and "user surprise." There are certainly exceptions for H, K, and L and some internal column names, but generally I'd like to avoid hardcoding a case for repeat.

As @kmdalton said, in my mind, the xval mtz that gets output should be considered "unmerged" because it has repeat Miller indices. To my knowledge, CCanom can be computed from careless output without any use of {stack/unstack}_anomalous -- there is a commandline tool rs.ccanom provided in rsbooster that should be applicable to careless xval output without any modification needed. If you're running into cases of careless output that are unsupported by it, please let me know or file a ticket on rsbooster. rs.ccanom -h can be used to see the arguments it takes. Its internals can also be used as a framework for implementing your own function if that is more useful.

Let me know if there's anything else I can provide as far as useful snippets/templates

DHekstra commented 1 year ago

OK, after some consideration, I agree with all of these design decisions. I agree that the commandline application supports this case already well. Part of the issue came from me wanting to call a function in rsbooster.stats.ccanom directly from in a notebook (which works fine as long as one is aware of the above). My goal is to provide some better support for Careless users who are trying to understand how to interpret the Careless output.

DHekstra commented 8 months ago

For future reference, this is a productive work-around:

half_repeats=[]
for repeat in out.repeat.unique():
    for half in range(2):
        half_repeat=tmp.loc[(tmp.repeat==repeat) & (tmp.half==half),["F","SigF","I","SigI","N","high$
        half_repeat=half_repeat.unstack_anomalous()
        half_repeat["half"]=half
        half_repeat["half"]=half_repeat["half"].astype('MTZInt')
        half_repeat["repeat"]=repeat
        half_repeat["repeat"]=half_repeat["repeat"].astype('MTZInt')
        half_repeats.append(half_repeat)

out2=rs.concat(half_repeats)

rs-station / reciprocalspaceship

Unstack anomalous taking into account Careless repeats #193