rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
28 stars 12 forks source link

stack/unstack_anomalous roundtrip does not work #97

Closed kmdalton closed 2 years ago

kmdalton commented 2 years ago

Calling DataSet.unstack_anomalous followed by DataSet.stack_anomalous does not always work in rs version 0.9.15. The following code verifies that this fails sometimes and succeeds others.

import gemmi
import reciprocalspaceship as rs
import numpy as np

cell = gemmi.UnitCell(10., 20., 30., 90., 90., 90.)
sg = gemmi.SpaceGroup(19)
dmin = 2.

h,k,l = rs.utils.generate_reciprocal_asu(cell, sg, dmin, anomalous=True).T
n = len(h)

ds = rs.DataSet({
        'H' : h,
        'K' : k,
        'L' : l,
        'F' : np.random.random(n),
        'loc' : np.random.random(n),
        'scale' : np.random.random(n),
    }, 
    spacegroup=sg, 
    cell=cell, 
    merged=True,
).infer_mtz_dtypes().set_index(['H', 'K', 'L'])

assert all(ds.keys() == ['F', 'loc', 'scale'])

unstacked = ds.unstack_anomalous()
print(unstacked.keys())
unstacked.stack_anomalous()

I find that the order of columns in unstacked is not always the same, despite consistent column ordering in ds. When the column order is Index(['F(+)', 'loc(+)', 'scale(+)', 'F(-)', 'loc(-)', 'scale(-)'], dtype='object'), the script succeeds. When the column order is Index(['F(+)', 'loc(+)', 'scale(+)', 'scale(-)', 'loc(-)', 'F(-)'], dtype='object'), it fails with the following traceback:

Traceback (most recent call last):                                                                                                                             
  File "stack_bug.py", line 31, in <module>                                                                                                                    
    unstacked.stack_anomalous()                                                                                                                                
  File ".../anaconda/envs/careless/lib/python3.8/site-packages/reciprocalspaceship/dataset.py", line 911, in stack_anomalous                    
    raise ValueError(                                                                                                                                          
ValueError: Corresponding labels in ['F(+)', 'loc(+)', 'scale(+)'] and ['scale(-)', 'loc(-)', 'F(-)'] are not the same dtype: FriedelSFAmplitude and MTZReal   

I don't know where this stochasticity is coming from, but it is probably somewhere in DataSet.stack_anomalous. My guess would be it has something to do with the (non?)determinism of pd.DataFrame.merge. I don't see any obvious place where the column order could be getting scrambled.

kmdalton commented 2 years ago

On second thought, DataFrame.merge(how='outer') is supposed to sort keys lexicographically. So, that is hopefully not the culprit.

JBGreisman commented 2 years ago

What version of pandas were you using here? I can reproduce this bug, but I do not have any stochasticity on my system (using the latest pandas v1.3.3).

JBGreisman commented 2 years ago

In my mind, this is a bug in unstack_anomalous(), not stack_anomalous(). Your snippet can be made to work for me with the following change to the last line:

plus_labels  = ["F(+)", "loc(+)", "scale(+)"]
minus_labels = ["F(-)", "loc(-)", "scale(-)"]
unstacked.stack_anomalous(plus_labels=plus_labels, minus_labels=minus_labels)

to explicitly provide the columns in the expected order. As per the docstring, stack_anomalous() expects the corresponding column labels to be given in the same order. However, unstack_anomalous() should be made to always output the column labels in a way that would be compatible with stack_anomalous()

kmdalton commented 2 years ago

My pandas version:

[ins] In [1]: import pandas as pd    

[ins] In [2]: pd.__version__         
Out[2]: '1.3.2'                      

I agree that the bug is in unstack_anomalous. Sorry if that wasn't clear.