rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
28 stars 12 forks source link

DataSet.stack_anomalous Fails for duplicate column names #126

Closed kmdalton closed 2 years ago

kmdalton commented 2 years ago

If Friedel columns share the same base name with non-Friedel columns, stack_anomalous will fail with a cryptic error message.

Here is a minimal example

import reciprocalspaceship as rs
import numpy as np

dmin = 2.
cell = [10., 20., 30., 90., 90., 90.]
sg = 19

h,k,l = rs.utils.generate_reciprocal_asu(cell, sg, dmin, anomalous=False).T

ds = rs.DataSet({
        'H' : h,
        'K' : k,
        'L' : l,
        'I' : np.ones(len(h)),
        'I(+)' : np.ones(len(h)),
        'I(-)' : np.ones(len(h)),
    }, 
    merged=True, 
    cell=cell, 
    spacegroup=sg
).infer_mtz_dtypes().set_index(['H', 'K', 'L'])

print(ds)
print(ds.dtypes)

ds.stack_anomalous()

which outputs

user@computer:~$ python bug.py
I              Intensity
I(+)    FriedelIntensity
I(-)    FriedelIntensity
dtype: object
Traceback (most recent call last):
  File "bug.py", line 26, in <module>
    ds.stack_anomalous()
  File "/home/kmdalton/opt/reciprocalspaceship/reciprocalspaceship/dataset.py", line 907, in stack_anomalous
    F[label] = F[label].from_friedel_dtype()
  File "/home/kmdalton/opt/anaconda/envs/careless/lib/python3.8/site-packages/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataSet' object has no attribute 'from_friedel_dtype'
JBGreisman commented 2 years ago

This occurs because DataSet.stack_anomalous() removes the suffixes from anomalous columns during this method. It then eventually calls ds["I"], which would ordinarily return a DataSeries which has a from_friedel_dtype() method. However, since there are now two columns with the label "I", this call now returns a DataSet.

I'm considering this a bug, because it is certainly unexpected behavior. To resolve it, there are a few possible options in my mind:

  1. This behavior could be fixed by changing the column names later in the method. Although this would resolve this error, there would still be columns with the same name in the resulting DataSet. This will likely cause downstream problems.
  2. One could raise a ValueError when this case is detected. This can then give a more sensible error message like "This operation will result in two or more columns with the same name"
  3. We could do (2), but also add a only_keep_friedels=True flag that removes any non-Friedel columns with this operation. In the above example, this would drop the original "I" column, and output a DataSet with an "I" column generated from "I(+)" and "I(-)".
dennisbrookner commented 2 years ago

In the case of solution 2 (or perhaps 3 as well), I'd be happy to tackle this; it's similar to the change I made a little while back, and is still pretty fresh on my mind.

JBGreisman commented 2 years ago

Awesome -- sounds good! Kevin's example will also make a good template for writing a test. Reach out if you run into trouble.

I'm inclined to suggest starting with (2), because (3) can currently be implemented by the user by dropping desired columns.