RS requires set_index(["H","K","L"]) to create a multi_index using rs.DataSet

DHekstra commented 5 months ago

Problem: Derek was trying to construct an rs.DataSet object from NumPy arrays (h, k, l, wave, etc.) for unmerged data. The following,

 ds = rs.DataSet({"H":h, "K":k, "L":l, "WAVE": wave, "dH":dh, "dK":dk, "dL":dl,
               "NPIX": npix_col, "BATCH": batch, "SigI": sigI_vals, "I": I_vals, "X":x, "Y":y},
               cell=args.ucell, spacegroup=sg_num, merged=False)
    ds["PARTIAL"] = True
    ds.write_mtz(args.mtz)

this throws the following error.

In [1]: ds.write_mtz("wow.mtz")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 1
----> 1 ds.write_mtz("wow.mtz")

File /global/cfs/cdirs/lcls/dermen/postcori/xtal/conda_base/lib/python3.8/site-packages/reciprocalspaceship/dataset.py:612, in DataSet.write_mtz(self, mtzfile, skip_problem_mtztypes, project_name, crystal_name, dataset_name)
    586 """
    587 Write DataSet to MTZ file.
    588 
   (...)
    608     Dataset name to assign to MTZ file
    609 """
    610 from reciprocalspaceship import io
--> 612 return io.write_mtz(
    613     self,
    614     mtzfile,
    615     skip_problem_mtztypes,
    616     project_name,
    617     crystal_name,
    618     dataset_name,
    619 )

File /global/cfs/cdirs/lcls/dermen/postcori/xtal/conda_base/lib/python3.8/site-packages/reciprocalspaceship/io/mtz.py:225, in write_mtz(dataset, mtzfile, skip_problem_mtztypes, project_name, crystal_name, dataset_name)
    191 def write_mtz(
    192     dataset,
    193     mtzfile,
   (...)
    197     dataset_name,
    198 ):
    199     """
    200     Write an MTZ reflection file from the reflection data in a DataSet.
    201 
   (...)
    223         Dataset name to assign to MTZ file
    224     """
--> 225     mtz = to_gemmi(
    226         dataset, skip_problem_mtztypes, project_name, crystal_name, dataset_name
    227     )
    228     mtz.write_to_file(mtzfile)
    229     return

File /global/cfs/cdirs/lcls/dermen/postcori/xtal/conda_base/lib/python3.8/site-packages/reciprocalspaceship/io/mtz.py:151, in to_gemmi(dataset, skip_problem_mtztypes, project_name, crystal_name, dataset_name)
    149         continue
    150     else:
--> 151         raise ValueError(
    152             f"column {c} of type {cseries.dtype} cannot be written to an MTZ file. "
    153             f"To skip columns without explicit MTZ dtypes, set skip_problem_mtztypes=True"
    154         )
    155 mtz.set_data(temp[columns].to_numpy(dtype="float32"))
    157 # Handle Unmerged data

ValueError: column index of type int64 cannot be written to an MTZ file. To skip columns without explicit MTZ dtypes, set skip_problem_mtztypes=True

Changing the last line by adding infer_mtz_datatypes().set_index(["H","K","L"], drop=True) solves the problem.

ds.infer_mtz_dtypes().set_index(["H","K","L"], drop=True).write_mtz(args.mtz)

I'll leave it to @dermen and @kmdalton to comment further.

dermen commented 5 months ago

Reproducer:

import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False)
ds.infer_mtz_dtypes().write_mtz("blargh.mtz")

Works:

import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False)
ds.infer_mtz_dtypes().set_index(["H", "K", "L"], drop=True).write_mtz("blargh.mtz")

Doesnt work:

import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False, index=["H", "K", "L"])
ds.infer_mtz_dtypes().write_mtz("blargh.mtz")

JBGreisman commented 5 months ago

@dermen -- thanks for providing a minimal example to work with!

This bug originates from here: https://github.com/rs-station/reciprocalspaceship/blob/2ef8ae44b3d3d456cf6ba2efdcd751c65c04330c/reciprocalspaceship/io/mtz.py#L135-L136

Even if the calling DataSet has proper dtypes set, if it has a non-MTZdtype in the index (such as a RangeIndex in the example), dataset.reset_index() ends up producing a column with a non-MTZdtype (here, int64):

In [15]: ds
Out[15]: 
   H  K  L     I
0  0  3  6   9.0
1  1  4  7  10.0
2  2  5  8  11.0

In [16]: ds.dtypes
Out[16]: 
H          HKL
K          HKL
L          HKL
I    Intensity
dtype: object

In [17]: temp = ds.reset_index()

In [18]: temp
Out[18]: 
   index  H  K  L     I
0      0  0  3  6   9.0
1      1  1  4  7  10.0
2      2  2  5  8  11.0

In [19]: temp.dtypes
Out[19]: 
index        int64
H              HKL
K              HKL
L              HKL
I        Intensity
dtype: object

To fix this, we need to make sure that io.write_mtz() (and io.to_gemmi()) support range-indexed (DataSets).

JBGreisman commented 5 months ago

I updated the title because this is not specific to unmerged DataSet objects

DHekstra commented 5 months ago

are you proposing something like

if type(ds.index) == pd.RangeIndex:
    % handle this case

JBGreisman commented 5 months ago

I'm sure there would be a way to make something like that work.

However, what I think will be easier would be to decorate the DataSet.write_mtz() and DataSet.to_gemmi() methods with the @range_indexed decorator. That decorator was written to avoid these sorts of issues so that functions can just assume the calling dataset is "range indexed". If we use the decorator, we should be able to just remove the reset_index() call. I think this should be fairly straightforward to implement, but I probably won't have time until later in the week/weekend.

rs-station / reciprocalspaceship

RS requires set_index(["H","K","L"]) to create a multi_index using rs.DataSet #255