DHekstra closed this issue 3 months ago
Reproducer:
import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False)
ds.infer_mtz_dtypes().write_mtz("blargh.mtz")
Works:
import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False)
ds.infer_mtz_dtypes().set_index(["H", "K", "L"], drop=True).write_mtz("blargh.mtz")
Doesn't work:
import reciprocalspaceship as rs
ds = rs.DataSet({"H":[0,1,2], "K":[3,4,5], "L":[6,7,8], "I": [9,10,11]}, cell=[79,79,38,90,90,90], spacegroup=96, merged=False, index=["H", "K", "L"])
ds.infer_mtz_dtypes().write_mtz("blargh.mtz")
@dermen -- thanks for providing a minimal example to work with!
This bug originates from here: https://github.com/rs-station/reciprocalspaceship/blob/2ef8ae44b3d3d456cf6ba2efdcd751c65c04330c/reciprocalspaceship/io/mtz.py#L135-L136
Even if the calling DataSet has proper dtypes set, if it has a non-MTZdtype in the index (such as a RangeIndex in the example), dataset.reset_index() ends up producing a column with a non-MTZdtype (here, int64):
In [15]: ds
Out[15]:
   H  K  L     I
0  0  3  6   9.0
1  1  4  7  10.0
2  2  5  8  11.0
In [16]: ds.dtypes
Out[16]:
H          HKL
K          HKL
L          HKL
I    Intensity
dtype: object
In [17]: temp = ds.reset_index()
In [18]: temp
Out[18]:
   index  H  K  L     I
0      0  0  3  6   9.0
1      1  1  4  7  10.0
2      2  2  5  8  11.0
In [19]: temp.dtypes
Out[19]:
index        int64
H              HKL
K              HKL
L              HKL
I        Intensity
dtype: object
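The same behavior can be seen without reciprocalspaceship. The sketch below uses a plain pandas DataFrame as a stand-in for the DataSet (so there are no MTZ dtypes involved, and the variable names are only illustrative) to show that reset_index() on a default RangeIndex adds an extra int64 column, while a DataFrame indexed by H, K, L gets back only its original columns:

```python
import pandas as pd

# Plain-pandas stand-in for the DataSet in the reproducer (no MTZ dtypes here).
df = pd.DataFrame(
    {"H": [0, 1, 2], "K": [3, 4, 5], "L": [6, 7, 8], "I": [9.0, 10.0, 11.0]}
)

# Default RangeIndex: reset_index() promotes it to a new column named "index".
flat = df.reset_index()
print(list(flat.columns))
print(flat["index"].dtype)

# With H, K, L as the index, reset_index() only restores the original columns.
restored = df.set_index(["H", "K", "L"]).reset_index()
print(list(restored.columns))
```

The extra "index" column in the first case is the one that has no MTZ dtype when the frame is a DataSet.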
To fix this, we need to make sure that io.write_mtz() (and io.to_gemmi()) support range-indexed DataSets.
I updated the title because this is not specific to unmerged DataSet objects.
Are you proposing something like
if isinstance(ds.index, pd.RangeIndex):
    # handle this case
I'm sure there would be a way to make something like that work.
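For reference, a minimal sketch of what such an explicit check could look like, written against plain pandas. The helper name flatten_for_writing is hypothetical and is not part of reciprocalspaceship; it only illustrates dropping a RangeIndex instead of promoting it to a column:

```python
import pandas as pd


def flatten_for_writing(df):
    """Hypothetical helper, not part of reciprocalspaceship."""
    if isinstance(df.index, pd.RangeIndex):
        # A default RangeIndex carries no data, so discard it instead of
        # promoting it to an int64 "index" column.
        return df.reset_index(drop=True)
    # Any other index (e.g. H, K, L) is moved back into columns.
    return df.reset_index()


df = pd.DataFrame(
    {"H": [0, 1, 2], "K": [3, 4, 5], "L": [6, 7, 8], "I": [9.0, 10.0, 11.0]}
)
print(list(flatten_for_writing(df).columns))
print(list(flatten_for_writing(df.set_index(["H", "K", "L"])).columns))
```

Both calls should yield the same four columns, which is the property the writer needs.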
However, I think it will be easier to decorate the DataSet.write_mtz() and DataSet.to_gemmi() methods with the @range_indexed decorator. That decorator was written to avoid these sorts of issues so that functions can just assume the calling dataset is "range indexed". If we use the decorator, we should be able to just remove the reset_index() call. I think this should be fairly straightforward to implement, but I probably won't have time until later in the week/weekend.
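To illustrate the decorator idea, here is a rough sketch in plain pandas. It is not the library's @range_indexed implementation; range_indexed_sketch and column_dtypes are hypothetical names, and the sketch assumes any non-range index has named levels:

```python
import functools

import pandas as pd


def range_indexed_sketch(func):
    """Hypothetical decorator: hand `func` a frame with a default RangeIndex."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        if not isinstance(df.index, pd.RangeIndex):
            # Move named index levels (e.g. H, K, L) back into columns.
            df = df.reset_index()
        return func(df, *args, **kwargs)

    return wrapper


@range_indexed_sketch
def column_dtypes(df):
    # Stand-in for a writer: it can assume all data lives in columns,
    # so it never needs to call reset_index() itself.
    return {name: str(dtype) for name, dtype in df.dtypes.items()}


df = pd.DataFrame(
    {"H": [0, 1, 2], "K": [3, 4, 5], "L": [6, 7, 8], "I": [9.0, 10.0, 11.0]}
)
print(column_dtypes(df))
print(column_dtypes(df.set_index(["H", "K", "L"])))
```

With this pattern the wrapped function sees the same columns whether or not the caller set an index, so no stray "index" column reaches the writer.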
Problem: Derek was trying to construct an rs.DataSet object from NumPy arrays (h, k, l, wave, etc.) for unmerged data. The following throws an error.
Changing the last line by adding infer_mtz_dtypes().set_index(["H","K","L"], drop=True) solves the problem. I'll leave it to @dermen and @kmdalton to comment further.