rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata
MIT License
16 stars 6 forks source link

KeyError -- Index not found #145

Closed DHekstra closed 9 months ago

DHekstra commented 9 months ago

For lysozyme anomalous data, we are performing a scan retaining the first 90, 180, 360, etc. frames and running Careless. We are taking the unmerged intensities, retaining only the reflections for the first N frames, and the friedelizing these into "plus" and "minus" reflections.

Here are example rs.mtzdump results:

(laue-dials)[dhekstra@holy7c02409 effects_of_frame_numbers]$ rs.mtzdump integrated_NaI_3_04_frame_1_090_minus.mtz
Spacegroup: P43212
Extended Hermann-Mauguin name: P 43 21 2
Unit cell dimensions: 78.312 78.312 37.234 90.000 90.000 90.000

mtz.head():

          BATCH       I    SIGI      xcal      ycal  wavelength        BG     SIGBG  PARTIAL
H   K  L                                                                                    
-43 11 4      1  17.344  46.847  3484.614  1542.749       1.029  7822.822  2476.491    False
    12 3      1  51.900  50.260  3518.963  1620.925       1.038  9531.531  3031.220    False
    13 2      1  15.846  50.161  3543.795  1702.700       1.042  9640.233  3059.509    False
    14 1      1   2.234  49.878  3558.621  1786.493       1.039  9502.979  3026.672    False
-42 11 3      1  59.847  45.124  3431.465  1625.640       1.022  6977.960  2221.127    False

mtz.describe():

          BATCH           I      SIGI      xcal      ycal  wavelength         BG     SIGBG
count  109604.0    109604.0  109604.0  109604.0  109604.0    109604.0   109604.0  109604.0
mean     41.781     335.303    49.417  2056.666  1983.514       1.098   8472.989    2608.6
std      26.573     1896.94    10.409   743.643   925.654       0.065   1934.997   574.715
min         1.0     -59.904    13.928   405.873   385.143        0.97    475.352   149.419
25%        19.0      29.565    45.136  1477.474  1106.246       1.045   7235.718  2240.724
50%        38.0      62.546    49.147  1964.896  2156.781       1.089   8759.109   2715.63
75%        65.0     155.094    51.559  2721.705  2851.438       1.145   9739.043  3010.602
max        90.0  254269.344   508.007  3558.621  3560.209        1.25  17138.117  4408.022

mtz.dtypes:

BATCH             Batch
I             Intensity
SIGI             Stddev
xcal            MTZReal
ycal            MTZReal
wavelength      MTZReal
BG              MTZReal
SIGBG            Stddev
PARTIAL            bool
dtype: object
(laue-dials)[dhekstra@holy7c02409 effects_of_frame_numbers]$ rs.mtzdump integrated_NaI_3_04_frame_1_090_plus.mtz
Spacegroup: P43212
Extended Hermann-Mauguin name: P 43 21 2
Unit cell dimensions: 78.312 78.312 37.234 90.000 90.000 90.000

mtz.head():

           BATCH       I    SIGI      xcal      ycal  wavelength        BG     SIGBG  PARTIAL
H   K  L                                                                                     
-41 14  0      1  17.221  50.048  3523.382  1868.750       1.064  9377.918  2986.332    False
    15 -1      1  -3.082  39.384  3518.724  1953.602       1.050  5408.053  1713.605    False
    16 -2      1  89.414  43.250  3504.237  2035.637       1.032  5887.428  1863.713    False
-40 13  0      1  38.140  37.787  3435.772  1865.180       1.048  5054.999  1610.334    False
    14 -1      1  44.621  39.741  3432.866  1947.375       1.035  5407.095  1715.066    False

mtz.describe():

          BATCH           I      SIGI      xcal      ycal  wavelength         BG     SIGBG
count  178357.0    178357.0  178357.0  178357.0  178357.0    178357.0   178357.0  178357.0
mean     47.776     362.549    47.391  1948.483  1965.882       1.099    7762.96  2391.314
std      25.344    2186.993    12.004   884.901   774.362       0.065   2288.135   690.714
min         1.0     -98.562    12.697   407.095    386.03        0.97    342.978   109.563
25%        27.0      27.017    42.378  1119.304   1345.03       1.045   6362.886  1970.585
50%        49.0      59.011    47.222  2003.734  1959.354        1.09   8032.437  2487.829
75%        69.0     150.925    50.776  2725.096  2588.457       1.147   9395.184  2911.466
max        90.0  279936.031   535.401   3571.91  3556.091        1.25  29599.221  5745.147

mtz.dtypes:

BATCH             Batch
I             Intensity
SIGI             Stddev
xcal            MTZReal
ycal            MTZReal
wavelength      MTZReal
BG              MTZReal
SIGBG            Stddev
PARTIAL            bool
dtype: object

Note that the unit cells are the same. The highest res found in each file is also the same:

print(ds_plus.dHKL.min())
print(ds_plus.dHKL.max())

print(ds_minus.dHKL.min())
print(ds_minus.dHKL.max())

returns

1.7246876
33.626904
1.7246876
25.510649

Careless runs fine with reflections for 360 or more frames. With fewer frames, it fails with the following error:

Traceback (most recent call last):
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/bin/careless", line 8, in <module>
    sys.exit(main())
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/careless.py", line 9, in main
    run_careless(parser)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/careless.py", line 30, in run_careless
    inputs,rac = df.format_files(parser.reflection_files)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/io/formatter.py", line 143, in format_files
    return self((load(f) for f in files))
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/io/formatter.py", line 121, in __call__
    return self.finalize(data, rac)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/io/formatter.py", line 595, in finalize
    refl_id = rac.to_refl_id(
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/careless/io/asu.py", line 172, in to_refl_id
    return self.asu_and_miller_lookup_table.loc[idx, 'id'].to_numpy('int').flatten()
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1097, in __getitem__
    return self._getitem_tuple(key)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1280, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 976, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1077, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1332, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1272, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexing.py", line 1462, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2539, in _get_indexer_str
ict
    return super()._get_indexer_strict(key, axis_name)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_stri
ct
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2559, in _raise_if_missin
g
    return super()._raise_if_missing(key, indexer, axis_name)
  File "/net/holy-nfsisilon/ifs/rc_labs/hekstra_lab/people/mklureza/opt/anaconda/envs/careless_gpu/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5941, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: '[(0, 42, 14, 7)] not in index'

FWIW, we will rerun with --dmin=1.73 and report back.

kmdalton commented 9 months ago

I think --dmin= will fix the issue, but probably we should look into this either way.

DHekstra commented 9 months ago

Yes, setting dmin to 1.73 resolved the problem. Somehow one reflection seems to have been included in the range of one dataset but not the other despite identical unit cell dimensions.

kmdalton commented 9 months ago

My guess is that this issue is to do with harmonic deconvolution and should be solved in v0.4.0. I'll have a look and respond with the conclusion.

kmdalton commented 9 months ago

Using the 90 frame HEWL files provided by @DHekstra, I was unable to replicate this issue with careless v0.3.8, v0.3.9, or v0.4.0.

kmdalton commented 9 months ago

this issue reproduces with --dmin=1.5

kmdalton commented 9 months ago

resolved by #146