rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata
MIT License
16 stars 6 forks source link

Running careless to scale merged mtz files #143

Closed ElkeDeZitter closed 9 months ago

ElkeDeZitter commented 9 months ago

Hi,

I would like to see how careless performs to scale merged data (what I usually would do with scaleit). I know the careless help message explicitly states that the input mtz file(s) should contain unmerged reflections. Nevertheless, I was wondering if it was possible.

I tried to run careless as follows:

careless mono \
  --spacegroups="P 21 21 21"  \
  --separate-files \
  'dHKL,Hobs,Kobs,Lobs'  \
  mtz1.mtz \
  mtz2.mtz \
  careless_merge/test

My mtz files only contain Miller indices and IMEAN and SIGIMEAN columns, which is also confirmed by rs.mtzdump:

[...]
mtz.dtypes:

IMEAN       Intensity
SIGIMEAN       Stddev
dtype: object

and after calling the read_mtz function in ipython:

In [2]: rs.read_mtz("mtz1.mtz").keys()
Out[2]: Index(['IMEAN', 'SIGIMEAN'], dtype='object')

Upon launching careless as described above, I get a large error message:

Careless version 0.3.9
Traceback (most recent call last):
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3653, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edezitter/miniconda3/envs/careless/bin/careless", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
    run_careless(parser)
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 30, in run_careless
    inputs,rac = df.format_files(parser.reflection_files)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/io/formatter.py", line 143, in format_files
    return self((load(f) for f in files))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/io/formatter.py", line 120, in __call__
    data, rac = self.get_data_and_asu_collection(datasets)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/io/formatter.py", line 83, in get_data_and_asu_collection
    ds = self.prep_dataset(ds, sg)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/careless/io/formatter.py", line 309, in prep_dataset
    ds['image_id'] = ds[image_key]
                     ~~^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edezitter/miniconda3/envs/careless/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3655, in get_loc
    raise KeyError(key) from err
KeyError: None

My understanding is that Careless can't find a column that it finds suitable for merging since the data is already merged. Is that correct? Would there be an easy way to circumvent this or should I stay with scaleit for these kind of scaling problems?

Thanks!

kmdalton commented 9 months ago

Hi @ElkeDeZitter,

Apologies for the late reply. It is a long, holiday weekend here.

Careless expects unmerged data and so it is looking for a column with the "BATCH" data type. This is typically used to record what image a reflection was observed on in an unmerged dataset. There is some nuance to what "BATCH" actually means in the case of rotation-method data, but that's the idea at least.

If you can just add a BATCH column to your MTZ which puts all the reflections on the same image, careless will be appeased. I was able to get careless to scale merged data by pre-processing the mtz file with the following script:

import reciprocalspaceship as rs

anomalous=True
mtz_in = 'hewl_0.mtz'
mtz_out = 'hewl_0_batch.mtz'

ds = rs.read_mtz(mtz_in)

if anomalous:
    ds = ds.stack_anomalous()
    ds = ds.dropna()

ds['BATCH'] = 1
ds = ds.infer_mtz_dtypes()
ds.merged = False

ds.write_mtz(mtz_out)

You need to set the anomalous variable to match your particular use case. MTZs with anomalous data are usually stored in a two column format with the miller index in the positive half of the reciprocal asymmetric unit. This sort of MTZ with (+) and (-) columns need to be handled differently from those without.

Let me know if this makes sense. I am happy to have another look tomorrow.

ElkeDeZitter commented 9 months ago

Hi @kmdalton , That seems to do the trick! Thank you very much.

kmdalton commented 9 months ago

Awesome! I'll close this issue, but I'd love to follow up on this idea later.