pdbxmmcifwg / diffrn-data-set-extension

PDBx mmCIF dictionary extension for diffraction data sets

Duplication in specifying merged/unmerged data section relationships #2

Open pkeller opened 4 years ago

pkeller commented 4 years ago

It seems to me that there is some duplication in the way that unmerged data sections are related to the merged data sections that they contribute to. Referring to the latest example, the data in pdbx_diffrn_data_section_index:

loop_
_pdbx_diffrn_data_section_index.data_section_id
_pdbx_diffrn_data_section_index.parent_data_section_id
'ds-merged-1'  'ds-unmerged-1'
'ds-merged-1'  'ds-unmerged-2'

and pdbx_diffrn_merge_crystal_list:

loop_
_pdbx_diffrn_merge_crystal_list.data_section_id
_pdbx_diffrn_merge_crystal_list.crystal_id
'ds-unmerged-1'  1
'ds-unmerged-2'  2

can be derived by uniquifying the data in the first two columns of pdbx_diffrn_merge_image_list:

loop_
_pdbx_diffrn_merge_image_list.data_section_id
_pdbx_diffrn_merge_image_list.crystal_id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
'ds-unmerged-1'  1  1   2
'ds-unmerged-2'  2  1   5

A data block can contain at most one merged data section, because the definition of the category pdbx_diffrn_merged_refln does not allow individual merged reflections to be assigned to data sections. This means that all reflections in that category belong implicitly to the data section given by _pdbx_diffrn_data_section.id (which therefore can only have one value assigned in a data block, or at least only one value which has _pdbx_diffrn_data_section.type_merged 'true').

It looks to me that applications don't need to write out pdbx_diffrn_data_section_index and pdbx_diffrn_merge_crystal_list: they can be populated during the archiving process from the contents of pdbx_diffrn_data_section and pdbx_diffrn_merge_image_list.
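
A minimal Python sketch of that derivation, assuming the loop rows have already been parsed into tuples. The names `derive_links` and `merged_section_id` are hypothetical and not part of any mmCIF library; the merged id stands in for the single value that would be read from _pdbx_diffrn_data_section.id:

```python
# Rows copied from the pdbx_diffrn_merge_image_list loop above:
# (data_section_id, crystal_id, image_id_begin, image_id_end)
merge_image_list = [
    ("ds-unmerged-1", 1, 1, 2),
    ("ds-unmerged-2", 2, 1, 5),
]

# Assumption: the one merged data section in this data block,
# as it would be read from _pdbx_diffrn_data_section.id.
merged_section_id = "ds-merged-1"


def derive_links(rows, merged_id):
    """Rebuild the two derivable categories from the image list.

    Returns (section_index, crystal_list), each a list of unique
    pairs kept in first-seen order.
    """
    section_index = []
    crystal_list = []
    for section_id, crystal_id, _begin, _end in rows:
        index_row = (merged_id, section_id)
        crystal_row = (section_id, crystal_id)
        # "Uniquify": only keep the first occurrence of each pair.
        if index_row not in section_index:
            section_index.append(index_row)
        if crystal_row not in crystal_list:
            crystal_list.append(crystal_row)
    return section_index, crystal_list


section_index, crystal_list = derive_links(merge_image_list, merged_section_id)
```

With the example data, `section_index` and `crystal_list` hold the same rows as the two loops quoted at the top of this comment.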

Does this sound reasonable? Have I understood this correctly?

CV-GPhL commented 4 years ago

Could you test if that all would also work with a case like this:

 Crystal 1 => Wavelength I   => Sweep A => Images    1 -  900
                                Sweep B => Images 1801 - 2700
              Wavelength II  => Sweep A => Images    1 - 1800
 Crystal 2 => Wavelength I   => Sweep A => Images    1 - 3600
              Wavelength III => Sweep A => Images    1 - 3600

And I then process/merge in the following way (SCALA/AIMLESS/MRFANA nomenclature)

  NAME RUN 1 PROJECT x CRYSTAL 1 DATASET wvl-I
  RUN      1 BATCH     1 TO  900
  NAME RUN 2 PROJECT x CRYSTAL 1 DATASET wvl-I
  RUN      2 BATCH  1801 TO 2700
  NAME RUN 3 PROJECT x CRYSTAL 1 DATASET wvl-II
  RUN      3 BATCH     1 TO 1200
  NAME RUN 4 PROJECT x CRYSTAL 2 DATASET wvl-I
  RUN      4 BATCH     1 TO  900
  NAME RUN 5 PROJECT x CRYSTAL 2 DATASET wvl-I
  RUN      5 BATCH  2701 TO 3600 
  NAME RUN 6 PROJECT x CRYSTAL 2 DATASET wvl-III
  RUN      6 BATCH     1 TO 2700

(i.e. multi-wavelength, multi-sweep and split sweeps). Everything with the same DATASET name gets merged together, so we have three distinct datasets.
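
The grouping rule can be sketched in Python as follows. This is only an illustration: the run list is transcribed by hand from the keywords above, and `group_by_dataset` is a hypothetical helper, not something SCALA, AIMLESS or MRFANA provides:

```python
from collections import defaultdict

# Runs transcribed from the NAME/RUN keywords above:
# (run, crystal, dataset, first_batch, last_batch)
runs = [
    (1, 1, "wvl-I", 1, 900),
    (2, 1, "wvl-I", 1801, 2700),
    (3, 1, "wvl-II", 1, 1200),
    (4, 2, "wvl-I", 1, 900),
    (5, 2, "wvl-I", 2701, 3600),
    (6, 2, "wvl-III", 1, 2700),
]


def group_by_dataset(rows):
    """Map each DATASET name to the list of runs merged into it."""
    grouped = defaultdict(list)
    for run, _crystal, dataset, _first, _last in rows:
        grouped[dataset].append(run)
    return dict(grouped)


datasets = group_by_dataset(runs)
```

Here `datasets` has three keys, and the "wvl-I" entry collects runs from both crystals, which is the case the data section relationships would need to describe.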

Would that be possible to describe?

pkeller commented 4 years ago

The first point is that, IIUC, a data section doesn't contain data from more than one crystal, so in your example we initially get four unmerged data sections.

It seems that your first two sweeps are an inverse-beam pair, so they would need to be split up into their individual scans, something like this:

loop_
  _pdbx_diffrn_scan.scan_id
  _pdbx_diffrn_scan.crystal_id
  _pdbx_diffrn_scan.image_id_begin
  _pdbx_diffrn_scan.image_id_end
  _pdbx_diffrn_scan.scan_angle_begin
  _pdbx_diffrn_scan.scan_angle_end
1 1     1    100    0.  10.
2 1  1801   1900  180. 190.
3 1   101    200   10.  20.
4 1  1901   2000  190. 200.
...

The scan id would be referred to by _pdbx_diffrn_unmerged_refln.scan_id in the unmerged reflection list.
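
A rough Python sketch of how such interleaved scan rows could be generated. It assumes 100-image wedges of 10 degrees each, with the second half starting at image 1801 and 180 degrees; these values are inferred from the rows above rather than stated anywhere, and `inverse_beam_scans` is a hypothetical helper:

```python
def inverse_beam_scans(crystal_id, n_wedges, images_per_wedge=100,
                       degrees_per_wedge=10.0, second_first_image=1801,
                       second_start_angle=180.0):
    """Return scan rows for an interleaved inverse-beam collection.

    Each row is (scan_id, crystal_id, image_id_begin, image_id_end,
    scan_angle_begin, scan_angle_end). All defaults are assumptions.
    """
    rows = []
    scan_id = 1
    for wedge in range(n_wedges):
        image_offset = wedge * images_per_wedge
        angle = wedge * degrees_per_wedge
        # 'A' half of the pair.
        rows.append((scan_id, crystal_id,
                     1 + image_offset,
                     images_per_wedge + image_offset,
                     angle, angle + degrees_per_wedge))
        # 'B' half, collected 180 degrees away.
        first = second_first_image + image_offset
        rows.append((scan_id + 1, crystal_id,
                     first, first + images_per_wedge - 1,
                     second_start_angle + angle,
                     second_start_angle + angle + degrees_per_wedge))
        scan_id += 2
    return rows


# Nine wedges cover images 1-900 and 1801-2700 of crystal 1.
scan_rows = inverse_beam_scans(crystal_id=1, n_wedges=9)
```

Under these assumptions the call yields 18 scans, alternating between the two halves of the pair.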

This would make up one unmerged data section, let's call it 'ds-unmerged-e1-c1' (energy 1 and crystal 1)

To produce the merged data section at wavelength 1, we would have:

loop_
_pdbx_diffrn_merge_image_list.data_section_id
_pdbx_diffrn_merge_image_list.crystal_id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
'ds-unmerged-e1-c1'  1    1  2700
'ds-unmerged-e1-c2'  2    1  3600

Two additional merged data sections could be produced for the other two wavelengths from their corresponding unmerged data sections (which would record their individual merging statistics). This gives us three merged data sections, which correspond to your three merged datasets.

The link between these merged data sections and the more established part of the mmCIF dictionary can be made through the implied link _pdbx_diffrn_data_section_correspondence.diffrn_id -> _diffrn.id (this is a good example of why we need to find a way of defining links that cross datablock boundaries BTW, but I'll open a separate issue for that).

One problem that this exposes is that there doesn't seem to be a way of specifying the collection wavelength of an unmerged data section independently of some processing having been done on it. Neither of these two seems to do the job:

CV-GPhL commented 4 years ago

Would something like this work: add a _pdbx_diffrn_scan.wavelength_id and a _pdbx_diffrn_data_section_index.scan_id?

An added complexity: each processed sweep/scan can have a different cell, and whenever the wavelength is changed the actual value might not be exactly the same, even if the intent is to combine data from the "same" wavelength.

This would then give maybe something like this (ignoring possible interleaved inverse-beam for the moment):


loop_
  _pdbx_diffrn_scan.scan_id
  _pdbx_diffrn_scan.crystal_id
  _pdbx_diffrn_scan.wavelength_id
  _pdbx_diffrn_scan.image_id_begin
  _pdbx_diffrn_scan.image_id_end
  _pdbx_diffrn_scan.scan_angle_begin
  _pdbx_diffrn_scan.scan_angle_end
1 1 1    1    900    0.  90.
2 1 1 1801   2700  180. 270.
3 1 2    1   1200   60. 180.
4 2 3    1    900    0.  90.
5 2 3 2701   3600  270. 360.
6 2 4    1   2700    0. 270.

loop_
  _pdbx_diffrn_merge_wavelength_list.id
  _pdbx_diffrn_merge_wavelength_list.wavelength
 1  0.9100
 2  0.9200
 3  0.9105
 4  0.8900

loop_
  _pdbx_diffrn_unmerged_cell.ordinal
  _pdbx_diffrn_unmerged_cell.crystal_id
  _pdbx_diffrn_unmerged_cell.wavelength    <<<< shouldn't that be a pointer to wavelength_list?
  _pdbx_diffrn_unmerged_cell.cell_length_a
  _pdbx_diffrn_unmerged_cell.cell_length_b
  _pdbx_diffrn_unmerged_cell.cell_length_c
  _pdbx_diffrn_unmerged_cell.cell_angle_alpha
  _pdbx_diffrn_unmerged_cell.cell_angle_beta
  _pdbx_diffrn_unmerged_cell.cell_angle_gamma
  _pdbx_diffrn_unmerged_cell.Bravais_lattice
    1 1 .9100  51.1 109.1 137.3 90.000 90.000 90.000 'oP'
    2 1 .9100  51.2 109.0 137.5 90.000 90.000 90.000 'oP'
    3 1 .9200  51.3 109.2 137.4 90.000 90.000 90.000 'oP'
    4 2 .9105  51.2 108.9 137.4 90.000 90.000 90.000 'oP'
    5 2 .9105  51.3 108.8 137.4 90.000 90.000 90.000 'oP'
    6 2 .8900  51.2 109.1 137.1 90.000 90.000 90.000 'oP'

loop_
  _pdbx_diffrn_data_section.id
  _pdbx_diffrn_data_section.type_scattering
  _pdbx_diffrn_data_section.type_merged
  _pdbx_diffrn_data_section.type_scaled
  _pdbx_diffrn_data_section.details
'ds-merged-wvlI'     'x-ray' 'true' 'true'  'something'
'ds-merged-wvlII'    'x-ray' 'true' 'true'  'something'
'ds-merged-wvlIII'   'x-ray' 'true' 'true'  'something'
'ds-unmerged-wvlI'   'x-ray' 'true' 'false' 'something'
'ds-unmerged-wvlII'  'x-ray' 'true' 'false' 'something'
'ds-unmerged-wvlIII' 'x-ray' 'true' 'false' 'something'

loop_
  _pdbx_diffrn_data_section_index.data_section_id
  _pdbx_diffrn_data_section_index.parent_data_section_id
  _pdbx_diffrn_data_section_index.scan_id
'ds-merged-wvlI'        'ds-unmerged-wvlI'      .
'ds-merged-wvlII'       'ds-unmerged-wvlII'     .
'ds-merged-wvlIII'      'ds-unmerged-wvlIII'    .
'ds-unmerged-wvlI'      'ds-unmerged-wvlI-1A'   .
'ds-unmerged-wvlI-1A'   .                       1
'ds-unmerged-wvlI'      'ds-unmerged-wvlI-1B'   .
'ds-unmerged-wvlI-1B'   .                       2
'ds-unmerged-wvlI'      'ds-unmerged-wvlI-2A'   .
'ds-unmerged-wvlI-2A'   'ds-unmerged-wvlI-2A.1' .
'ds-unmerged-wvlI-2A.1' .                       4
'ds-unmerged-wvlI-2A'   'ds-unmerged-wvlI-2A.2' .
'ds-unmerged-wvlI-2A.2' .                       5
'ds-unmerged-wvlII'     .                       3
'ds-unmerged-wvlIII'    .                       6
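
To show how such an index could be consumed, here is a small Python sketch that walks the hierarchy from a merged data section down to its scans. It assumes the rows are already parsed, with None standing for '.', and it treats the second column as the section that contributes to the first, as in the rows above. `scans_for` is a hypothetical helper:

```python
# Rows from the pdbx_diffrn_data_section_index loop above:
# (data_section_id, parent_data_section_id, scan_id); None stands for '.'
index_rows = [
    ("ds-merged-wvlI", "ds-unmerged-wvlI", None),
    ("ds-merged-wvlII", "ds-unmerged-wvlII", None),
    ("ds-merged-wvlIII", "ds-unmerged-wvlIII", None),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-1A", None),
    ("ds-unmerged-wvlI-1A", None, 1),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-1B", None),
    ("ds-unmerged-wvlI-1B", None, 2),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-2A", None),
    ("ds-unmerged-wvlI-2A", "ds-unmerged-wvlI-2A.1", None),
    ("ds-unmerged-wvlI-2A.1", None, 4),
    ("ds-unmerged-wvlI-2A", "ds-unmerged-wvlI-2A.2", None),
    ("ds-unmerged-wvlI-2A.2", None, 5),
    ("ds-unmerged-wvlII", None, 3),
    ("ds-unmerged-wvlIII", None, 6),
]


def scans_for(section_id, rows):
    """Collect the scan ids reachable from section_id, depth first.

    Assumes the index contains no cycles.
    """
    scans = []
    for data_section_id, linked_id, scan_id in rows:
        if data_section_id != section_id:
            continue
        if scan_id is not None:
            scans.append(scan_id)
        if linked_id is not None:
            scans.extend(scans_for(linked_id, rows))
    return scans


wvl1_scans = scans_for("ds-merged-wvlI", index_rows)
```

For 'ds-merged-wvlI' this resolves to scans 1, 2, 4 and 5, i.e. the four runs that share the wvl-I DATASET name.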

pkeller commented 4 years ago

@CV-GPhL What you are suggesting is in some respects the complete opposite to what is in the current example. In particular, the example has one data section per datablock, but you are proposing putting all the data sections in a single datablock. This is a fundamental difference, so I'm going to put part of my response in #3 (in which @epeisach advocates the one data section per datablock approach).

Your suggestion also raises a lower-level issue, aside from the datablock structure. You are defining a data section for each scan (using an informal naming convention to work around the fact that the current dictionary doesn't cater for this), while saying "ignoring possible interleaved inverse-beam for the moment". I think that inverse-beam collections need to be considered from the start, because that helps to clarify the distinction between a scan and a data section. The dictionary extension says:

"Each scan consists of a contiguous series of images related by an axis of rotation."

The term "contiguous" is slightly ambiguous (see #4) in this context. If we interpret it to mean "contiguously collected", this definition is consistent with my definition of a scan from the terminology that I drew up three years ago at the request of @GB-GPhL and the MXCuBE steering committee. (That document can be found here: https://github.com/githubgphl/gphl-abstract-beamline/wiki/Terminology.) The distinction is that a scan relates purely to the operation of collecting images, whereas a data section is the object of a data processing operation, and has processing results/statistics associated with it.

With this in mind, defining a data section for each scan doesn't seem right to me: an inverse-beam collection should be expressed in either one or two data sections, depending on whether the 'A' and 'B' halves are processed individually or together in the first processing step. There should be no further decomposition into smaller data sections. We could perhaps usefully define an additional item such as _pdbx_diffrn_scan.inverse_beam_component to cater for the case where both of the inverse-beam halves are put into one data section.