Open pkeller opened 4 years ago
Could you test if that all would also work with a case like this:
Crystal 1 => Wavelength I   => Sweep A => Images 1 - 900
                               Sweep B => Images 1801 - 2700
             Wavelength II  => Sweep A => Images 1 - 1800
Crystal 2 => Wavelength I   => Sweep A => Images 1 - 3600
             Wavelength III => Sweep A => Images 1 - 3600
And I then process/merge in the following way (SCALA/AIMLESS/MRFANA nomenclature)
NAME RUN 1 PROJECT x CRYSTAL 1 DATASET wvl-I
RUN 1 BATCH 1 TO 900
NAME RUN 2 PROJECT x CRYSTAL 1 DATASET wvl-I
RUN 2 BATCH 1801 TO 2700
NAME RUN 3 PROJECT x CRYSTAL 1 DATASET wvl-II
RUN 3 BATCH 1 TO 1200
NAME RUN 4 PROJECT x CRYSTAL 2 DATASET wvl-I
RUN 4 BATCH 1 TO 900
NAME RUN 5 PROJECT x CRYSTAL 2 DATASET wvl-I
RUN 5 BATCH 2701 TO 3600
NAME RUN 6 PROJECT x CRYSTAL 2 DATASET wvl-III
RUN 6 BATCH 1 TO 2700
(i.e. multi-wavelength, multi-sweep and split sweeps). Everything with the same DATASET name will get merged together, so we have three distinct datasets.
Would that be possible to describe?
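To make the grouping rule concrete, here is a minimal sketch that reads the keyword lines above and collects runs by DATASET name. It is only an illustration of "same DATASET name gets merged together": the parser handles just the two line shapes shown in this comment and is not how SCALA/AIMLESS/MRFANA actually parse their input. The function name `group_runs_by_dataset` is made up for this example.

```python
from collections import defaultdict

KEYWORDS = """\
NAME RUN 1 PROJECT x CRYSTAL 1 DATASET wvl-I
RUN 1 BATCH 1 TO 900
NAME RUN 2 PROJECT x CRYSTAL 1 DATASET wvl-I
RUN 2 BATCH 1801 TO 2700
NAME RUN 3 PROJECT x CRYSTAL 1 DATASET wvl-II
RUN 3 BATCH 1 TO 1200
NAME RUN 4 PROJECT x CRYSTAL 2 DATASET wvl-I
RUN 4 BATCH 1 TO 900
NAME RUN 5 PROJECT x CRYSTAL 2 DATASET wvl-I
RUN 5 BATCH 2701 TO 3600
NAME RUN 6 PROJECT x CRYSTAL 2 DATASET wvl-III
RUN 6 BATCH 1 TO 2700
"""


def group_runs_by_dataset(text):
    """Return {dataset: [(run, crystal, first_batch, last_batch), ...]}."""
    names = {}    # run number -> (crystal, dataset)
    batches = {}  # run number -> (first batch, last batch)
    for line in text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "NAME":
            # NAME RUN n PROJECT p CRYSTAL c DATASET d: keyword/value pairs
            fields = dict(zip(tok[1::2], tok[2::2]))
            names[int(fields["RUN"])] = (fields["CRYSTAL"], fields["DATASET"])
        elif tok[0] == "RUN":
            # RUN n BATCH first TO last
            batches[int(tok[1])] = (int(tok[3]), int(tok[5]))

    datasets = defaultdict(list)
    for run in sorted(names):
        crystal, dataset = names[run]
        datasets[dataset].append((run, crystal, *batches[run]))
    return dict(datasets)


datasets = group_runs_by_dataset(KEYWORDS)
for name, runs in datasets.items():
    print(name, runs)
```

With the input above this yields three datasets, and wvl-I collects runs 1, 2, 4 and 5 from both crystals.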
The first point is that, IIUC, a data section doesn't contain data from more than one crystal, so in your example we initially get four unmerged data sections.
It seems that your first two sweeps are an inverse-beam pair, so they would need to be split up into their individual scans, something like this:
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
1 1 1 100 0. 10.
2 1 1801 1900 180. 190.
3 1 101 200 10. 20.
4 1 1901 2000 190. 200.
...
With the scan id being referred to by _pdbx_diffrn_unmerged_refln.scan_id in the unmerged reflection list.
This would make up one unmerged data section, let's call it 'ds-unmerged-e1-c1' (energy 1 and crystal 1)
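As a rough sketch of how the interleaved scan rows could be generated, the following assumes what the rows above imply: wedges of 100 images, 0.1 degrees per image, and the 'B' half offset by 1800 images and 180 degrees. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
def inverse_beam_scans(n_wedges, images_per_wedge, osc_width,
                       b_image_offset, crystal_id=1):
    """Build (scan_id, crystal_id, image_begin, image_end, angle_begin, angle_end)
    rows for an interleaved inverse-beam collection: wedge A, then wedge B, ..."""
    rows = []
    scan_id = 1
    wedge_width = images_per_wedge * osc_width
    for w in range(n_wedges):
        first_a = w * images_per_wedge + 1
        angle_a = w * wedge_width
        # The 'B' wedge is collected 180 degrees away from the 'A' wedge.
        for first, angle in ((first_a, angle_a),
                             (first_a + b_image_offset, angle_a + 180.0)):
            last = first + images_per_wedge - 1
            rows.append((scan_id, crystal_id, first, last,
                         round(angle, 6), round(angle + wedge_width, 6)))
            scan_id += 1
    return rows


rows = inverse_beam_scans(n_wedges=2, images_per_wedge=100,
                          osc_width=0.1, b_image_offset=1800)
for row in rows:
    print(*row)
```

Each generated row is one contiguously collected scan, so the A and B wedges alternate in scan_id order.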
To produce the merged data section at wavelength 1, we would have:
loop_
_pdbx_diffrn_merge_image_list.data_section_id
_pdbx_diffrn_merge_image_list.crystal_id
_pdbx_diffrn_merge_image_list.image_id_begin
_pdbx_diffrn_merge_image_list.image_id_end
'ds-unmerged-e1-c1' 1 1 2700
'ds-unmerged-e1-c2' 2 1 3600
Two additional merged data sections could be produced for the other two wavelengths from their corresponding unmerged data sections (which would record their individual merging statistics). This gives us three merged data sections, which correspond to your three merged datasets.
The link between these merged data sections and the more established part of the mmCIF dictionary can be made through the implied link _pdbx_diffrn_data_section_correspondence.diffrn_id -> _diffrn.id (this is a good example of why we need to find a way of defining links that cross datablock boundaries, BTW, but I'll open a separate issue for that).
One problem that this shows up is that there doesn't seem to be a way of specifying the collection wavelength of an unmerged data section independently of some processing having been done on it. Neither of these two seem to do the job:
Would something like this work: add a _pdbx_diffrn_scan.wavelength_id and a _pdbx_diffrn_data_section_index.scan_id?
As an added complexity: each processed sweep/scan can have a different cell, and whenever the wavelength is changed the value might not come back exactly the same, even if the idea is to combine data from the "same" wavelength.
This would then give maybe something like this (ignoring possible interleaved inverse-beam for the moment):
loop_
_pdbx_diffrn_scan.scan_id
_pdbx_diffrn_scan.crystal_id
_pdbx_diffrn_scan.wavelength_id
_pdbx_diffrn_scan.image_id_begin
_pdbx_diffrn_scan.image_id_end
_pdbx_diffrn_scan.scan_angle_begin
_pdbx_diffrn_scan.scan_angle_end
1 1 1 1 900 0. 90.
2 1 1 1801 2700 180. 270.
3 1 2 1 1200 60. 180.
4 2 3 1 900 0. 90.
5 2 3 1 2701 270. 360.
6 2 4 1 2700 0. 270.
loop_
_pdbx_diffrn_merge_wavelength_list.id
_pdbx_diffrn_merge_wavelength_list.wavelength
1 0.9100
2 0.9200
3 0.9105
4 0.8900
loop_
_pdbx_diffrn_unmerged_cell.ordinal
_pdbx_diffrn_unmerged_cell.crystal_id
_pdbx_diffrn_unmerged_cell.wavelength   # <<<< shouldn't that be a pointer to wavelength_list?
_pdbx_diffrn_unmerged_cell.cell_length_a
_pdbx_diffrn_unmerged_cell.cell_length_b
_pdbx_diffrn_unmerged_cell.cell_length_c
_pdbx_diffrn_unmerged_cell.cell_angle_alpha
_pdbx_diffrn_unmerged_cell.cell_angle_beta
_pdbx_diffrn_unmerged_cell.cell_angle_gamma
_pdbx_diffrn_unmerged_cell.Bravais_lattice
1 1 .9100 51.1 109.1 137.3 90.000 90.000 90.000 'oP'
2 1 .9100 51.2 109.0 137.5 90.000 90.000 90.000 'oP'
3 1 .9200 51.3 109.2 137.4 90.000 90.000 90.000 'oP'
4 2 .9105 51.2 108.9 137.4 90.000 90.000 90.000 'oP'
5 2 .9105 51.3 108.8 137.4 90.000 90.000 90.000 'oP'
6 2 .8900 51.2 109.1 137.1 90.000 90.000 90.000 'oP'
loop_
_pdbx_diffrn_data_section.id
_pdbx_diffrn_data_section.type_scattering
_pdbx_diffrn_data_section.type_merged
_pdbx_diffrn_data_section.type_scaled
_pdbx_diffrn_data_section.details
'ds-merged-wvlI' 'x-ray' 'true' 'true' 'something'
'ds-merged-wvlII' 'x-ray' 'true' 'true' 'something'
'ds-merged-wvlIII' 'x-ray' 'true' 'true' 'something'
'ds-unmerged-wvlI' 'x-ray' 'true' 'false' 'something'
'ds-unmerged-wvlII' 'x-ray' 'true' 'false' 'something'
'ds-unmerged-wvlIII' 'x-ray' 'true' 'false' 'something'
loop_
_pdbx_diffrn_data_section_index.data_section_id
_pdbx_diffrn_data_section_index.parent_data_section_id
_pdbx_diffrn_data_section_index.scan_id
'ds-merged-wvlI' 'ds-unmerged-wvlI' .
'ds-merged-wvlII' 'ds-unmerged-wvlII' .
'ds-merged-wvlIII' 'ds-unmerged-wvlIII' .
'ds-unmerged-wvlI' 'ds-unmerged-wvlI-1A' .
'ds-unmerged-wvlI-1A' . 1
'ds-unmerged-wvlI' 'ds-unmerged-wvlI-1B' .
'ds-unmerged-wvlI-1B' . 2
'ds-unmerged-wvlI' 'ds-unmerged-wvlI-2A' .
'ds-unmerged-wvlI-2A' 'ds-unmerged-wvlI-2A.1' .
'ds-unmerged-wvlI-2A.1' . 4
'ds-unmerged-wvlI-2A' 'ds-unmerged-wvlI-2A.2' .
'ds-unmerged-wvlI-2A.2' . 5
'ds-unmerged-wvlII' . 3
'ds-unmerged-wvlIII' . 6
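To show how the proposed index would be read, here is a small sketch that walks the rows above from any data section down to the scans it ultimately contains. It treats the second column as "the section this one is built from", which is how the example rows use it. The tuple layout and the function name `scans_for` are assumptions made for this illustration only.

```python
# (data_section_id, parent_data_section_id, scan_id); None stands for '.'
INDEX = [
    ("ds-merged-wvlI", "ds-unmerged-wvlI", None),
    ("ds-merged-wvlII", "ds-unmerged-wvlII", None),
    ("ds-merged-wvlIII", "ds-unmerged-wvlIII", None),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-1A", None),
    ("ds-unmerged-wvlI-1A", None, 1),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-1B", None),
    ("ds-unmerged-wvlI-1B", None, 2),
    ("ds-unmerged-wvlI", "ds-unmerged-wvlI-2A", None),
    ("ds-unmerged-wvlI-2A", "ds-unmerged-wvlI-2A.1", None),
    ("ds-unmerged-wvlI-2A.1", None, 4),
    ("ds-unmerged-wvlI-2A", "ds-unmerged-wvlI-2A.2", None),
    ("ds-unmerged-wvlI-2A.2", None, 5),
    ("ds-unmerged-wvlII", None, 3),
    ("ds-unmerged-wvlIII", None, 6),
]


def scans_for(section_id, index=INDEX, path=()):
    """Collect the scan ids reachable from section_id, in row order."""
    if section_id in path:
        raise ValueError(f"cyclic reference at {section_id!r}")
    scans = []
    for sid, source, scan in index:
        if sid != section_id:
            continue
        if scan is not None:
            scans.append(scan)
        if source is not None:
            scans.extend(scans_for(source, index, path + (section_id,)))
    return scans


print(scans_for("ds-merged-wvlI"))    # -> [1, 2, 4, 5]
print(scans_for("ds-merged-wvlII"))   # -> [3]
print(scans_for("ds-merged-wvlIII"))  # -> [6]
```

Scans 4 and 5 end up under wvl-I even though they carry wavelength_id 3 (0.9105) rather than 1 (0.9100), which is the "same" wavelength case described above.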
@CV-GPhL What you are suggesting is in some respects the complete opposite to what is in the current example. In particular, the example has one data section per datablock, but you are proposing putting all the data sections in a single datablock. This is a fundamental difference, so I'm going to put part of my response in #3 (in which @epeisach advocates the one data section per datablock approach).
Your suggestion also raises a lower-level issue, aside from the datablock structure. You are defining a data section for each scan (using an informal naming convention to work around the fact that the current dictionary doesn't cater for this), while saying "ignoring possible interleaved inverse-beam for the moment". I think that inverse-beam collections need to be considered from the start, because that helps to clarify the distinction between a scan and a data section. The dictionary extension says:
Each scan consists of a contiguous series of images related by an axis of rotation.
The term "contiguous" is slightly ambiguous (see #4) in this context. If we interpret it to mean "contiguously collected", this definition is consistent with my definition of a scan from the terminology that I drew up three years ago at the request of @GB-GPhL and the MXCuBE steering committee. (That document can be found here: https://github.com/githubgphl/gphl-abstract-beamline/wiki/Terminology.) The distinction is that a scan relates purely to the operation of collecting images, whereas a data section is the object of a data processing operation, and has processing results/statistics associated with it.
With this in mind, defining a data section for each scan doesn't seem right to me - an inverse-beam collection should be expressed in either one or two data sections, depending on whether 'A' and 'B' halves are processed individually or together in the first processing step. There should be no further decomposition into smaller data sections. We could perhaps usefully define an additional item such as _pdbx_diffrn_scan.inverse_beam_component to cater for the case where both of the inverse-beam halves are put into one data section.
It seems to me that there is some duplication in the way that unmerged data sections are related to the merged data sections that they contribute to. Referring to the latest example, the data in pdbx_diffrn_data_section_index and pdbx_diffrn_merge_crystal_list can be derived by uniquifying the data in the first two columns of pdbx_diffrn_merge_image_list.
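A minimal sketch of that "uniquifying" step, under stated assumptions: the rows are in the pdbx_diffrn_merge_image_list layout used earlier in this thread (data_section_id, crystal_id, image_id_begin, image_id_end), and crystal 1 is split into two image ranges purely for illustration. How the derived pairs would map onto the columns of the two derived categories is not shown here and would need checking against the dictionary.

```python
# Illustrative rows: (data_section_id, crystal_id, image_id_begin, image_id_end)
MERGE_IMAGE_LIST = [
    ("ds-unmerged-e1-c1", 1, 1, 900),
    ("ds-unmerged-e1-c1", 1, 1801, 2700),
    ("ds-unmerged-e1-c2", 2, 1, 3600),
]


def unique_section_crystal_pairs(rows):
    """Order-preserving unique (data_section_id, crystal_id) pairs."""
    return list(dict.fromkeys((row[0], row[1]) for row in rows))


pairs = unique_section_crystal_pairs(MERGE_IMAGE_LIST)
contributing_sections = list(dict.fromkeys(section for section, _ in pairs))
contributing_crystals = list(dict.fromkeys(crystal for _, crystal in pairs))

print(pairs)
print(contributing_sections)
print(contributing_crystals)
```

`dict.fromkeys` is used instead of `set` so that the derived rows keep the order in which they first appear in the image list.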
A data block can contain at most one merged data section, because the definition of the category pdbx_diffrn_merged_refln does not allow individual merged reflections to be assigned to data sections. This means that all reflections in that category belong implicitly to the data section given by _pdbx_diffrn_data_section.id (which therefore can only have one value assigned in a data block, or at least only one value which has _pdbx_diffrn_data_section.type_merged 'true').
It looks to me that applications don't need to write out pdbx_diffrn_data_section_index and pdbx_diffrn_merge_crystal_list: they can be populated during the archiving process from the contents of pdbx_diffrn_data_section and pdbx_diffrn_merge_image_list.
Does this sound reasonable? Have I understood this correctly?
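If that reading is right, the "at most one merged data section per data block" constraint is easy to check mechanically. The sketch below assumes the pdbx_diffrn_data_section rows of one data block have already been read into dictionaries keyed by item name. The helper name and the example ids are hypothetical.

```python
def single_merged_section(data_sections):
    """Return the id of the block's merged data section, or None if it has none.

    Raises ValueError if more than one row has type_merged 'true', since merged
    reflections could then not be assigned to a section unambiguously.
    """
    merged = [row["id"] for row in data_sections if row["type_merged"] == "true"]
    if len(merged) > 1:
        raise ValueError(f"more than one merged data section in block: {merged}")
    return merged[0] if merged else None


# Hypothetical contents of one data block
BLOCK = [
    {"id": "ds-merged-wvlI", "type_merged": "true"},
    {"id": "ds-unmerged-e1-c1", "type_merged": "false"},
    {"id": "ds-unmerged-e1-c2", "type_merged": "false"},
]

print(single_merged_section(BLOCK))
```

A check like this could run during archiving, before the index and crystal-list categories are populated from pdbx_diffrn_merge_image_list.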