ispyb / ispyb-database-modeling


Adding fields for anisotropic diffraction data #29

Open rhfogh opened 6 years ago

rhfogh commented 6 years ago

Justification

The STARANISO program provides a new approach to describing the diffraction limits of reflection data, taking anisotropy into account. Beyond this general approach to anisotropy, it also gives a simplified description of the anisotropy via an ellipsoid fitted to the anisotropic cut-off surface, which in turn can be used to calculate well-known statistical descriptors of data merging.

These data are not autoPROC-specific. Diffraction anisotropy is a general phenomenon which, by its nature, makes traditional statistics like resolution and completeness difficult to apply consistently without modification once anisotropy is present and accounted for. In order to consider these effects appropriately, and to make the data accessible to all programs that wish (or will wish) to take them into account, the anisotropy-derived values for resolution and completeness should be stored in ISPyB as general data, and not confined to summary files or program-specific tables.

The proposed changes affect two tables:

AutoProcScalingStatistics

Field                              Type   Null  Default
completenessSpherical              float  YES   NULL
completenessEllipsoidal            float  YES   NULL
anomalousCompletenessSpherical     float  YES   NULL
anomalousCompletenessEllipsoidal   float  YES   NULL

Comment: Completeness and anomalous completeness can be calculated in two different ways, either assuming isotropic data or taking anisotropy into account. Both approaches calculate the fraction of observed reflections within the 'measurable' volume. For spherical completeness this volume is assumed to be a sphere with a radius corresponding to the resolution of the data, whereas ellipsoidal completeness considers the ellipsoid defined by the diffraction limits. The new fields give both values, leaving the pre-existing fields 'completeness' and 'anomalousCompleteness' to be filled with either value as considered appropriate, and to be used in existing applications. Ideally the overall 'completeness' fields would be removed and the various applications refactored to account for the new data available, but this does not seem realistic.
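The difference between the two completeness definitions can be illustrated with a minimal sketch. It is not STARANISO's algorithm: it assumes that each reflection is given as a reciprocal-space vector in 1/Angstrom in the same Cartesian frame as the ellipsoid axes, and that the ellipsoid's semi-axis along each principal axis is 1/(resolution limit). The function names and the five toy reflections are hypothetical.

```python
from math import isclose

def inside_sphere(s, d_min):
    """True if reciprocal-space vector s lies within the sphere of radius 1/d_min."""
    return sum(c * c for c in s) <= (1.0 / d_min) ** 2

def inside_ellipsoid(s, axes, limits):
    """True if s lies within the ellipsoid with unit principal axes `axes`
    and assumed semi-axis lengths 1/limit (limits in Angstrom)."""
    total = 0.0
    for axis, d in zip(axes, limits):
        projection = sum(a * c for a, c in zip(axis, s))
        total += (projection * d) ** 2  # (projection / (1/d))^2
    return total <= 1.0

def completeness(reflections, inside):
    """Fraction of observed reflections among those inside the given volume."""
    flags = [observed for s, observed in reflections if inside(s)]
    return sum(flags) / len(flags) if flags else None

# Toy data (invented): diffraction is weaker along the third axis.
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
limits = [2.0, 2.0, 4.0]
reflections = [
    ((0.4, 0.0, 0.0), True),
    ((0.0, 0.3, 0.0), True),
    ((0.0, 0.0, 0.2), True),
    ((0.0, 0.0, 0.4), False),   # beyond 4.0 A along the weak axis
    ((0.0, 0.0, 0.45), False),  # beyond 4.0 A along the weak axis
]

spherical = completeness(reflections, lambda s: inside_sphere(s, 2.0))
ellipsoidal = completeness(reflections, lambda s: inside_ellipsoid(s, axes, limits))
```

With these toy values the unobserved reflections along the weak axis count against the spherical completeness but fall outside the ellipsoid, so the two numbers differ for the same data set.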

AutoProcScaling

Field                       Type   Null  Default
resolutionEllipsoidAxis11   float  YES   NULL
resolutionEllipsoidAxis12   float  YES   NULL
resolutionEllipsoidAxis13   float  YES   NULL
resolutionEllipsoidAxis21   float  YES   NULL
resolutionEllipsoidAxis22   float  YES   NULL
resolutionEllipsoidAxis23   float  YES   NULL
resolutionEllipsoidAxis31   float  YES   NULL
resolutionEllipsoidAxis32   float  YES   NULL
resolutionEllipsoidAxis33   float  YES   NULL
resolutionEllipsoidValue1   float  YES   NULL
resolutionEllipsoidValue2   float  YES   NULL
resolutionEllipsoidValue3   float  YES   NULL

Comment: STARANISO fits an ellipsoid to the anisotropic cut-off surface, describing it in terms of three principal axes (vectors of unit length) and the resolution limits (in Angstrom) along each axis. The proposed fields give the direction cosines of the three principal axes of the ellipsoid in the standard orthonormal Cartesian frame associated with the crystal frame. For example, the first axis has the triplet of direction cosines resolutionEllipsoidAxis11, resolutionEllipsoidAxis12, resolutionEllipsoidAxis13 and a corresponding length (resolution value) of resolutionEllipsoidValue1.
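As a rough sketch of how a client might consume these twelve columns, the snippet below groups them into three (axis, limit) pairs and checks that the axes are unit length and mutually orthogonal. The column names are the proposed ones; the helper names, the dict-shaped row and the numeric values are invented for illustration.

```python
def read_ellipsoid(row):
    """Group the twelve proposed columns of one row (a dict) into axes and limits."""
    axes = [
        tuple(row[f"resolutionEllipsoidAxis{i}{j}"] for j in (1, 2, 3))
        for i in (1, 2, 3)
    ]
    limits = [row[f"resolutionEllipsoidValue{i}"] for i in (1, 2, 3)]
    return axes, limits

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_orthonormal(axes, tol=1e-3):
    """Check unit length and mutual orthogonality of the three axes."""
    for i in range(3):
        for j in range(i, 3):
            expected = 1.0 if i == j else 0.0
            if abs(dot(axes[i], axes[j]) - expected) > tol:
                return False
    return True

# Invented example row: axes along the Cartesian frame, limits in Angstrom.
row = {
    "resolutionEllipsoidAxis11": 1.0, "resolutionEllipsoidAxis12": 0.0, "resolutionEllipsoidAxis13": 0.0,
    "resolutionEllipsoidAxis21": 0.0, "resolutionEllipsoidAxis22": 1.0, "resolutionEllipsoidAxis23": 0.0,
    "resolutionEllipsoidAxis31": 0.0, "resolutionEllipsoidAxis32": 0.0, "resolutionEllipsoidAxis33": 1.0,
    "resolutionEllipsoidValue1": 1.8,
    "resolutionEllipsoidValue2": 2.1,
    "resolutionEllipsoidValue3": 2.9,
}
axes, limits = read_ellipsoid(row)
```

A check like `is_orthonormal(axes)` is one way a writer or reader of these columns could guard against inconsistent values. The tolerance is arbitrary and would need to match the precision at which the direction cosines are stored.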

graeme-winter commented 5 years ago

Sorry for the ever so slow reply, Karl called this to my attention like yesterday 😕

My thoughts:

Though, I would also comment that the conventional way to store ellipsoidal data like this e.g. in crystallography is as a matrix of anisotropic B values (which would also be 6 not 12 values as the matrix is real symmetric)

 http://www.ebi.ac.uk/pdbe/docs/exchange/mmcif_rcsb_nmr.dic/Categories/atom_site_anisotrop.html

For an example in 222 symmetry, it is noteworthy that Xtriage reports this as:

       ----------Maximum likelihood anisotropic Wilson scaling----------

ML estimate of overall B_cart value:
  16.08,  0.00,  0.00
         27.78,  0.00
                29.58

Equivalent representation as U_cif:
   0.20, -0.00, -0.00
          0.35,  0.00
                 0.37

Eigen analyses of B-cart:
  -------------------------------------------------
  | Eigenvector | Value   | Vector                |
  -------------------------------------------------
  | 1           |  29.579 | ( 0.00,  0.00,  1.00) |
  | 2           |  27.782 | ( 0.00,  1.00, -0.00) |
  | 3           |  16.081 | ( 1.00, -0.00, -0.00) |
  -------------------------------------------------
ML estimate of  -log of scale factor:
  -3.44

In this case there are 3 unique numbers as the symmetry constrains the anisotropy - for an example in P1:

       ----------Maximum likelihood anisotropic Wilson scaling----------

ML estimate of overall B_cart value:
  20.97,  4.46,  1.48
         12.21,  1.61
                29.14

Equivalent representation as U_cif:
   0.24,  0.04, -0.07
          0.16, -0.07
                 0.37

Eigen analyses of B-cart:
  -------------------------------------------------
  | Eigenvector | Value   | Vector                |
  -------------------------------------------------
  | 1           |  29.755 | ( 0.24,  0.15,  0.96) |
  | 2           |  22.272 | ( 0.89,  0.35, -0.28) |
  | 3           |  10.296 | (-0.38,  0.92, -0.05) |
  -------------------------------------------------

Also, if we store spherical completeness to this resolution limit, I think we should also store spherical Rmerge etc. to give fair side-by-side comparisons; otherwise this could be (and already is) misleading. In practice, if we keep the existing values for the spherical equivalents (as conventionally understood) and then (optionally) add elliptical-truncated values, it will be much clearer.

rhfogh commented 5 years ago

What is actually calculated in Staraniso is what we are proposing here, i.e. the three diffraction limits and their associated directions (not identical to the directions of the anisotropic B ellipsoid, BTW). This does indeed require twelve numbers, of which six are redundant because the eigenvectors are normalised and constrained to be orthogonal. The advantage of storing the twelve values is that they are what users can understand directly, and therefore what they would want to see. It would be possible to construct a matrix with the diffraction limits as eigenvalues and the associated directions as eigenvectors, and such a matrix, being symmetric, could indeed be stored with only six numbers. The problem is that no one can interpret that matrix in their head - you would need code to convert it to a more intelligible representation whenever you wanted to display the results, be it in a UI or a summary file. Given that there are several applications that would be reading values out of ISPyB, this would require multiple conversion functions, with additional risk of bugs or misunderstandings causing discrepancies.
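The relationship between the twelve-number and six-number representations can be sketched as follows. The code composes the symmetric matrix M = sum_k value_k * v_k v_k^T from three values and three unit vectors, then extracts its six unique elements. The function names are hypothetical. For want of any other worked numbers, the input is the Xtriage B_cart eigen analysis for the P1 example quoted earlier in this thread; that is the anisotropic B ellipsoid, not the STARANISO diffraction-limit ellipsoid, and it is used only to show the arithmetic.

```python
def compose_symmetric(values, vectors):
    """Return M = sum_k values[k] * outer(vectors[k], vectors[k]) as a 3x3 list."""
    m = [[0.0] * 3 for _ in range(3)]
    for lam, v in zip(values, vectors):
        for i in range(3):
            for j in range(3):
                m[i][j] += lam * v[i] * v[j]
    return m

def unique_six(m):
    """The six independent elements of a symmetric 3x3 matrix (upper triangle)."""
    return (m[0][0], m[0][1], m[0][2], m[1][1], m[1][2], m[2][2])

# Eigenvalues and (two-decimal) eigenvectors from the P1 Xtriage output.
values = [29.755, 22.272, 10.296]
vectors = [
    (0.24, 0.15, 0.96),
    (0.89, 0.35, -0.28),
    (-0.38, 0.92, -0.05),
]

six = unique_six(compose_symmetric(values, vectors))

# B_cart upper triangle as printed by Xtriage for the same example.
b_cart = (20.97, 4.46, 1.48, 12.21, 1.61, 29.14)
max_deviation = max(abs(a - b) for a, b in zip(six, b_cart))
```

Because the printed eigenvectors are rounded to two decimals, the recomposed matrix only agrees with the printed B_cart to within about 0.15. Going the other way, from six numbers back to limits and directions, needs an eigen-decomposition, which is the conversion step the paragraph above is concerned about.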

As long as the information is there, and that it is not unduly complicated to extract the values that we want to show the user, I suppose we are free to choose. One could argue that it is a space-versus-time trade-off. I would point out, though, that atomic B factors collectively take up a LOT of space and are not looked at as numbers, but only ever viewed in display programs. The diffraction limits require minimal space by comparison and need to be looked at directly by people. I would therefore favour the current proposal.

On Graeme's second point, I should point out that Staraniso does not actually do ellipsoidal truncation. The region where signals are considered to be in principle observable is determined from local I/sig(I) and is of irregular shape (it could only be ellipsoidal if the multiplicity was uniform, among other things). The ellipsoid is only used 1) to calculate the three diffraction limits for the information of the user, and 2) to define the region used for calculating completeness.

As I understand it, Rmerge is calculated over the region of data thought to contain possible information, which will be used downstream. The difference comes because in one case there is the constraint that this region must be spherical, and in the other case there is not. There would be no point in calculating merging statistics for a lot of points that will anyway be rejected as not containing possible information.

The issue of correct, and fair, metrics is complex, since isotropic and anisotropic approaches make different selections of data to include. The results are not strictly comparable, and in any case produce different (numbers of) parameters. One could argue that the ultimate solution would be for all programs to produce whichever was decided to be the better scientific metric. Meanwhile, Staraniso produces one completeness and three diffraction limits, none of which is directly comparable to the isotropic completeness and diffraction limit. Quoting the anisotropic completeness with the best of the three diffraction limits would indeed be misleading. At GPhL we have been considering whether we could define an 'equivalent resolution limit' for anisotropic analysis that would be more or less comparable to isotropic diffraction limits. Until there is consensus on something like that, the best we can do is probably to give the users more than one diffraction limit, and educate them about what the various columns mean.

Gerard, Clemens, and Rasmus

Anthchirp commented 5 years ago

Regarding the 12 values, could these be provided by a view on the database? This would make it application-independent, and would also ensure we are not storing denormalized data. The database is already denormalized to some extent and we should strive to avoid any further denormalization. This is not about a space vs. time trade-off, but about database normalization and integrity.

Similarly, in my view, adding completeness/anomalousCompleteness columns for ellipsoid data violates 1NF. I may be wrong here, and I am certain you can elaborate on this, but what would you understand by an inner shell anisotropic completeness?

If no such thing exists, then the correct solution is to add the value anisotropic to the ENUM scalingStatisticsType. You can then store all information in a backwards-compatible form, with the existing completeness columns automatically taking on the anisotropic meaning, and the RMerge etc. columns also become available for software to use in the future, which is desirable, as @graeme-winter pointed out above. This approach is meaningful even if you have multiple diffraction limits differing from their isotropic meaning. I would argue even more so, since otherwise you would end up with a whole load of columns named anisotropicResolutionLimitLow etc.

If I am wrong with the above, and the concept of an inner shell anisotropic completeness does make sense, then a solution not violating 1NF is to add a single column analogous to anomalous to identify anisotropic entries.
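The first option (an extra ENUM value rather than extra columns) might look roughly like the sketch below. It uses an in-memory SQLite database from Python purely so that it is runnable; SQLite has no ENUM type, so a CHECK constraint stands in for it. The table is a cut-down, hypothetical version of AutoProcScalingStatistics: the existing enum values, the key columns and every number in it are assumptions for illustration, not the real ISPyB schema or real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE AutoProcScalingStatistics (
        autoProcScalingStatisticsId INTEGER PRIMARY KEY,
        autoProcScalingId INTEGER NOT NULL,
        -- CHECK emulates an ENUM; 'anisotropic' is the proposed new value,
        -- the other three values are assumed to be the existing ones.
        scalingStatisticsType TEXT NOT NULL
            CHECK (scalingStatisticsType IN
                   ('overall', 'innerShell', 'outerShell', 'anisotropic')),
        completeness REAL,
        anomalousCompleteness REAL,
        rMerge REAL
    )
    """
)

# Placeholder values only: one isotropic and one anisotropic row per scaling.
conn.executemany(
    "INSERT INTO AutoProcScalingStatistics "
    "(autoProcScalingId, scalingStatisticsType, completeness, "
    " anomalousCompleteness, rMerge) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "overall", 71.3, 69.8, 0.082),
        (1, "anisotropic", 93.5, 92.1, 0.064),
    ],
)

# Existing columns are reused; the row type tells the reader how to interpret them.
stats = {
    stats_type: (completeness, r_merge)
    for stats_type, completeness, r_merge in conn.execute(
        "SELECT scalingStatisticsType, completeness, rMerge "
        "FROM AutoProcScalingStatistics WHERE autoProcScalingId = 1"
    )
}
```

In this layout no new statistics columns are needed, and merging statistics such as rMerge are available for the anisotropic row as well. Whether shell-wise anisotropic rows make sense is exactly the open question raised above.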

graeme-winter commented 5 years ago

@rhfogh I misunderstood the title of the issue - "Adding fields for anisotropic diffraction data" - what you actually want is to store "What is actually calculated in Staraniso is what we are proposing here, i.e. the three diffraction limits and their associated directions" which is useful for autoPROC and not allowed for any other pipeline.

For info: I would think storing results from xtriage would also be useful; however, we would then need to get yet more columns added to achieve this, since these are "staraniso only" columns?