project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
223 stars 45 forks source link

gemmi cif2mtz - what to do about negative sigmas #206

Closed CV-GPhL closed 2 years ago

CV-GPhL commented 2 years ago

The following PDB structures (<id>_<block>) contain unmerged data with some sigmas given as negative value:

4ij8_2 4qq4_2 4qqg_2 5epj_2 5epk_2 5epl_2 5i3l_2 5j39_2 5j3e_2 5juw_2 5lr6_2 5lsa_2
5m6q_2 5nhx_2 5suy_2 5tha_2 5uro_2 5w5v_2 5wp3_2 6asb_2 6asd_2 6bph_2 6ceu_2
6cev_2 6d0s_2 6d34_2 6fer_2 6few_2 6fex_2 6fil_2 6fin_2 6fio_2 6fiq_2 6n8u_2 6nd7_2
6ny8_2 6pi7_2 6r74_2 6r7e_2 6rnk_2 6svs_2 6tbb_2 6tbc_2 6tbe_2 6v04_2 6v0p_2 6v2d_2
6v2r_2 6v2s_2 6v41_2 6vcs_2 6xwg_2 6xwh_2 6yiz_2

This is most likely because e.g. XDS_ASCII.HKL data was converted to mmCIF as-is (without excluding those reflections where the negative sigma marked them as a misfit). Of course, the most likely correct fix is to exclude those reflections from getting into the mmCIF file in the first place - after all, for data coming e.g. out of AIMLESS we would never include those ROGUES reflections: they are not included into output file but in a separate file, while XDS using the same single file for both types of reflections (valid and to be kept versus misfit and to be rejected) is unfortunate in that respect.

Now, when it comes to going back from mmCIF to MTZ I would suggest rejecting those negative sigma reflections from conversion. That seems to me the most natural/sensible approach here. Alternatively (and maybe controlled via some flag) one could also flip their sign or keep the current behaviour.

?

wojdyr commented 2 years ago

Yes, the negative sigmas are not a good way of marking misfits, and they are formally incorrect (according to the mmCIF spec), but also there is no good way to include such reflections. It's not even planned yet, as discussed: https://groups.google.com/a/wwpdb.org/g/mmcif-data-proc/c/QFXYyFXa1C8

I'll think about removing such reflections. But generally the conversion, with exception of the status column, is done in a generic way according to the spec. So for this I'd probably need to add special handling of MTZ column type Q.

CV-GPhL commented 2 years ago

Agree: that is all a bit of a messy situation. Maybe the spec file could contain some syntax to define rejection - something like

intensity_sigma SIGI Q 0 <=0

where the last item defines that a whole record is dropped if that value is <= 0. This way gemmi does an as-is conversion as before but allows user control (at the moment I do that kind of selection within SFTOOLS after gemmi cif2mtz - which works fine, but is a separate step).

wojdyr commented 2 years ago

Having conditions in spec file would be an overkill, unless there are other uses for such filtering. I added option --skip-negative-sigma that filters out any reflections with a negative value in (any) Q type column.

CV-GPhL commented 2 years ago

Remember that we also have column types L (standard deviation of F+ or F-) and M (standard deviation of I+ or I-) that could/should be handled here. However: should we remove a full record if e.g. sigma of F+ is negative (while F- sigma is positive)?

See: https://www.ccp4.ac.uk/html/mtzformat.html#coltypes

wojdyr commented 2 years ago

Aren't misfits marked in unmerged data only, in type Q column?

CV-GPhL commented 2 years ago

Yes - but if you provide the flag --skip-negative-sigma I would expect it to work on any sigmas independent of the type of input (merged/unmerged). It might be very unlikely to nave negative sigmas in L and M column types (which are usually associated with merged data), so maybe not a real issue.

However: what about sigma values that are zero? This also makes no sense and indeed in old reflection files it was used to mark a MNF. E.g. in a set of our weekly runs we have those PDB entries with some sigmas being <= 0:

 WARNING: [3bxq] 527 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [3bxq] 206 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [3ckh.2] 358 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [3ckh.2] 111 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7ejb] 10 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7ejb] 5 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7ejy] 3 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7ejy] 5 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7eku.2] 1744 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7eku.2] 4237 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7f86] 2 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7f86] 323 reflections in columns F(+)/SIGF(+) have been set to 'absent' because sigma value (SIGF(+)) was zero, below zero or a MNF
 WARNING: [7f86] 272 reflections in columns F(-)/SIGF(-) have been set to 'absent' because sigma value (SIGF(-)) was zero, below zero or a MNF
 WARNING: [7f86] 321 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7f86] 272 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7n3r] 134 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7n3r] 80 reflections in columns F(+)/SIGF(+) have been set to 'absent' because sigma value (SIGF(+)) was zero, below zero or a MNF
 WARNING: [7n3r] 85 reflections in columns F(-)/SIGF(-) have been set to 'absent' because sigma value (SIGF(-)) was zero, below zero or a MNF
 WARNING: [7n3r] 57 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7n3r] 283 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7n3s] 30 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7n3s] 16 reflections in columns F(+)/SIGF(+) have been set to 'absent' because sigma value (SIGF(+)) was zero, below zero or a MNF
 WARNING: [7n3s] 22 reflections in columns F(-)/SIGF(-) have been set to 'absent' because sigma value (SIGF(-)) was zero, below zero or a MNF
 WARNING: [7n3s] 22 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7n3s] 6 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7n3v] 674 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7n3v] 717 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o1e] 159 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o1e] 117 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o1f] 194 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o1f] 219 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o2o.4] 4 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o2o.4] 17 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o35] 37 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o35] 43 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o36] 405 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o36] 272 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o3k] 19 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o3k] 7 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7o45] 1393 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7o45] 1307 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7p7f.2] 1331 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7p7f.2] 1368 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7p7g.2] 3 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7p7h.2] 340 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7p7h.2] 3024 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7ppz] 629 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7ppz] 436 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7q4i] 898 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7q4i] 806 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7qjg.5] 130 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7qjg.5] 84 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7qju.5] 221 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7qju.5] 132 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7qk4.5] 156 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7qk4.5] 508 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7rzb] 238 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7rzb] 566 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7shw] 824 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7shw] 844 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7ski] 1520 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7ski] 3829 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7skk] 699 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7skk] 686 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7twa] 46 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7twa] 172 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7vnh] 6 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7vnh] 1 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7w5s] 396 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7w5t] 60 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7w5v] 77 reflections in columns FP/SIGFP have been set to 'absent' because sigma value (SIGFP) was zero, below zero or a MNF
 WARNING: [7wo1] 1 reflections in columns I/SIGI have been set to 'absent' because sigma value (SIGI) was zero, below zero or a MNF
 WARNING: [7yzu] 604 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7yzu] 303 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF
 WARNING: [7z79] 56 reflections in columns I(+)/SIGI(+) have been set to 'absent' because sigma value (SIGI(+)) was zero, below zero or a MNF
 WARNING: [7z79] 74 reflections in columns I(-)/SIGI(-) have been set to 'absent' because sigma value (SIGI(-)) was zero, below zero or a MNF

(this uses our own, pre-gemmi tool to handle mmCIF reflection data and convert into MTZ, so YMMV). It doesn't skip whole reflections, but sets the sigma and associated amplitude/intensity to MNF instead. Is that something gemmi convert already does anyway?

wojdyr commented 2 years ago

It's documented that only Q columns are taken into account:

--skip-negative-sigma    Skip reflections with sigma<0 (in any Q column).

Special handling of sigma=0 would be more tricky, because 0.00 in cif file could mean 0.004.

CV-GPhL commented 2 years ago

Hi Marcin,

On Wed, Apr 20, 2022 at 06:33:24AM -0700, Marcin Wojdyr wrote:

It's documented that only Q columns are taken into account:

--skip-negative-sigma    Skip reflections with sigma<0 (in any Q column).

Correct - maybe provide some additional info/link (since the concept of a column type doesn't really exist at the point of a mmCIF file). I guess "--print-sepc" provides the information about mmCIF item and MTZ column type.

Special handling of sigma=0 would be more tricky, because 0.00 in cif file could mean 0.004.

Not quite I think: 0.00 in CIF file means exactly that. Yes, it might have come from an initial value of 0.004 or -0.004 (which is why we write CIF files in autoPROC in scientific notation), but at the point of cif2mtz you don't have that information. You can of course stick with such a zero value and rely on all subsequent programs (reading the resulting MTZ) of handling the data accordingly: most "old" programs will have checks not just for MNF but also sigma==0. But that is not guaranteed: newer programs might assume that the MTZ file given to them has never gone through a conversion (into CIF format and back) and that items either have a value or a MNF.

I would be inclined to at least write a warning message about the number of sigmas (Q, M and L) that are 0 if you stick with the current behaviiour - since that could point to a poor formatting (precision) during the CIF file generation and could alert developers or databases to potentially check their procedure. It would be even better if there was an option that would tell cif2mtz to set all data items (i.e. pairs where one is a sigma) to MNF if the sigma item is 0.

Cheers

Clemens

wojdyr commented 2 years ago

OK, I added "MTZ" in the text to make clear that Q is MTZ column type:

--skip-negative-sigma    Skip reflections with sigma<0 (in any MTZ Q column).

And I added option:

--zero-to-mnf            If value and sigma are 0, set both to MNF.

It checks if both sigma and the corresponding value are zero, because replacing non-zero value with MNF feels wrong.