ihm fails on PDBDEV_00000088

aozalevsky commented 2 years ago

There are multiple points in entry 88 where ihm fails during parsing:

Traceback (most recent call last):
  File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 74, in __init__
    self.system, = ihm.reader.read(fh, model_class=self.model)
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3298, in read
    more_data = r.read_file()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 594, in read_file
    return self._read_file_c()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

This is caused by the following sentence:

Typically, 14<B7>106 to 20<B7>106 photons were recorded at TAC channel-width of 14.1\xa0ps (IBH-5000U) or 8\xa0ps (EasyTau300).

The other error:

  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 471889: invalid start byte

originates from:

Sample conditions for the EPR experiments were 100 <B5>M protein in 100 mM NaCl, 50 mM Tris-HCl, 5 mM MgCl2, pH 7.4 dissolved in D2O with 12.5 % (v/v) glycerol-d8.
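The two byte values reported above can be checked in isolation. This is a minimal sketch; the byte strings are shortened stand-ins for the two sentences, not the actual file contents, and `try_decode` is a helper name made up for illustration. It suggests that both bytes are ordinary Latin-1 characters (a middle dot and a micro sign) that are simply not valid on their own in UTF-8.

```python
# Shortened stand-ins for the two offending sentences (not the real file contents).
samples = {
    0xB7: b"14\xb7106 to 20\xb7106 photons",
    0xB5: b"100 \xb5M protein",
}


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for byte, raw in samples.items():
    # A lone 0xB5 or 0xB7 has the bit pattern of a UTF-8 continuation byte,
    # so it cannot start a character; in Latin-1 every byte is a character.
    print(hex(byte), "| utf-8:", try_decode(raw, "utf-8"),
          "| latin-1:", try_decode(raw, "latin-1"))
```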

And finally, after deleting the symbols that caused the previous errors:

Traceback (most recent call last):
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
_format.FileFormatError: Wrong number of data values in loop (should be an exact multiple of the number of keys) at line 1940098

@benmwebb @brindakv I need your help with that.

benmwebb commented 2 years ago

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.
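The try-UTF-8-then-latin-1 pattern described above might look roughly like the sketch below. It is not the code from the linked documentation or from the validation pipeline; `read_with_fallback` and its `parse` argument are names invented here. With python-ihm, `parse` could be something like `lambda fh: ihm.reader.read(fh)`, based on the call shown in the traceback.

```python
def read_with_fallback(path, parse, encodings=("utf-8", "latin-1")):
    """Call parse(fh) on the file at `path`, trying each encoding in turn.

    `parse` is any callable that takes an open text file handle.
    Only decoding failures trigger a retry; any other exception raised
    by `parse` (for example a ValueError on bad data) propagates as-is.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            # Reopen the file each time so parsing restarts from the beginning.
            with open(path, encoding=encoding) as fh:
                return parse(fh)
        except UnicodeDecodeError as err:
            last_error = err  # wrong guess, try the next encoding
    raise last_error
```

Because latin-1 maps every byte to a character, the second attempt cannot fail with a decoding error, although the text may still be wrong if the file was written in some other encoding.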

Wrong number of data values in loop

I don't see that if I read the file as latin-1. My guess is you broke something during your edits.

That being said, there is an issue with this file:

loop_
_flr_fret_calibration_parameters.id
_flr_fret_calibration_parameters.phi_acceptor
_flr_fret_calibration_parameters.alpha
_flr_fret_calibration_parameters.alpha_sd
_flr_fret_calibration_parameters.gG_gR_ratio
_flr_fret_calibration_parameters.beta
_flr_fret_calibration_parameters.gamma
_flr_fret_calibration_parameters.delta
_flr_fret_calibration_parameters.a_b
1 '.' '.' '.' '.' '.' '.' '.' '.'

Those '.' entries should all be just plain . of course. @brindakv can fix that upstream.
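The distinction matters because in mmCIF an unquoted `.` marks a value as omitted, while a quoted `'.'` is the literal one-character string. The sketch below is a simplified illustration of that rule for a float-valued item; `parse_float` is a hypothetical helper, not part of python-ihm, and real mmCIF tokenization has more cases than this.

```python
def parse_float(token: str):
    """Convert one raw mmCIF token to a float, or None when no value is given.

    Simplified: only an unquoted `.` or `?` counts as "no value". A quoted
    token is treated as literal text, so '.' is handed to float() unchanged.
    """
    if token in (".", "?"):
        return None
    if len(token) >= 2 and token[0] in ("'", '"') and token[-1] == token[0]:
        token = token[1:-1]  # strip the quotes, keep the literal content
    return float(token)


print(parse_float("."))     # unquoted: treated as missing
print(parse_float("0.35"))  # ordinary number
try:
    parse_float("'.'")      # quoted: literal ".", which is not a number
except ValueError as err:
    print("ValueError:", err)
```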

benmwebb commented 2 years ago

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.

See e.g. c9925742a

aozalevsky commented 2 years ago

Thank you for pointing out the docs. The first time I saw this boilerplate in the code, I wondered where it came from. The boilerplate is intact.

https://github.com/salilab/IHMValidation/blob/c3b01ca8869c09605cee1b5b244d2e7a8b0aed53/master/pyext/src/validation/__init__.py#L72-L77

I'll add some description of this part to the code later.

It looks like I missed a part of the traceback, though:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

During handling of the above exception, another exception occurred:

<...>
ValueError: could not convert string to float: '.'

So the UnicodeDecodeError was actually caught by the except clause and the reader switched to the fallback encoding, but parsing then failed again on the data itself.
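The "During handling of the above exception, another exception occurred" line is Python's implicit exception chaining. A minimal sketch of how a handled decoding error can still show up in the traceback follows; `parse` and `read` are toy stand-ins invented here, not the ihm functions.

```python
def parse(text: str):
    """Toy parser that always fails on the data, like the quoted '.' case."""
    raise ValueError("could not convert string to float: '.'")


def read(raw: bytes):
    try:
        return parse(raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Fallback path: decoding now succeeds, but parsing can still fail.
        # An exception raised here is chained to the one being handled.
        return parse(raw.decode("latin-1"))


try:
    read(b"14\xb7106")
except ValueError as err:
    # __context__ holds the exception that was being handled at raise time.
    print(type(err).__name__, "raised while handling",
          type(err.__context__).__name__)
```

So the first error in such a traceback is only context; the last one is what actually stopped the program.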

brindakv commented 2 years ago

The data in the mmCIF file has now been updated: changed '.' to .

aozalevsky commented 2 years ago

Thanks, @brindakv, parsing now works.

Another issue popped up, though. It is related to the software section:

#
loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
1 FPS 'Model building' . . Program .
2 NMSim 'Model building' . . Program http://www.nmsim.de
3 'Amber 14' 'Model building' . . Program .
4 'DeerAnalysis2006' 'Data analysis' . . Program .

The code fails on missing links for the software. I can easily fix it, but I wonder whether it's OK to have missing versions and links?

benmwebb commented 2 years ago

The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?

We need to provide links for each piece of software for the validation pages; see https://github.com/salilab/IHMValidation/blob/main/templates/references.csv for the file we maintain locally to fill in any missing links.

You can look up whether particular data items are required in the dictionary itself. e.g. https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_software.version.html (location and version are not required there).

aozalevsky commented 2 years ago

Thanks for the clarification! Yeah, I saw the references.csv. However, at the moment the code explicitly uses links from the cif file:

https://github.com/salilab/IHMValidation/blob/c3b01ca8869c09605cee1b5b244d2e7a8b0aed53/master/pyext/src/validation/__init__.py#L355

software here is an instance of the ihm.Software class. So if the idea is to complement the cif file with the data from references.csv, I'll modify this block.
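One way such a change could look is sketched below. Everything in it is an assumption for illustration: `SoftwareEntry` is a stand-in for the relevant fields of ihm.Software, the `name,location` CSV columns are hypothetical and may not match the real references.csv, and the example.org URL is a placeholder.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoftwareEntry:
    """Hypothetical stand-in for the name/location fields of ihm.Software."""
    name: str
    location: Optional[str] = None


def load_links(csv_text: str) -> dict:
    """Map lower-cased software name to URL (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"].strip().lower(): row["location"].strip()
            for row in reader}


def fill_missing_locations(entries, links):
    """Keep links given in the cif file; fill only the missing ones."""
    for entry in entries:
        if not entry.location:
            entry.location = links.get(entry.name.lower())
    return entries


# Placeholder CSV content, not the real references.csv.
links = load_links("name,location\nFPS,https://example.org/fps\n")
entries = fill_missing_locations(
    [SoftwareEntry("FPS"), SoftwareEntry("NMSim", "http://www.nmsim.de")],
    links,
)
for entry in entries:
    print(entry.name, entry.location)
```

Software that appears in neither source keeps a location of None, which the calling code would still need to handle.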

benmwebb commented 2 years ago

See read_all_references in the same file for the function that reads the CSV.

ihm fails on PDBDEV_00000088 #53