aozalevsky opened 2 years ago
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte
You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.
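A minimal sketch of that fallback (the function name is mine, not from python-ihm): read the file as UTF-8 first and retry as latin-1 on failure. Since latin-1 maps every byte to a code point, the fallback itself can never raise UnicodeDecodeError.

```python
def read_cif_text(path):
    """Read an mmCIF file as UTF-8, falling back to latin-1.

    latin-1 decodes any byte sequence, so the fallback never
    fails, though non-ASCII characters may be mis-rendered if
    the file was actually in some other encoding.
    """
    try:
        with open(path, encoding='utf-8') as fh:
            return fh.read()
    except UnicodeDecodeError:
        with open(path, encoding='latin-1') as fh:
            return fh.read()
```

With a file containing the stray byte 0xb7 from the traceback above, the UTF-8 pass raises and the latin-1 pass returns the text with that byte decoded as U+00B7.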
Wrong number of data values in loop
I don't see that if I read the file as latin-1. My guess is you broke something during your edits.
That being said, there is an issue with this file:
loop_
_flr_fret_calibration_parameters.id
_flr_fret_calibration_parameters.phi_acceptor
_flr_fret_calibration_parameters.alpha
_flr_fret_calibration_parameters.alpha_sd
_flr_fret_calibration_parameters.gG_gR_ratio
_flr_fret_calibration_parameters.beta
_flr_fret_calibration_parameters.gamma
_flr_fret_calibration_parameters.delta
_flr_fret_calibration_parameters.a_b
1 '.' '.' '.' '.' '.' '.' '.' '.'
Those '.' entries should all be just plain ., of course. @brindakv can fix that upstream.
> You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.
See e.g. c9925742a
Thank you for pointing out the docs. When I first saw this boilerplate in the code, I was curious where it came from. The boilerplate is intact.
I'll add a description of this part to the code later.
It looks like I missed a part of the traceback, though:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte
During handling of the above exception, another exception occurred:
<...>
ValueError: could not convert string to float: '.'
So the UnicodeDecodeError exception was actually caught by the except clause and the reader switched to ASCII, but then it failed again on the data part.
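The second failure comes from calling float() on a raw field, while mmCIF uses the placeholders . (not applicable) and ? (unknown). A hedged guard for such conversions (the helper name is mine, not from the pipeline) might look like:

```python
def to_float(value):
    """Convert an mmCIF value to float, mapping the mmCIF
    placeholders '.' (not applicable) and '?' (unknown) to None."""
    if value is None or value in ('.', '?'):
        return None
    return float(value)
```

Note that python-ihm itself already maps these placeholders to None when parsing; the guard is only needed when handling raw string fields directly.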
The data in the mmCIF file is now updated: changed '.' to plain .
Thanks, @brindakv; parsing now works OK.
Another issue popped up, though. It is related to the software section:
#
loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
1 FPS 'Model building' . . Program .
2 NMSim 'Model building' . . Program http://www.nmsim.de
3 'Amber 14' 'Model building' . . Program .
4 'DeerAnalysis2006' 'Data analysis' . . Program .
The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?
> The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?
We need to provide links for each piece of software for the validation pages; see https://github.com/salilab/IHMValidation/blob/main/templates/references.csv for the file we maintain locally to fill in any missing links.
You can look up whether particular data items are required in the dictionary itself. e.g. https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_software.version.html (location and version are not required there).
Thanks for the clarification! Yeah, I saw references.csv. However, at the moment the code explicitly uses links from the cif file: software here is an instance of the ihm.Software class. So if the idea is to complement the cif file with the data from references.csv, I'll modify this block.
See read_all_references in the same file for the function that reads the CSV.
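A sketch of the complementing step, under the assumption of a CSV with name and link columns (the actual column layout is defined by references.csv in IHMValidation, and read_all_references there is the authoritative reader). It only fills in a location when the cif file left it empty, which matches the "complement, don't override" idea above:

```python
import csv

def fill_missing_locations(software_list, csv_path):
    """Fill in software.location from a local CSV of known links
    when the mmCIF entry leaves it as None.

    Works on any objects with .name and .location attributes,
    e.g. ihm.Software instances. The CSV column names ('name',
    'link') are assumptions for this sketch.
    """
    with open(csv_path, newline='') as fh:
        links = {row['name']: row['link'] for row in csv.DictReader(fh)}
    for software in software_list:
        if software.location is None:
            software.location = links.get(software.name)
```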
There are multiple points in entry 88 where ihm fails during parsing.

The first error is the result of the sentence: <...>

The other error originates from: <...>

And finally, after deleting the symbols causing the previous errors: <...>

@benmwebb @brindakv I need your help on that.