Closed ejmeitz closed 3 months ago
Hi @ejmeitz , thanks for the suggestion. Perhaps we can expose that control it via an environment variable? It could also be added to PinderSystem
and Structure
, but environment variable might be the simplest way to propagate to anywhere else in the code that uses PinderSystem
/Structure
.
That being said, do you know of any pinder IDs or PDB files that you had these issues with? Would be good to address them upstream in fastpdb
if it's a bug.
I do not have that information immediately available but I believe all the files had a similar issue. The last column with the element (e.g. N, O, C) had an extra space for the last couple rows which causes the pdb reader to break.
Biotite also can't read this but it is capable of inferring the species from the 2nd or 3rd column which has the atom type (e.g CA, CB). If you load one of these files with biotite it will throw a warning telling you that this is being done. There are the same files that fastpdb
breaks on.
Interestingly, I am seeing the inverse behavior re: biotite vs. fastpdb -- at least for those structures that had malformed PDB columns exceeding the dimensions and making B-factor values unreadable (the ones we discussed in #8)
pdb_file = get_pinder_location() / "pdbs/3hr2__A3_P02454--3hr2__A4_P02454.pdb"
arr = atom_array_from_pdb_file(pdb_file, backend="fastpdb", extra_fields=None)
arr = atom_array_from_pdb_file(pdb_file, backend="biotite", extra_fields=None)
arr = model.get_structure(model=1, extra_fields=extra_fields) # noqa
File "/Users/danielkovtun/mamba-m1/envs/pinder/lib/python3.10/site-packages/biotite/structure/io/pdb/file.py", line 422, in get_structure
occupancy[i] = float(line[_occupancy].strip())
ValueError: could not convert string to float: '4 1.0'
Perhaps something you could look into @padix-key?
Nonetheless, I have pushed an edit to expose pdb_engine
as an argument to PinderSystem
and Structure
constructors. Let me know if this works!
https://github.com/pinder-org/pinder/pull/11/commits/fb21e2fc2378fe8167d303d6f49b2c8fdc183a40
The issue here wasn't on the b-factor column, its just the atomic species. I was trying to calculate radius of gyration in biotite which requires the species to look up the mass. I'll try and find a specific file where this was happening.
biotite.structure.io.pdb
uses infer_elements()
if missing elements are encountered: https://github.com/biotite-dev/biotite/blob/0404084283765baef765cc869d86ee07a898d82f/src/biotite/structure/io/pdb/file.py#L464-L474
fastpdb
does not do this currently. I created https://github.com/biotite-dev/fastpdb/issues/16 for it.
However, I am still curious why this breaks fastpdb.PDBFile
: I would simply expect an element
annotation with empty strings. Could you post your input PDB file in the linked issue, to make sure I am actually tackling your problem?
Yes it does leave the element type blank in this case it does not crash. Sorry if that was not clear. For the core functionality of pinder this is not really an issue but for the thing I was trying to do (radius of gyration) with the systems it was.
OK. For now you can simply run infer_elements()
on the AtomArray
:
atom_array.element = struc.infer_elements(atom_array)
I think it would be useful to expose the
backend
/pdb_engine
option when constructingPinderSystem
/Structure
. I've found that the biotite loader recovers better from bizarre formatting and returns more useful data even if it is a bit slower.The only entry point is through
Structure.read_pdb
which is a bit circuitous