pinder-org / pinder

PINDER: The Protein INteraction Dataset and Evaluation Resource
https://pinder-org.github.io/pinder/
Apache License 2.0
94 stars 7 forks source link

Expose biotite PDB back-end through PinderSystem #9

Closed ejmeitz closed 3 months ago

ejmeitz commented 4 months ago

I think it would be useful to expose the backend/pdb_engine option when constructing PinderSystem/Structure. I've found that the biotite loader recovers better from bizarre formatting and returns more useful data even if it is a bit slower.

The only entry point is through Structure.read_pdb which is a bit circuitous

danielkovtun commented 3 months ago

Hi @ejmeitz , thanks for the suggestion. Perhaps we can expose that control it via an environment variable? It could also be added to PinderSystem and Structure, but environment variable might be the simplest way to propagate to anywhere else in the code that uses PinderSystem/Structure.

That being said, do you know of any pinder IDs or PDB files that you had these issues with? Would be good to address them upstream in fastpdb if it's a bug.

ejmeitz commented 3 months ago

I do not have that information immediately available but I believe all the files had a similar issue. The last column with the element (e.g. N, O, C) had an extra space for the last couple rows which causes the pdb reader to break.

Biotite also can't read this but it is capable of inferring the species from the 2nd or 3rd column which has the atom type (e.g CA, CB). If you load one of these files with biotite it will throw a warning telling you that this is being done. There are the same files that fastpdb breaks on.

danielkovtun commented 3 months ago

Interestingly, I am seeing the inverse behavior re: biotite vs. fastpdb -- at least for those structures that had malformed PDB columns exceeding the dimensions and making B-factor values unreadable (the ones we discussed in #8)

pdb_file = get_pinder_location() / "pdbs/3hr2__A3_P02454--3hr2__A4_P02454.pdb"
arr = atom_array_from_pdb_file(pdb_file, backend="fastpdb", extra_fields=None)
arr = atom_array_from_pdb_file(pdb_file, backend="biotite", extra_fields=None)
    arr = model.get_structure(model=1, extra_fields=extra_fields)  # noqa
  File "/Users/danielkovtun/mamba-m1/envs/pinder/lib/python3.10/site-packages/biotite/structure/io/pdb/file.py", line 422, in get_structure
    occupancy[i] = float(line[_occupancy].strip())
ValueError: could not convert string to float: '4  1.0'

Perhaps something you could look into @padix-key?

Nonetheless, I have pushed an edit to expose pdb_engine as an argument to PinderSystem and Structure constructors. Let me know if this works!

https://github.com/pinder-org/pinder/pull/11/commits/fb21e2fc2378fe8167d303d6f49b2c8fdc183a40

ejmeitz commented 3 months ago

The issue here wasn't on the b-factor column, its just the atomic species. I was trying to calculate radius of gyration in biotite which requires the species to look up the mass. I'll try and find a specific file where this was happening.

padix-key commented 3 months ago

biotite.structure.io.pdb uses infer_elements() if missing elements are encountered: https://github.com/biotite-dev/biotite/blob/0404084283765baef765cc869d86ee07a898d82f/src/biotite/structure/io/pdb/file.py#L464-L474

fastpdb does not do this currently. I created https://github.com/biotite-dev/fastpdb/issues/16 for it.

However, I am still curious why this breaks fastpdb.PDBFile: I would simply expect an element annotation with empty strings. Could you post your input PDB file in the linked issue, to make sure I am actually tackling your problem?

ejmeitz commented 3 months ago

Yes it does leave the element type blank in this case it does not crash. Sorry if that was not clear. For the core functionality of pinder this is not really an issue but for the thing I was trying to do (radius of gyration) with the systems it was.

padix-key commented 3 months ago

OK. For now you can simply run infer_elements() on the AtomArray:

atom_array.element = struc.infer_elements(atom_array)