steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Mismatch between the 3Di sequence inferred by the FoldSeek command and the one from the mini3di repo #336

Open KatarinaYuan opened 2 months ago

KatarinaYuan commented 2 months ago

Expected Behavior

I am trying to convert PDB structures into 3Di sequences. For mini3di (https://github.com/althonos/mini3di/), I used:

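import mini3di
from Bio import PDB

pdb_path = "1xso.cif"
chain_id = "A"  # assumption: chain_id was not defined in the original snippet

# mini3di
# Assumption: self.tokenizer_encoder in the original snippet was a mini3di.Encoder
encoder = mini3di.Encoder()
# Pick the Biopython parser that matches the file format
if pdb_path.endswith(".pdb"):
    parser = PDB.PDBParser(QUIET=True)
else:
    parser = PDB.MMCIFParser(QUIET=True)
structure = parser.get_structure("test", pdb_path)
# Encode one chain of the first model into 3Di states, then into a sequence string
states = encoder.encode_chain(structure[0][chain_id])
seq_mini3di = encoder.build_sequence(states)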

For FoldSeek, I used the command suggested in issue #314.

Current Behavior

mini3di results in "DKKKWWKDFPDPKTKIKIWDDDDLFKIKIWMKIFQADFDKKWKWWACAQDCPVTVVVSHFGAAPPDFWDFAQPDPRHGLTGDFIFGDDPRMTTDMDIHNSAGCDDPNRQQRIKMFIANAGQCGLPPPDPVSRGTSPRDDTRIMTGMHGDD"

and FoldSeek results in "DKKKWWKDFPDPKTKIKIWDDDDLFKIKIWMKIFQADFDKKWKWWACAQDCPVHVVVSHFGAAPPDFWDFAQPDPRHGLTGDFIFGDDPRMTTDMDIHNSAGCDDPNRQQRIKMFIANAGQCGLPPPDPVSRGTSPRDDTRIMTGMHDDD"

The two resulting sequences differ at a few residues.
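To make the mismatch easier to pin down, here is a minimal sketch that lists the positions where the two strings differ. The helper name `mismatches` is hypothetical and not part of either tool; the two strings are copied verbatim from above.

```python
# 3Di strings copied from the report above
seq_mini3di = "DKKKWWKDFPDPKTKIKIWDDDDLFKIKIWMKIFQADFDKKWKWWACAQDCPVTVVVSHFGAAPPDFWDFAQPDPRHGLTGDFIFGDDPRMTTDMDIHNSAGCDDPNRQQRIKMFIANAGQCGLPPPDPVSRGTSPRDDTRIMTGMHGDD"
seq_foldseek = "DKKKWWKDFPDPKTKIKIWDDDDLFKIKIWMKIFQADFDKKWKWWACAQDCPVHVVVSHFGAAPPDFWDFAQPDPRHGLTGDFIFGDDPRMTTDMDIHNSAGCDDPNRQQRIKMFIANAGQCGLPPPDPVSRGTSPRDDTRIMTGMHDDD"


def mismatches(a, b):
    """Return (1-based position, letter in a, letter in b) for each differing position."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


for pos, x, y in mismatches(seq_mini3di, seq_foldseek):
    print(f"position {pos}: mini3di={x} FoldSeek={y}")
```

For these two strings the loop reports two differences, at positions 54 (T vs H) and 148 (G vs D), counting from 1 along the 3Di string.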

Environment

I used foldseek==9-427df8a (the latest) and mini3di==0.1.1.

Thanks for your help.

milot-mirdita commented 2 months ago

Please open an issue in mini3di. It is a community project, which we don't run.

althonos commented 1 month ago

Hi @KatarinaYuan (and hi @milot-mirdita, thanks for pointing this out in the e-mail).

This difference is due to some atoms in the linked PDB file being disordered:

ATOM     33  CB ACYS A   6      21.438  19.816  -0.079  0.50 12.85           C  
ATOM     34  CB BCYS A   6      21.428  19.604   0.838  0.50  8.66           C  

The way these are handled differs between Biopython and Foldseek:

This difference in behaviour causes different atom coordinates to be selected, so in the end the 3Di sequences are different. I can add a flag to mini3di to take the last atom regardless of occupancy, but my impression is that selecting by occupancy is a better choice than taking the last atom in the order it appears in the source file?
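As a rough illustration of the two policies discussed here (selecting by occupancy versus taking the last record in file order), the sketch below applies both to the two CB records of Cys 6 shown above. It is not code from Biopython, mini3di, or Foldseek: the `AltLoc` class and both function names are hypothetical, and the tie-breaking rule (the earliest conformer wins when occupancies are equal) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AltLoc:
    """One alternate location of a disordered atom (hypothetical helper type)."""
    altloc: str
    coord: tuple
    occupancy: float


# CB of Cys 6, chain A, values copied from the two ATOM records above
cb_conformers = [
    AltLoc("A", (21.438, 19.816, -0.079), 0.50),
    AltLoc("B", (21.428, 19.604, 0.838), 0.50),
]


def pick_highest_occupancy(conformers):
    """Keep the conformer with the highest occupancy.

    Assumption: on a tie, the conformer seen first is kept.
    """
    best = conformers[0]
    for candidate in conformers[1:]:
        if candidate.occupancy > best.occupancy:
            best = candidate
    return best


def pick_last_in_file(conformers):
    """Keep whichever conformer appears last in the file, ignoring occupancy."""
    return conformers[-1]


print(pick_highest_occupancy(cb_conformers).altloc)
print(pick_last_in_file(cb_conformers).altloc)
```

With both occupancies at 0.50, the first policy keeps altloc A and the second keeps altloc B, so the CB coordinates passed to the 3Di encoder differ by almost 1 Å in z.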