steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
842 stars 104 forks source link

The aligned part is one amino acid shorter than expected #42

Closed Pooryamb closed 1 year ago

Pooryamb commented 2 years ago

Hi,

Thanks for your great program. I am using foldseek to search some cif files against a database of pdb files. I see foldseek reports the alignment "one amino acid shorter than what it is supposed to report". For instance, I chopped the structure of A0A017S8D7 from position 17 to 229, and searched the whole structure against the chopped part. I saw the alignment is one amino acid shorter than what was expected.

Expected Behavior

I expected to see qend to be 229 and tend to be 213.

Current Behavior

qend = 228, and tend = 212

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders. A0A017S8D7_17_229.pdb.gz I assume you have downloaded the attached pdb.gz file and it is in your current working directory.

mkdir query
mkdir target
mv A0A017S8D7_17_229.pdb.gz target
wget https://alphafold.ebi.ac.uk/files/AF-A0A017S8D7-F1-model_v3.cif
mv AF-A0A017S8D7-F1-model_v3.cif query

foldseek createdb target/ Tdb --threads 1
foldseek createdb query/ Qdb --threads 1

foldseek search Qdb Tdb aln tmpFolder -a --threads 1
foldseek convertalis Qdb Tdb aln aln_tm \
--format-output "query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,qlen,tlen,evalue,bits,alntmscore" \
--threads 1

Please take a look at the aln_tm file

Spacepharer Output (for bugs)

Please make sure to also post the complete output of Spacepharer. You can use gist.github.com for large output.

Context

Providing context helps us

come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

martin-steinegger commented 2 years ago

Thank you for providing such a detailed report. It seems that through the cutting, the last 3di state changes from V to D. D and V have a negative substitution score (-3); therefore the last residue is not aligned. The first residue changed from A to D. However, this substitution score is positive (1) and is therefore aligned. I would recommend cutting the 3di sequences instead of truncating the PDB structure. Here is the full query and target 3di sequence overlapped:

  t_3di ----------------DEEALAEPPLCQVLVCVVCVVLVFFEDEPLVCVVFVLPLLVLLLVLLCDPPPPDPPDHHDALVSCCVNQVVGPYYDYPPRSLCVVRVCRNCVPHFYEHRDDDLVVRLVVCVVPLLVCLPDVVLVVCLVPDSGHSNSVSSSSCSSCCSQQVNDSDSVSNSVSSVVSVVVCVVPDDPLRYHYHGPPVDAQCSVCVSVVHDRDPDGRDDDPPPDPD-------------------------------------
  q_3di DDQPLVPQPFFAPDAAAEEALAEPPLCQVLVCVVCVVLVFFEDEPLVCVVFVLNLLRLLLSLLCDPPPPDPPDHHDALGSCRSNQRVGNYYDYPPRSLCVVRVCRNRVNHFYEHRDDDLVVRLVVCVVPLLVCLPDVVLVVCLVPDSGHSNSVSSSSCSSCCSQQVNDSDSVSNSVSSVVSVVVCVVPDDPLRYHYHGPPVDAQCSVCVSSVHDRDPDGRDDDPPDDPVVVVVVVVVVVVVVSVVVVVVVVVVVVVVVVVVVVVVD
Pooryamb commented 2 years ago

Thanks for your reply. I have the fasta file of the protein sequences and 3Di sequences ready. How can I make a database starting from these 2 files? @martin-steinegger

Pooryamb commented 2 years ago

I substituted each sequence of DB_ss with the sequence cut from the 3Di transformation of the whole structure, and it worked.