mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Foldseek annotation of predicted 3Di structure gives no results #34

Open Jigyasa3 opened 3 weeks ago

Jigyasa3 commented 3 weeks ago

Hi everyone,

Thank you for an amazing tool! I am generating the predicted 3Di structure of a protein sequence of interest so that I can structurally annotate it against a Foldseek database. Here are the commands I am using:

get the predicted 3Di structure

python /groups/rubin/databases/foldseek/scripts/predict_3Di_encoderOnly.py -i ${file1} -o ${OUT_DIR}/predicted_3Di_${file1} --model ${DB_DIR}/ #DB_DIR contains the alphafold_uniprot50

generate foldseek database

python /groups/rubin/databases/foldseek/scripts/generate_foldseek_db.py ${IN_DIR}/rep_protein1.faa ${OUT_DIR}/predicted_3Di_rep_protein1 rep_protein1

run foldseek

foldseek easy-search ${IN_DIR}/rep_protein1 ${DB_DIR}/alphafold_uniprot ${OUT_DIR}/rep_protein1_protT5.txt tmp --format-mode 4 --alignment-type 1

While the first two steps generate the predicted 3Di file and the Foldseek database, the Foldseek output is empty. I ran the same protein through the Foldseek web tool and it works, so I think I am doing something wrong in the first two steps. Any suggestions as to why this might be happening? I am attaching the protein sequence and the 3Di file to reproduce the results.
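One quick sanity check before building the database, sketched here under the assumption that both files are plain FASTA with records in the same order: a 3Di string has one state per residue, so each predicted record should be exactly as long as its amino-acid record. The function names `read_fasta` and `length_mismatches` are illustrative and are not part of the ProstT5 scripts.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_mismatches(aa_fasta, tdi_fasta):
    """List records whose 3Di length differs from the amino-acid length."""
    aa, tdi = read_fasta(aa_fasta), read_fasta(tdi_fasta)
    problems = []
    if len(aa) != len(tdi):
        problems.append(("record count", len(aa), len(tdi)))
    # Records are paired by position, which assumes identical ordering.
    for (name, aa_seq), (_, tdi_seq) in zip(aa, tdi):
        if len(aa_seq) != len(tdi_seq):
            problems.append((name, len(aa_seq), len(tdi_seq)))
    return problems
```

An empty list from `length_mismatches` means the two files are at least consistent in record count and length; it says nothing about whether the predicted states themselves are sensible.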

mheinzinger commented 3 weeks ago

Hi, thanks a lot for your interest in our tool :) I took a quick look and can confirm your issue: your sequence does return search hits when using the Foldseek webserver with ProstT5-predicted 3Di (in fact, the E-value etc. indicate really good scoring for e.g. https://www.uniprot.org/uniprotkb/Q2ETE4/entry). I also compared the ProstT5 3Di prediction from the webserver to your attached 3Di file and they are identical, so step 1 (get the predicted 3Di structure) works fine. You could manually check the resulting DB from step 2 by comparing it to the expected format described in e.g. the section "Sequence database format" in the MMseqs2 user guide (sorry, I ran out of time to debug this on my end). For step 3, I would recommend removing anything that might cause issues, even if it's something as small as changing the output format, as you did with --format-mode 4, or the alignment type. Hope this helps!
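The manual check of the step 2 database could be scripted roughly as follows. This is a minimal sketch, not an official validator, and it rests on two assumptions about the on-disk layout: that each `.index` line holds a key, a byte offset and a length separated by whitespace, with every entry in the data file ending in a null byte, and that the 3Di sequences live next to the amino-acid database under the same prefix plus `_ss`. Both function names are hypothetical.

```python
def read_mmseqs_db(prefix):
    """Return {key: entry bytes} for an MMseqs2-style database (assumed layout)."""
    with open(prefix, "rb") as handle:
        data = handle.read()
    entries = {}
    with open(prefix + ".index") as handle:
        for line in handle:
            if not line.strip():
                continue
            key, offset, length = line.split()[:3]
            start = int(offset)
            chunk = data[start:start + int(length)]
            # Each entry is expected to end with a null byte.
            if not chunk.endswith(b"\0"):
                raise ValueError(f"entry {key} is not null-terminated")
            entries[key] = chunk.rstrip(b"\0").rstrip(b"\n")
    return entries


def compare_aa_and_3di(prefix):
    """Return keys that are missing from one database or differ in length."""
    aa = read_mmseqs_db(prefix)
    tdi = read_mmseqs_db(prefix + "_ss")  # assumed name of the 3Di database
    bad = []
    for key in sorted(aa.keys() | tdi.keys()):
        if key not in aa or key not in tdi or len(aa[key]) != len(tdi[key]):
            bad.append(key)
    return bad
```

Calling `compare_aa_and_3di` on the prefix written in step 2 should return an empty list if every entry has matching amino-acid and 3Di lengths. A `ValueError` or a non-empty list would point at the database generation step rather than at the search.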