How to cluster proteins based on structure using protein FASTA as input

hmdaahmd commented 1 year ago

Hello foldseek developer,

I'm trying to cluster proteins based on structure. Apparently, the input file should be in PDB format. I used MMseqs2 to create a database:

$ cat extracted_proteome.faa | mmseqs createdb stdin sequenceDB

Then, when I try to cluster with Foldseek:

$ foldseek easy-cluster sequenceDB sp extclu tmp

it gives me this error:

.... .... ....
Clustering mode: Set Cover
Sort entries
Find missing connections
Found 0 new connections.
Reconstruct initial order
Add missing connections

Time for read in: 0h 0m 1s 9ms
tmp/12559612910676922942/clu_tmp/14063753230906200045/clustering.sh: line 123: 188866 Segmentation fault (core dumped) "$MMSEQS" clust "$INPUT" "${TMP_PATH}/pref_rescore1" "${TMP_PATH}/pre_clust" ${CLUSTER_PAR}
Error: Pre-clustering step died
Error: Search died

I don't have a lot of bioinformatics experience. Could you please help with this issue, for example how to create PDB files from a FASTA file and how to cluster them correctly? Thank you in advance! Hamed

martin-steinegger commented 1 year ago

Foldseek is a protein structure-based alignment and clustering tool. It does not perform any structure prediction. You would need to turn the protein sequences in your FASTA file into structures, e.g. using a structure predictor such as AlphaFold2, ColabFold, ESMFold, etc., and then cluster them using Foldseek.
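
As a rough sketch of the second half of that workflow, the snippet below assumes the structures have already been predicted and saved as one .pdb or .cif file per protein in a single directory. The function names (`count_structures`, `cluster_structures`) and the paths in the example call are hypothetical and used only for illustration; the only Foldseek call is `foldseek easy-cluster` with a structure directory, an output prefix, and a temporary directory.

```shell
#!/usr/bin/env bash
# Sketch: cluster already-predicted structures with Foldseek.
# Assumption: the input directory holds one predicted model per protein
# (.pdb or .cif), produced beforehand by a structure predictor.

# Count structure files directly inside a directory (non-recursive).
# Only the .pdb and .cif extensions are checked here (an assumption of this sketch).
count_structures() {
    find "$1" -maxdepth 1 -type f \( -name '*.pdb' -o -name '*.cif' \) | wc -l | tr -d ' '
}

# Run foldseek easy-cluster on a directory of structures.
# $1 = structure directory, $2 = output prefix, $3 = temporary directory
cluster_structures() {
    local struct_dir="$1" prefix="$2" tmp_dir="$3"
    # Refuse to run on a directory without structure files,
    # e.g. when a sequence database was passed by mistake.
    if [ "$(count_structures "$struct_dir")" -eq 0 ]; then
        echo "No .pdb or .cif files found in $struct_dir" >&2
        return 1
    fi
    foldseek easy-cluster "$struct_dir" "$prefix" "$tmp_dir"
}

# Example call (hypothetical paths):
# cluster_structures structures/ extclu tmp
```

The up-front file check is there so that the run stops with a readable message if the input contains no structures, instead of failing later inside the clustering step.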

hmdaahmd commented 1 year ago

Great! Thank you Martin.

steineggerlab / foldseek

How to cluster proteins based on structure using protein FASTA as input #92