steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

database cluster by pure structure similarity #151

Open Wangchentong opened 1 year ago

Wangchentong commented 1 year ago

@milot-mirdita @martin-steinegger

Hi, I would like to ask a question about a technical detail:

I want to cluster a database purely by structure similarity, for the purpose I described in another issue.

In foldseek search, I see that the --alignment-type parameter (listed under misc) controls whether aa, 3di, or aa+3di is used for the alignment. However, this option does not show up for the foldseek cluster command. The following options might relate to my purpose:

foldseek cluster -h
    prefilter:
    --seed-sub-mat TWIN              Substitution matrix file for k-mer generation [aa:3di.out,nucl:3di.out]
    --mask INT                       Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]
    --mask-prob FLOAT                Mask sequences is probablity is above threshold [0.900]
    align:
    --alignment-mode INT             How to compute the alignment:
                                      0: automatic
                                      1: only score and end_pos
                                      2: also start_pos and cov
                                      3: also seq.id [3]
    clust:
    --similarity-type INT            Type of score used for clustering. 1: alignment score 2: sequence identity [2]
    common:
    --sub-mat TWIN                   Substitution matrix file [aa:3di.out,nucl:3di.out]

Here is my current command:

foldseek cluster afDB af80_clusterDB tmp -c 0.8 --cluster-reassign --mask 1 --alignment-mode 2 --similarity-type 1

Thank you for this amazing tool! I hope to understand these parameters better, since I looked through the documentation and found little description of them. Any suggestions are much appreciated! 😉

martin-steinegger commented 1 year ago

--alignment-type should work in the clustering. It also shows up in my help text. What version are you using? I recommend using the most recent version, since I properly implemented the 3Di-only search in the most recent commit.

--similarity-type 1 has no impact on the clustering and --cluster-reassign is currently not implemented.
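For reference, the advice above could translate into a command like the sketch below. It keeps the database names and -c 0.8 from the question, drops the two flags that have no effect, and adds --alignment-type. It assumes that the value 0 selects the 3Di-only alignment in your build; please confirm this against the output of foldseek cluster -h. The FOLDSEEK variable and the cluster_3di_only function name are only illustrative.

```shell
#!/bin/sh
# Sketch: cluster a structure database using the 3Di-only alignment.
# ASSUMPTION: "--alignment-type 0" means 3Di-only in your foldseek build;
# verify with `foldseek cluster -h` before relying on it.
# FOLDSEEK is a hypothetical override for the binary path.
FOLDSEEK="${FOLDSEEK:-foldseek}"

cluster_3di_only() {
    # $1 = input structure DB, $2 = output cluster DB, $3 = temporary directory
    # --similarity-type and --cluster-reassign are omitted on purpose:
    # per the reply above, they currently have no effect on the clustering.
    "$FOLDSEEK" cluster "$1" "$2" "$3" -c 0.8 --alignment-type 0
}

# Usage (names taken from the question):
#   cluster_3di_only afDB af80_clusterDB tmp
```

Other options from the original command, such as --mask or --alignment-mode, can be appended to the same call if you still need them.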

dtischer commented 1 year ago

I just ran into this same issue. foldseek behaved as you describe when I installed it with conda (conda install -c conda-forge -c bioconda foldseek). However, both of the precompiled Linux binaries show the --alignment-type option with easy-cluster for me. (Note: the download is at https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz rather than https://mmseqs.com/foldseek/foldseek-linux-sse41.tar.gz as the README says.)
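A quick way to tell which kind of build you have is to search the help text for the option. The snippet below is only a sketch: has_alignment_type is a hypothetical helper name, and it relies on nothing beyond the foldseek cluster -h call shown earlier in this thread.

```shell
#!/bin/sh
# Sketch: report whether a given foldseek binary lists --alignment-type
# in the help text of its cluster module.
# $1 = name or path of the binary (defaults to "foldseek" on PATH).
has_alignment_type() {
    bin="${1:-foldseek}"
    # Help text may go to stdout or stderr, so capture both.
    if "$bin" cluster -h 2>&1 | grep -q -e '--alignment-type'; then
        echo "yes"
    else
        echo "no"
    fi
}

# Usage: compare the conda install with a precompiled binary, e.g.
#   has_alignment_type foldseek
#   has_alignment_type ./foldseek/bin/foldseek
```

If the conda build prints "no" while a precompiled binary prints "yes", the conda package is most likely an older version.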