Hierarchical clustering with different min-seq-id values

SimonKitSangChu commented 1 year ago

Expected Behavior

I wish to have multiple cluster runs, each with a different value of min-seq-id value. Ideally, the resultant representative form hierarchical subsets, i.e. representative sequences for min-seq-id = 50 form a subset of those for min-seq-id = 90, and vice versa.

Current Behavior

Neither 1) runs of different min-seq-id nor 2) runs of the same min-seq-id agree.

Related Issue(s)

https://github.com/soedinglab/MMseqs2/issues/663

milot-mirdita commented 1 year ago

You can achieve this by manually cascading your clustering steps. We have implemented something like this a long time ago for Uniclust (https://github.com/milot-mirdita/uniclust-pipeline/blob/9b37347d1e9bb0153efcb986b5d828f9f4a316f0/uniclust_workflow.sh#LL67C7-L128C24). However, this script is not up-to-date, but the general strategy is sound.

Regarding the consistency itself, please refer to this answer: https://github.com/soedinglab/MMseqs2/issues/115#issuecomment-423111623

SimonKitSangChu commented 1 year ago

I am unable to access the uniclust page (page not found). Would you mind sharing another link?

milot-mirdita commented 1 year ago

Sorry this one i meant:

https://github.com/soedinglab/uniclust-pipeline/blob/ec52afc312baab5178ce709caebb41bbae9bb3f0/uniclust_workflow.sh#L49-L133

milot-mirdita commented 1 year ago

You need to drop all instances of --max-seq-len and replace --target-cov 0.95 with -c 0.95 --cov-mode 1.

Lines 112-114 are quite expensive. I think that could be done smarter now, since we have the cluster reassignment mode in the Clustering workflow (which we didn't have back then).

You can replace that with (didn't try it out):

mmseqs cluster $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/tmp_clu" --cluster-reassign 1 -c 0.8 --min-seq-id 0.3
$RUNNER mmseqs align $INPUT $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" -e inf -a

SimonKitSangChu commented 1 year ago

Based on your suggestions, I have made a few little changes to the original script. In particular, I turned off OMP_PROC_BIND and changed the --max-seq-len and --target-cov 0.95 flags. I didn't yet replace line 112-114 with cluster reassignment.

#!/bin/bash -ex
[ "$#" -lt 2  ] && echo "Please provide <sequenceDB> <outDir>"  && exit 1;
[ ! -f "$1"   ] && echo "Sequence database $1 not found!"       && exit 1;
[   -d "$2"   ] && echo "Output directory $2 exists already!"   && exit 1;

function abspath() {
    if [ -d "$1" ]; then
        (cd "$1"; pwd)
    elif [ -f "$1" ]; then
        if [[ $1 == */* ]]; then
            echo "$(cd "${1%/*}"; pwd)/${1##*/}"
        else
            echo "$(pwd)/$1"
        fi
    fi
}

RELEASE="${3:-$(date "+%Y_%m")}"
SHORTRELEASE="${4:-$(date "+%y%m")}"

INPUT=$1
OUTDIR=$2/$RELEASE

TMPPATH=$OUTDIR/tmp
mkdir -p $TMPPATH

OUTDIR=$(abspath $OUTDIR)
TMPPATH=$(abspath $TMPPATH)

PREFILTER_COMMON="$COMMON"
PREFILTER_FRAG_PAR="--min-ungapped-score 100 --comp-bias-corr 0  -s 1 ${PREFILTER_COMMON}"
PREFILTER1_PAR="-c 0.9 --comp-bias-corr 1 -s 2 ${PREFILTER_COMMON}"
PREFILTER2_PAR="-c 0.8 --comp-bias-corr 1 -s 6  ${PREFILTER_COMMON}"
ALIGNMENT_COMMON="$COMMON -e 0.001 --max-seq-len 32768 --max-rejected 2147483647"
ALIGNMENT0_PAR="-c 0.9 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 0 ${ALIGNMENT_COMMON}"
ALIGNMENT1_PAR="-c 0.8 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 1 ${ALIGNMENT_COMMON}"
ALIGNMENT2_PAR="-c 0.8 --alignment-mode 3 --min-seq-id 0.3 --comp-bias-corr 1 ${ALIGNMENT_COMMON}"
CLUSTER_FRAG_PAR="--cluster-mode 2"
CLUSTER0_PAR="--cluster-mode 2"
CLUSTER1_PAR="--cluster-mode 0"
CLUSTER2_PAR="--cluster-mode 0"
SEARCH_PAR="$COMMON --profile --k-score 100"
CSTRANSLATE_PAR="-x 0.3 -c 4 -A $HHLIB/data/cs219.lib -D $HHLIB/data/context_data.lib -I ca3m -f -b"

SEQUENCE_DB="$OUTDIR/uniprot_db"

export OMP_PROC_BIND=false
mmseqs createdb "$INPUT" "${SEQUENCE_DB}"

STEP="_FRAG"
INPUT="${SEQUENCE_DB}"
$RUNNER mmseqs prefilter "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" ${PREFILTER_FRAG_PAR}
mmseqs rescorediagonal  "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" --min-seq-id 0.9 -c 0.95 --cov-mode 1
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_frag" ${CLUSTER_FRAG_PAR}
awk '{ print $1 }' "$TMPPATH/clu_frag.index" > "$TMPPATH/order_frag"
mmseqs createsubdb "$TMPPATH/order_frag" $INPUT "$TMPPATH/input_step_redundancy"

# filter redundancy
INPUT="$TMPPATH/input_step_redundancy"
mmseqs clusthash $INPUT "$TMPPATH/aln_redundancy" --min-seq-id 0.9
mmseqs cluster $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
awk '{ print $1 }' "$TMPPATH/clu_redundancy.index" > "$TMPPATH/order_redundancy"
mmseqs createsubdb "$TMPPATH/order_redundancy" $INPUT "$TMPPATH/input_step0"

# go down to 90%
STEP=0
INPUT="$TMPPATH/input_step0"
# Remove the fragments from the prefilter, in order not to recompute prefilter
mmseqs createsubdb  "$TMPPATH/order_redundancy"  "$TMPPATH/pref_step_FRAG"  "$TMPPATH/pref_step_FRAG_filtered"
mmseqs filterdb "$TMPPATH/pref_step_FRAG_filtered" "$TMPPATH/pref_step$STEP" --filter-file "$TMPPATH/order_redundancy"
$RUNNER mmseqs align "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT0_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
awk '{ print $1 }' "$TMPPATH/clu_step$STEP.index" > "$TMPPATH/order_step$STEP"
mmseqs createsubdb "$TMPPATH/order_step$STEP" $INPUT "$TMPPATH/input_step1"

# go down to 90% (this step is needed to create big clusters)
STEP=1
INPUT="$TMPPATH/input_step1"
$RUNNER mmseqs prefilter "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" ${PREFILTER1_PAR}
$RUNNER mmseqs align "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT1_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}

# create database unicluster 90% (we need to merge redundancy, step_0 and step_1)
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust90_$RELEASE \
    "$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1

awk '{ print $1 }' "$TMPPATH/clu_step$STEP.index" > "$TMPPATH/order_step$STEP"
mmseqs createsubdb "$TMPPATH/order_step$STEP" $INPUT "$TMPPATH/input_step2"

# now we cluster down to 30% sequence id to produce a 30% and 50% clustering
STEP=2
INPUT=$TMPPATH/input_step2
$RUNNER mmseqs prefilter $INPUT $INPUT "$TMPPATH/pref_step$STEP" ${PREFILTER2_PAR}
$RUNNER mmseqs align $INPUT $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT2_PAR}

# cluster down to 50%
mmseqs filterdb "$TMPPATH/aln_step$STEP" "$TMPPATH/aln_uniclust50" \
    --filter-column 3 --filter-regex '(0\.[5-9][0-9]{2}|1\.000)'
mmseqs cluster $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust50_$RELEASE \
    "$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust50

STEP=2
INPUT=$TMPPATH/input_step2
# cluster down to 30%
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust30_$RELEASE \
    "$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust30

However, I experience another error in step 2 cluster.

linclust /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/clu_redundancy /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust --cluster-mode 0 --max-iter
ations 1000 --similarity-type 2 --threads 6 --compressed 0 -v 3 --sub-mat 'aa:blosum62.out,nucl:nucle
otide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-i
d 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-c
orr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0
 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --sco
re-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 -
-gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21
 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-pr
ob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --
ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-re
use 0                                                                                                

kmermatcher /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust/2296487886038157831/pref --
sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 
21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-mem
ory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 6 --compressed 0 -v 3  

kmermatcher /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust/2296487886038157831/pref --
sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 
21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-
prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-mem
ory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 6 --compressed 0 -v 3 

Invalid non-numeric value for environment variable MMSEQS_CALL_DEPTH!
Error: kmermatcher died
Error: linclust died

My knowledge of mmseqs and its output format is limited and any help would be much appreciated.

I also wonder why there does not seem to be any file/flag directing the output of uniclust50 to uniclust30, which might not make it hierarchical. If I want to also implement a "uniclust70" -> uniclust50 -> uniclust30, which line(s) should I look for? Let me know if I have missed anything.

milot-mirdita commented 1 year ago

This issue is fixed in Git, but not part of a release yet. I think I fixed it shortly after the last release. You can download precompiled binaries at mmseqs.com/latest.

These 5 lines are intended to call the clust module which implements the actual clustering algorithm, not the whole cluster workflow with the searches etc:

mmseqs cluster $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}

should be

mmseqs clust $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}

For a 70% clustering you can adapt the 50% clustering strategy:

# cluster down to 70%
mmseqs filterdb "$TMPPATH/aln_step$STEP" "$TMPPATH/aln_uniclust70" \
    --filter-column 3 --filter-regex '(0\.[7-9][0-9]{2}|1\.000)'
mmseqs cluster $INPUT "$TMPPATH/aln_uniclust70" "$TMPPATH/clu_uniclust70" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust70_$RELEASE \
    "$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust70

The idea here is to go directly from 90 to 30 to avoid any possible transitivity issues. In the MMseqs2 cascaded clustering you can get representative->member hits below the given thresholds after putting clusters from previous clustering steps together.

You can reconstruct the 50 (and a 70) clustering from the 30% alignments, and that's what it does (it does it with a roundabout regex, because we didn't have expression filters back then).

SimonKitSangChu commented 1 year ago

I built mmseqs from the latest Git and the issue is solved. There is now another warning.

there must be an error: 37 deleted from 4 that now is empty, but not assigned to a cluster
there must be an error: 49 deleted from 80 that now is empty, but not assigned to a cluster
there must be an error: 42 deleted from 52 that now is empty, but not assigned to a cluster
there must be an error: 42 deleted from 27 that now is empty, but not assigned to a cluster
there must be an error: 128 deleted from 120 that now is empty, but not assigned to a cluster
Total time: 0h 0m 0s 170ms

Interestingly it only happens at 50% level. Not 30% or 70%. Let me know if you identify it as a problem.

SimonKitSangChu commented 1 year ago

Another issue I notice is that not all representative sequences in 30% are found in 50% or 70%, i.e. the clustering is not "hierarchical".

I added the following lines to extract the representative sequences from the clustered.

mmseqs createsubdb $OUTDIR/uniclust50_${RELEASE} $SEQUENCE_DB $OUTDIR/uniclust50_${RELEASE}_rep
mmseqs convert2fasta $OUTDIR/uniclust50_${RELEASE}_rep uniclust50_rep.fasta

SimonKitSangChu commented 1 year ago

One more question. Which part of the script should be changed if I am interested in doing the same with foldseek?

milot-mirdita commented 1 year ago

Did you solve the issue with the clustering not being hierarchical? This should produce a hierarchical clustering.

Foldseek is quite a bit more tricky since it uses different modules will need a lot of parameter changes.

SimonKitSangChu commented 1 year ago

Yes, I am still experiencing the same issue, i.e. some sequences in clust30 do not show up in clust50 / clust70.

soedinglab / MMseqs2