Open SimonKitSangChu opened 1 year ago
You can achieve this by manually cascading your clustering steps. We implemented something like this a long time ago for Uniclust (https://github.com/milot-mirdita/uniclust-pipeline/blob/9b37347d1e9bb0153efcb986b5d828f9f4a316f0/uniclust_workflow.sh#LL67C7-L128C24). That script is no longer up to date, but the general strategy is sound.
Regarding the consistency itself, please refer to this answer: https://github.com/soedinglab/MMseqs2/issues/115#issuecomment-423111623
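To make the cascading idea concrete, here is a minimal sketch of how two clustering levels can be composed by hand. It assumes each level has been exported to a two-column, tab-separated file of representative and member IDs (for example with mmseqs createtsv). The function name merge_levels and the file layout are illustrative assumptions, not part of MMseqs2; the Uniclust script uses mmseqs mergeclusters for this step.

```shell
# Sketch only: compose a fine clustering with a coarser clustering of its
# representatives. Both inputs are hypothetical TSV files: rep<TAB>member.
merge_levels() {
  # $1 = coarse TSV (its members are the representatives of the fine level)
  # $2 = fine TSV
  awk -F'\t' -v OFS='\t' '
    NR == FNR { top[$2] = $1; next }    # first file: fine rep -> coarse rep
    ($1 in top) { print top[$1], $2 }   # second file: relabel every member
  ' "$1" "$2"
}
```

Every member is relabelled with the representative of the cluster that absorbed its own representative, so the merged assignment is nested by construction.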
I am unable to access the uniclust page (page not found). Would you mind sharing another link?
You need to drop all instances of --max-seq-len and replace --target-cov 0.95 with -c 0.95 --cov-mode 1.
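As a rough illustration of that flag change, the sketch below rewrites a parameter string with sed. The helper name modernize_flags is made up for this example, and it assumes --max-seq-len is always followed by an integer and --target-cov by a decimal number; check the result by hand before using it on a real script.

```shell
# Sketch only: drop "--max-seq-len N" and turn "--target-cov X" into
# "-c X --cov-mode 1" inside a parameter string passed as $1.
modernize_flags() {
  printf '%s\n' "$1" | sed -E \
    -e 's/(^|[[:space:]])--max-seq-len[[:space:]]+[0-9]+//g' \
    -e 's/--target-cov[[:space:]]+([0-9.]+)/-c \1 --cov-mode 1/g'
}

modernize_flags "-e 0.001 --max-seq-len 32768 --max-rejected 2147483647"
modernize_flags "--min-seq-id 0.9 --target-cov 0.95"
```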
Lines 112-114 are quite expensive. I think that could be done smarter now, since we have the cluster reassignment mode in the Clustering workflow (which we didn't have back then).
You can replace that with (didn't try it out):
mmseqs cluster $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/tmp_clu" --cluster-reassign 1 -c 0.8 --min-seq-id 0.3
$RUNNER mmseqs align $INPUT $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" -e inf -a
Based on your suggestions, I have made a few small changes to the original script. In particular, I turned off OMP_PROC_BIND and changed the --max-seq-len and --target-cov 0.95 flags. I have not yet replaced lines 112-114 with cluster reassignment.
#!/bin/bash -ex
[ "$#" -lt 2 ] && echo "Please provide <sequenceDB> <outDir>" && exit 1;
[ ! -f "$1" ] && echo "Sequence database $1 not found!" && exit 1;
[ -d "$2" ] && echo "Output directory $2 exists already!" && exit 1;
function abspath() {
    if [ -d "$1" ]; then
        (cd "$1"; pwd)
    elif [ -f "$1" ]; then
        if [[ $1 == */* ]]; then
            echo "$(cd "${1%/*}"; pwd)/${1##*/}"
        else
            echo "$(pwd)/$1"
        fi
    fi
}
RELEASE="${3:-$(date "+%Y_%m")}"
SHORTRELEASE="${4:-$(date "+%y%m")}"
INPUT=$1
OUTDIR=$2/$RELEASE
TMPPATH=$OUTDIR/tmp
mkdir -p $TMPPATH
OUTDIR=$(abspath $OUTDIR)
TMPPATH=$(abspath $TMPPATH)
PREFILTER_COMMON="$COMMON"
PREFILTER_FRAG_PAR="--min-ungapped-score 100 --comp-bias-corr 0 -s 1 ${PREFILTER_COMMON}"
PREFILTER1_PAR="-c 0.9 --comp-bias-corr 1 -s 2 ${PREFILTER_COMMON}"
PREFILTER2_PAR="-c 0.8 --comp-bias-corr 1 -s 6 ${PREFILTER_COMMON}"
ALIGNMENT_COMMON="$COMMON -e 0.001 --max-seq-len 32768 --max-rejected 2147483647"
ALIGNMENT0_PAR="-c 0.9 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 0 ${ALIGNMENT_COMMON}"
ALIGNMENT1_PAR="-c 0.8 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 1 ${ALIGNMENT_COMMON}"
ALIGNMENT2_PAR="-c 0.8 --alignment-mode 3 --min-seq-id 0.3 --comp-bias-corr 1 ${ALIGNMENT_COMMON}"
CLUSTER_FRAG_PAR="--cluster-mode 2"
CLUSTER0_PAR="--cluster-mode 2"
CLUSTER1_PAR="--cluster-mode 0"
CLUSTER2_PAR="--cluster-mode 0"
SEARCH_PAR="$COMMON --profile --k-score 100"
CSTRANSLATE_PAR="-x 0.3 -c 4 -A $HHLIB/data/cs219.lib -D $HHLIB/data/context_data.lib -I ca3m -f -b"
SEQUENCE_DB="$OUTDIR/uniprot_db"
export OMP_PROC_BIND=false
mmseqs createdb "$INPUT" "${SEQUENCE_DB}"
STEP="_FRAG"
INPUT="${SEQUENCE_DB}"
$RUNNER mmseqs prefilter "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" ${PREFILTER_FRAG_PAR}
mmseqs rescorediagonal "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" --min-seq-id 0.9 -c 0.95 --cov-mode 1
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_frag" ${CLUSTER_FRAG_PAR}
awk '{ print $1 }' "$TMPPATH/clu_frag.index" > "$TMPPATH/order_frag"
mmseqs createsubdb "$TMPPATH/order_frag" $INPUT "$TMPPATH/input_step_redundancy"
# filter redundancy
INPUT="$TMPPATH/input_step_redundancy"
mmseqs clusthash $INPUT "$TMPPATH/aln_redundancy" --min-seq-id 0.9
mmseqs cluster $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
awk '{ print $1 }' "$TMPPATH/clu_redundancy.index" > "$TMPPATH/order_redundancy"
mmseqs createsubdb "$TMPPATH/order_redundancy" $INPUT "$TMPPATH/input_step0"
# go down to 90%
STEP=0
INPUT="$TMPPATH/input_step0"
# Remove the fragments from the prefilter, in order not to recompute prefilter
mmseqs createsubdb "$TMPPATH/order_redundancy" "$TMPPATH/pref_step_FRAG" "$TMPPATH/pref_step_FRAG_filtered"
mmseqs filterdb "$TMPPATH/pref_step_FRAG_filtered" "$TMPPATH/pref_step$STEP" --filter-file "$TMPPATH/order_redundancy"
$RUNNER mmseqs align "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT0_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
awk '{ print $1 }' "$TMPPATH/clu_step$STEP.index" > "$TMPPATH/order_step$STEP"
mmseqs createsubdb "$TMPPATH/order_step$STEP" $INPUT "$TMPPATH/input_step1"
# go down to 90% (this step is needed to create big clusters)
STEP=1
INPUT="$TMPPATH/input_step1"
$RUNNER mmseqs prefilter "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" ${PREFILTER1_PAR}
$RUNNER mmseqs align "$INPUT" "$INPUT" "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT1_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}
# create database unicluster 90% (we need to merge redundancy, step_0 and step_1)
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust90_$RELEASE \
"$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1
awk '{ print $1 }' "$TMPPATH/clu_step$STEP.index" > "$TMPPATH/order_step$STEP"
mmseqs createsubdb "$TMPPATH/order_step$STEP" $INPUT "$TMPPATH/input_step2"
# now we cluster down to 30% sequence id to produce a 30% and 50% clustering
STEP=2
INPUT=$TMPPATH/input_step2
$RUNNER mmseqs prefilter $INPUT $INPUT "$TMPPATH/pref_step$STEP" ${PREFILTER2_PAR}
$RUNNER mmseqs align $INPUT $INPUT "$TMPPATH/pref_step$STEP" "$TMPPATH/aln_step$STEP" ${ALIGNMENT2_PAR}
# cluster down to 50%
mmseqs filterdb "$TMPPATH/aln_step$STEP" "$TMPPATH/aln_uniclust50" \
--filter-column 3 --filter-regex '(0\.[5-9][0-9]{2}|1\.000)'
mmseqs cluster $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust50_$RELEASE \
"$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust50
STEP=2
INPUT=$TMPPATH/input_step2
# cluster down to 30%
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust30_$RELEASE \
"$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust30
However, I experience another error in the step 2 cluster call.
linclust /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/clu_redundancy /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 6 --compressed 0 -v 3 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0
kmermatcher /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust/2296487886038157831/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 6 --compressed 0 -v 3
kmermatcher /home/outdir/2023_06/tmp/input_step2 /home/outdir/2023_06/tmp/clu_uniclust30/13156544047496927710/linclust/2296487886038157831/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 6 --compressed 0 -v 3
Invalid non-numeric value for environment variable MMSEQS_CALL_DEPTH!
Error: kmermatcher died
Error: linclust died
My knowledge of mmseqs and its output format is limited and any help would be much appreciated.
I also wonder why there does not seem to be any file/flag directing the output of uniclust50 to uniclust30, which might not make it hierarchical. If I want to also implement a "uniclust70" -> uniclust50 -> uniclust30, which line(s) should I look for? Let me know if I have missed anything.
This issue is fixed in Git, but not part of a release yet. I think I fixed it shortly after the last release. You can download precompiled binaries at mmseqs.com/latest.
These 5 lines are intended to call the clust module, which implements the actual clustering algorithm, not the whole cluster workflow with the searches etc.:
mmseqs cluster $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}
should be
mmseqs clust $INPUT "$TMPPATH/aln_redundancy" "$TMPPATH/clu_redundancy" ${CLUSTER_FRAG_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER1_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_uniclust50" "$TMPPATH/clu_uniclust50" ${CLUSTER2_PAR}
mmseqs clust $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_uniclust30" ${CLUSTER2_PAR}
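If you would rather apply this rename mechanically, a small sed filter along the following lines could do it. The helper name use_clust_module is hypothetical, and the sketch rewrites every line that starts with "mmseqs cluster ", so it would also change a line where the full cluster workflow is intended (such as the cluster reassignment call suggested earlier in this thread).

```shell
# Sketch only: read a script on stdin and switch lines that begin with
# "mmseqs cluster " (optionally indented) to the clust module.
use_clust_module() {
  sed -E 's/^([[:space:]]*)mmseqs cluster /\1mmseqs clust /'
}

printf '%s\n' 'mmseqs cluster $INPUT "$TMPPATH/aln_step$STEP" "$TMPPATH/clu_step$STEP" ${CLUSTER0_PAR}' | use_clust_module
```

Lines that call other modules, such as the prefilter and align steps, pass through unchanged.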
For a 70% clustering you can adapt the 50% clustering strategy:
# cluster down to 70%
mmseqs filterdb "$TMPPATH/aln_step$STEP" "$TMPPATH/aln_uniclust70" \
--filter-column 3 --filter-regex '(0\.[7-9][0-9]{2}|1\.000)'
mmseqs clust $INPUT "$TMPPATH/aln_uniclust70" "$TMPPATH/clu_uniclust70" ${CLUSTER2_PAR}
mmseqs mergeclusters "${SEQUENCE_DB}" $OUTDIR/uniclust70_$RELEASE \
"$TMPPATH/clu_frag" "$TMPPATH/clu_redundancy" $TMPPATH/clu_step0 $TMPPATH/clu_step1 $TMPPATH/clu_uniclust70
The idea here is to go directly from 90 to 30 to avoid any possible transitivity issues. In the MMseqs2 cascaded clustering you can get representative->member hits below the given thresholds after putting clusters from previous clustering steps together.
You can reconstruct the 50 (and a 70) clustering from the 30% alignments, and that's what it does (it does it with a roundabout regex, because we didn't have expression filters back then).
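To show what that regex filter amounts to, the sketch below applies the same selection to a few made-up rows with awk instead of mmseqs filterdb. The rows, the column layout (target, score, sequence identity) and the function names are assumptions for illustration only. The regex is anchored here, which gives the same result as long as identities are printed with three decimals.

```shell
# Sketch only: rows are plain tab-separated text (target, score, identity),
# a stand-in for the contents of an MMseqs2 alignment result entry.

# Keep rows whose third column is numerically >= the threshold in $1.
filter_by_identity() {
  awk -F'\t' -v min="$1" '($3 + 0) >= (min + 0)'
}

# Same selection for the 50% case, written as the pattern from the script
# ({2} is spelled out as [0-9][0-9] so that older awk versions accept it).
filter_by_regex_50() {
  awk -F'\t' '$3 ~ /^(0\.[5-9][0-9][0-9]|1\.000)$/'
}

rows=$(printf '10\t250\t0.734\n11\t180\t0.499\n12\t300\t1.000\n13\t90\t0.500\n')
printf '%s\n' "$rows" | filter_by_identity 0.5   # keeps targets 10, 12 and 13
printf '%s\n' "$rows" | filter_by_identity 0.7   # keeps targets 10 and 12
```

The numeric form works for any threshold, so a 70% level needs only a different argument rather than a new pattern.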
I built mmseqs from the latest Git and the issue is solved. There is now another warning.
there must be an error: 37 deleted from 4 that now is empty, but not assigned to a cluster
there must be an error: 49 deleted from 80 that now is empty, but not assigned to a cluster
there must be an error: 42 deleted from 52 that now is empty, but not assigned to a cluster
there must be an error: 42 deleted from 27 that now is empty, but not assigned to a cluster
there must be an error: 128 deleted from 120 that now is empty, but not assigned to a cluster
Total time: 0h 0m 0s 170ms
Interestingly, it only happens at the 50% level, not at 30% or 70%. Let me know if you identify it as a problem.
Another issue I notice is that not all representative sequences in 30% are found in 50% or 70%, i.e. the clustering is not "hierarchical".
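For anyone who wants to reproduce this check, here is a minimal sketch of how nestedness of the representative sets could be tested. It assumes each level's representative IDs have been written to a plain text file, one ID per line (for instance the unique first column of an mmseqs createtsv export). The helper name missing_reps and the file names are hypothetical.

```shell
# Sketch only: print IDs that occur in file $1 but not in file $2.
# For a hierarchical result, the 30% representatives should all appear
# among the 50% representatives, so the output should be empty.
missing_reps() {
  tmp_a=$(mktemp) || return 1
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
  sort -u "$1" > "$tmp_a"
  sort -u "$2" > "$tmp_b"
  comm -23 "$tmp_a" "$tmp_b"   # lines unique to the first file
  rm -f "$tmp_a" "$tmp_b"
}

# Hypothetical usage:
#   missing_reps rep30.txt rep50.txt
#   missing_reps rep50.txt rep70.txt
```

Any ID printed by the first call is a 30% representative that is not a representative at the 50% level.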
I added the following lines to extract the representative sequences from the clusterings.
mmseqs createsubdb $OUTDIR/uniclust50_${RELEASE} $SEQUENCE_DB $OUTDIR/uniclust50_${RELEASE}_rep
mmseqs convert2fasta $OUTDIR/uniclust50_${RELEASE}_rep uniclust50_rep.fasta
One more question. Which part of the script should be changed if I am interested in doing the same with foldseek?
Did you solve the issue with the clustering not being hierarchical? This should produce a hierarchical clustering.
Foldseek is quite a bit more tricky, since it uses different modules and will need a lot of parameter changes.
Yes, I am still experiencing the same issue, i.e. some sequences in clust30 do not show up in clust50 / clust70.
Expected Behavior
I wish to have multiple cluster runs, each with a different min-seq-id value. Ideally, the resulting representatives form hierarchical subsets, i.e. the representative sequences for min-seq-id = 50 form a subset of those for min-seq-id = 90, and so on.
Current Behavior
Neither 1) runs of different min-seq-id nor 2) runs of the same min-seq-id agree.
Related Issue(s)
https://github.com/soedinglab/MMseqs2/issues/663