soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.4k stars 194 forks source link

Cluster with extractalignedregion result db #210

Open acpguedes opened 5 years ago

acpguedes commented 5 years ago

Hi, I think there is some inconsistency in extractalignedregion module to generate database with regions.

I'm trying to cluster aligned regions belonging to a db.

mmseqs search id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN /tmp/ --threads 40 -s 7.5 -a -e 1e-10 --alt-ali 1 &> allvsall.search.log 
mmseqs convertalis id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.tsv --threads 40 
mmseqs extractalignedregion id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.SEQ --threads 40
mmseqs cluster pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ.CLU /tmp/ -e 1e-10 -c 0 -a -s 7.5 --threads 40 

But it fails in the last step with this message

rescorediagonal pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ /tmp//1456503536827778696/linclust/7100099902620488817/pref /tmp//1456503536827778696/linclust/7100099902620488817/pref_rescore1 --sub-mat blosum62.out --rescore-mode 0 --filter-hits 0 -e 1e-10 -c 0.5 -a 1 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --global-alignment 0 --db-load-mode 0 --threads 40 --compressed 0 -v 3 

[=================================================================] 204.52K 0s 215ms
Time for merging files: 0h 0m 0s 94ms
Time for processing: 0h 0m 0s 845ms
clust pbps.neighbors.rep30.ALN.SEQ /tmp//1456503536827778696/linclust/7100099902620488817/pref_rescore1 /tmp//1456503536827778696/linclust/7100099902620488817/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3 

Sequence db size != result db size
Error: Pre-clustering step died
Error: linclust died

I think it is happening because of the number of files generated by the extractalignedregion module. If it isn't, how do I get the aligned regions to use in a search? I'm trying to get aligned regions to cluster them to find possible domains in a set of protein.

Thanks in advance.

milot-mirdita commented 5 years ago

extractalignedregion is doing something unexpected/wrong when it encounters more than one hit per query.

This issue didn't occur in our taxonomy module since we only give it at most one alignment. You can get around the issue with something like this:

mmseqs filterdb pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.top1 --extract-lines 1
mseqs extractalignedregion id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN.top1 pbps.neighbors.rep30.ALN.SEQ --threads 40
...

However I am not sure this is what you want.

acpguedes commented 5 years ago

Hi @milot-mirdita

I was needing more than one hit per query. Actually, I was trying to find domains/regions with a high score but since I have a lot of multidomain proteins I need all high score local alignments. So I remove redundancy at 30% of identity, searched all against all, I have removed self-matches and choose hits with very low e-value. I preferred to do this externally to MMSeqs.

Now I'm clustering these regions.

Anyway, I did test those commands anyway.

mmseqs filterdb pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.top1 --extract-lines 1
mmseqs extractalignedregion id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN.top1 pbps.neighbors.rep30.ALN.SEQ --threads 40
mmseqs cluster pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ.CLU -c 0 -e 1e-10 --threads 40 --max-seqs 300 --kmer-per-seq 80 -s 7.5 -a
mmseqs createtsv pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ.CLU pbps.neighbors.rep30.ALN.SEQ.CLU.tsv

Also, I did use coordinates of matches in target and extract with any other sequence manipulation tool.

mmseqs createdb pbps.neighbors.regions.fa pbps.neighbors.regions.DB
mmseqs cluster pbps.neighbors.regions.DB pbps.neighbors.regions.CLU /tmp/ --threads 40 -c 0.8 --min-seq-id 0.8
mmseqs createtsv pbps.neighbors.regions.DB pbps.neighbors.regions.CLU pbps.neighbors.regions.CLU.tsv --threads 40
mmseqs result2repseq pbps.neighbors.regions.DB pbps.neighbors.regions.CLU pbps.neighbors.regions.CLU.REP 
ln -s pbps.neighbors.regions.DB_h pbps.neighbors.regions.CLU.REP_h 
ln -s pbps.neighbors.regions.DB_h.index pbps.neighbors.regions.CLU.REP_h.index 
mmseqs cluster pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU tmp -c 0 -e 1e-10 --threads 40 --max-seqs 300 --kmer-per-seq 80 -s 7.5 -a 
mmseqs createtsv pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.tsv --threads 40 
mmseqs createseqfiledb pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.FADB --threads 40 
mmseqs apply pbps.neighbors.regions.CLU.REP.CLU.FADB pbps.neighbors.regions.CLU.REP.CLU.FADB.MSA --threads 40 -- clustalo -i - --threads=1 --iter=3 
mmseqs createseqfiledb pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.FADB --threads 40 --min-sequences 10

There are other strategies using mmseqs to find a conserved region (domains) in a set of proteins?

milot-mirdita commented 5 years ago

Your initial idea is sound. The only problem is that extractalignedregion does not really support it since it was built for one very specific purpose and should be generalized. I started working on that though I don't quite have time to finish it up.

I have an idea for a hack though:

mmseqs prefixid aln aln.tsv --tsv
awk '{ $1 = NR++; print $0; }' aln.tsv > aln_new.tsv
mmseqs tsv2db aln_new.tsv aln_new --output-dbtype 5
mmseqs extractalignedregion query target aln_new target_new --threads 1
awk '{ print $2; }' aln.tsv > target_ids.tsv
mmseqs createsubdb target_ids.tsv target_h target_new_h
sort -k2,2 target_new_h.index | awk '{ $1 = NR++; print $0; }' > target_new_h.index_new
mv -f target_new_h.index_new target_new_h.index
mmseqs cluster target_new clu tmp

I wrote this down without testing, but this should create a valid domain sequence database from the alignment result.

If you are interested in actually fixing the extractalignedregion module in the MMseqs2 codebase then I could also try guiding you through that.

acpguedes commented 5 years ago

@milot-mirdita this week I'm quite busy finishing up my writing. But sure I'm interested in fixing the extractalignedregion module. Actually, I'm glad about this idea.

milot-mirdita commented 5 years ago

I corrected a few mistakes in the hack above to generate a valid new sequence database that works with extractalignedregion.