[HowTo Question] Is it possible to merge mmseq search/taxonomy output from different runs?

JosepFAbril commented 4 years ago

I would appreciate it if you could help me with a couple of questions regarding MMseqs2. I have been running it in two ways: for alignment against sequence databases (mmseqs search -> convertalis) and for taxonomic binning (mmseqs taxonomy -> taxonomyreport), either with a single sequence set or with multiple sets after de-multiplexing barcodes from a sequencing run.

The first question is whether it is possible to get a taxonomy report directly from "search" or, the other way around, to get search alignments (convertalis-like output) from "taxonomy", in order to avoid two runs of mmseqs (one for search and one for taxonomy).
The second question relates to merging the mmseqs output, from either search or taxonomy, after running MMseqs2 separately on each sequence bin. Is it possible to run convertalis or taxonomyreport on the combined output of several mmseqs "search" or "taxonomy" runs?

Thanks for your assistance on those questions... Josep F

Environment

Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): MMseqs2 from github, version: 5fcc48fbf4f6697e73e1e2a4b3f53c6cdf87e8f1
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 4 AMD Opteron 6386 SE (avx/sse4_1/sse4_2), 64 cores, 512GB RAM
Operating system and version: Linux 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)

martin-steinegger commented 4 years ago

The most recent version (git master) of MMseqs2 can print taxonomic information through the convertalis --format-output columns taxid, taxname and taxlineage.

  mmseqs search query target aln tmp 
  mmseqs convertalis query target aln aln.m8  --format-output query,target,taxid,taxname,taxlineage
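Once aln.m8 carries the taxonomy columns, a rough per-taxon tally can be made with standard text tools. The following is only a sketch: the rows are made-up stand-ins for real convertalis output, the lineage column is a placeholder, and the tally counts hits rather than queries, so it is not a substitute for a proper LCA-based report.

```shell
# Hypothetical sample rows in the column order requested above:
# query, target, taxid, taxname, taxlineage (lineage shown as "-" placeholder).
workdir=$(mktemp -d)
{
  printf 'q1\tt5\t562\tEscherichia coli\t-\n'
  printf 'q2\tt7\t562\tEscherichia coli\t-\n'
  printf 'q3\tt1\t9606\tHomo sapiens\t-\n'
} > "$workdir/aln.m8"

# Count hits per (taxid, taxname). The tab field separator keeps
# taxon names that contain spaces in one field.
counts=$(awk -F'\t' '{ n[$3 "\t" $4]++ }
                     END { for (k in n) print n[k] "\t" k }' "$workdir/aln.m8" \
         | sort -k1,1nr)
printf '%s\n' "$counts"
```

A query with several hits is counted once per hit here, so filtering to the best hit per query first may be preferable.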

If you want a taxonomic report, you can run lca on the aln result. You may want to filter the result first, so that remote hits are not all considered in the lowest common ancestor computation.

  mmseqs search query target aln tmp 
  # this call extracts all highest-scoring hits; there can be multiple hits per query
  mmseqs filterdb aln aln_top --beats-first --filter-column 4 --comparison-operator le
  # compute the lca of the best scoring hits
  mmseqs lca target aln_top alnLca
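To make the filtering step concrete, its logic can be imitated on a plain-text alignment file. This sketch assumes that column 4 of the internal alignment database is the E-value, that hits are listed best-first per query, and that the text file uses the default 12-column convertalis layout with the E-value in column 11. The rows are invented, and the awk filter only illustrates the idea; it is not what filterdb runs internally.

```shell
# Hypothetical hits, best hit first for each query (default m8-style columns).
workdir=$(mktemp -d)
{
  echo 'q1 t5 0.95 100 5 0 1 100 1 100 1e-50 200'
  echo 'q1 t6 0.95 100 5 0 1 100 1 100 1e-50 200'
  echo 'q1 t9 0.60 100 40 0 1 100 1 100 1e-05 50'
  echo 'q2 t7 0.90 100 10 0 1 100 1 100 1e-30 150'
} | tr ' ' '\t' > "$workdir/aln.m8"

# Remember the E-value of the first hit of each query, then keep every
# row whose E-value is less than or equal to it ("beats first", "le").
awk -F'\t' '
  !($1 in best)        { best[$1] = $11 + 0 }
  ($11 + 0) <= best[$1]
' "$workdir/aln.m8" > "$workdir/aln_top.m8"

cat "$workdir/aln_top.m8"
```

Both tied best hits of q1 are kept while the weaker hit to t9 is dropped, which is why several hits per query can survive the filter.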

milot-mirdita commented 4 years ago

Generally, you have to run convertalis, taxonomyreport, etc. separately on each result.

However, you can bundle more queries into one run by giving more input fasta/q files to createdb:

mmseqs createdb fasta1.fa fasta2.fa query
mmseqs search query target aln tmp

Now you can additionally request the qset column from convertalis to see which input fasta file each search result came from.

mmseqs convertalis query target aln aln.m8 --format-output qset,query,target,etc...

You will get an output similar to this:

fasta1.fa q1 t5 ...
fasta1.fa q2 t7 ...
fasta2.fa q6 t1 ...
...
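If separate per-input files are still wanted after a bundled run, the qset column can be used to split the combined output again. A minimal sketch, assuming tab-separated output with qset as the first column and qset values that are plain file names without directory parts; the rows below are just the example above, truncated to three columns.

```shell
# Stand-in for the combined aln.m8 produced with qset as first column.
workdir=$(mktemp -d)
{
  echo 'fasta1.fa q1 t5'
  echo 'fasta1.fa q2 t7'
  echo 'fasta2.fa q6 t1'
} | tr ' ' '\t' > "$workdir/aln.m8"

# Write each row, minus the leading qset column, to <qset>.m8.
(
  cd "$workdir" &&
  awk -F'\t' '{
    out = $1 ".m8"             # output file named after the input set
    sub(/^[^\t]*\t/, "")       # drop the qset column from the row
    print > out
  }' aln.m8
)

ls "$workdir"
```

This leaves fasta1.fa.m8 and fasta2.fa.m8 next to the combined file, each holding only the hits of its own query set.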

Btw, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.

JosepFAbril commented 4 years ago

I was looking for a command/option to merge the raw alignment or taxonomy files once they have been computed on different input sequence sets against the same database (and then use the convertalis or taxonomyreport commands on the merged output). Some of those alignments were already calculated, and I wondered whether it was possible to avoid re-running them as a merged input file through the search/taxonomy commands. I really appreciate your suggestions and those by Martin, and I will take them into account for future searches with MMseqs2.

Thanks again for your help... Josep F
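For runs that are already finished, one text-level workaround is to run convertalis on each result as advised above and then concatenate the per-run tables, adding a column that records where each row came from. This is only a sketch and not an MMseqs2 feature: the file names bin1.m8 and bin2.m8 and their contents are hypothetical, and the approach merges text output, not the result databases themselves. Aggregated per-run reports would need to be re-tallied rather than concatenated.

```shell
# Hypothetical per-run convertalis outputs (query, target only, for brevity).
workdir=$(mktemp -d)
printf 'q1\tt5\n' > "$workdir/bin1.m8"
printf 'q6\tt1\n' > "$workdir/bin2.m8"

# Prefix every row with the name of the run it came from, then
# collect all rows in a single table.
for f in "$workdir"/bin*.m8; do
  label=$(basename "$f" .m8)
  awk -v label="$label" -F'\t' -v OFS='\t' '{ print label, $0 }' "$f"
done > "$workdir/merged.m8"

cat "$workdir/merged.m8"
```

The added first column plays the same role as qset in a bundled run, so downstream scripts can treat both layouts alike.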


