Closed by ms-gx 3 years ago
First question: do you have specific recommendations when dealing with a transcriptome as the query? For example taking into account the rather short contigs and splicing.
I am not super familiar with transcriptomics datasets. Depending on the length of your queries, you might want to turn off the early ORF filter (--orf-filter 0), as it can sometimes reject too many ORFs if the sequences are short.
We also have a different project that deals with contamination: https://github.com/martin-steinegger/conterminator. That tool currently only handles all-vs-all genome contamination annotation, but Martin was planning to build a scan mode for one-vs-RefSeq/GenBank.
Second question: There are no genome assemblies from closely related species available. What are your thoughts regarding nucl-nucl search VS translated_nucl-prot search in this case?
Generally, our nucl-nucl searching capabilities are less developed than anything-prot. In the taxonomy assignment, nucl-nucl still uses top-hit taxon assignment instead of LCA. We haven't gotten around to thoroughly benchmarking the nucl-nucl homology search and publishing it (it does work quite well in the preliminary tests). Nucl-nucl homology search is also limited to roughly 80% sequence identity (which is usually more than enough for taxonomy, though). Also, building a taxonomy database for the NT is painful and the database might get extremely large (I haven't tried it in a while, though).
I'd be quite interested in what you find out in your experiments with it though. We do plan to develop this part further.
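To make the top-hit versus LCA distinction mentioned above concrete, here is a minimal Python sketch. It is not MMseqs2 code: the `LINEAGES` table, the scores, and both helper functions are hypothetical and only illustrate why the two strategies can give different labels when several taxa hit almost equally well.

```python
# Toy lineages (root -> leaf). Hypothetical data, not real NCBI taxonomy records.
LINEAGES = {
    "E. coli": ["cellular organisms", "Bacteria", "Proteobacteria", "Escherichia", "E. coli"],
    "S. enterica": ["cellular organisms", "Bacteria", "Proteobacteria", "Salmonella", "S. enterica"],
    "B. subtilis": ["cellular organisms", "Bacteria", "Firmicutes", "Bacillus", "B. subtilis"],
}


def top_hit(hits):
    """hits: non-empty list of (taxon, score). Return the taxon of the best-scoring hit."""
    return max(hits, key=lambda hit: hit[1])[0]


def lca(hits):
    """Return the deepest taxonomy node shared by the lineages of all hits."""
    lineages = [LINEAGES[taxon] for taxon, _score in hits]
    shared = None
    # Walk all lineages in parallel from the root and stop at the first disagreement.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        shared = nodes[0]
    return shared


hits = [("E. coli", 250.0), ("S. enterica", 248.0)]
print("top hit:", top_hit(hits))
print("LCA:    ", lca(hits))
```

With two near-identical scores, the top-hit rule commits to one species while the LCA rule backs off to the rank both hits share, which is the more conservative call for ambiguous contigs.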
Third question: In case I do a translated_nucl-prot search, how do I deal with the fact that my queries are both from Euk. and Prok. origin? I mean regarding the translation table. Would you worry about this?
By default, we extract fragments from any codon to a stop codon and repeat the procedure. So the translation table matters a bit less (only affected by alternative stop codons). And since we only do local alignments, the over-extended start will get chopped off through the alignment anyway.
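The stop-to-stop idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MMseqs2 implementation: the function names, the `min_codons` parameter, and the use of the standard-code stop codons are all choices made for this example.

```python
# Stop codons of the standard genetic code. Alternative translation tables
# differ mainly in this set, which is why the table matters less here.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters pass through unchanged."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def stop_to_stop_fragments(seq, min_codons=1):
    """Yield (strand, frame, fragment) for every stop-free codon run in all six frames.

    min_codons is a hypothetical length cutoff for this sketch.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            # Visit every complete codon in this frame.
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= 3 * min_codons:
                        yield strand, frame, s[start:pos]
                    start = pos + 3  # resume right after the stop codon
            # Emit the trailing run that reaches the end of the contig.
            end = frame + ((len(s) - frame) // 3) * 3
            if end - start >= 3 * min_codons:
                yield strand, frame, s[start:end]


for strand, frame, fragment in stop_to_stop_fragments("ATGAAATAAGGGCCC"):
    print(strand, frame, fragment)
```

No start codon is required, so a fragment may begin upstream of the true coding start. As noted above, a local alignment simply does not extend into that over-extended part.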
Thanks for your detailed reply!
I am not super familiar with transcriptomics datasets. Depending on the length of your queries, you might want to turn off the early ORF filter (--orf-filter 0), as it can sometimes reject too many ORFs if the sequences are short.
I'll try the --orf-filter 0 option as you suggested. Yes, of course, in the case of a transcriptome the ORFs are often short if the assembler is not able to resolve all the isoforms properly.
Generally, our nucl-nucl searching capabilities are less developed than anything-prot. In the taxonomy assignment, nucl-nucl still uses top-hit taxon assignment instead of LCA. We haven't gotten around to thoroughly benchmarking the nucl-nucl homology search and publishing it (it does work quite well in the preliminary tests).
I ran a nucl-nucl taxonomy assignment (against NT) and I think it performed well. I did not check systematically, but I did some cross-checks with blastn and it reported the exact same hits. Assuming BLAST to be the "gold standard", I was quite impressed by the performance of your tool, because it was so much faster.
Also building a taxonomy database for the NT is painful and the database might get extremely large (haven't tried it in a while though).
I just did that, following your instructions in the Wiki, and it worked.
I'll let you know in case I have something interesting to report.
I would like to use mmseqs taxonomy to detect contaminations within a transcriptome assembly. The transcriptome is from a metazoan organism. Contaminations are mainly bacterial. I would like to use the NR and NT databases for a start (and I successfully ran first analyses with mmseqs2). But I can also build my own database. EDIT: The level of contamination is quite high and the contamination is taxonomically quite diverse. Otherwise it would be rather easy.
First question: do you have specific recommendations when dealing with a transcriptome as the query? For example taking into account the rather short contigs and splicing.
Second question: There are no genome assemblies from closely related species available. What are your thoughts regarding nucl-nucl search VS translated_nucl-prot search in this case?
Third question: In case I do a translated_nucl-prot search, how do I deal with the fact that my queries are both from Euk. and Prok. origin? I mean regarding the translation table. Would you worry about this?