soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Question: recommendations for contaminant detection within transcriptome assembly #444

Closed ms-gx closed 3 years ago

ms-gx commented 3 years ago

I would like to use mmseqs taxonomy to detect contamination within a transcriptome assembly.

The transcriptome is from a metazoan organism. The contaminants are mainly bacterial. I would like to use the NR and NT databases for a start (and I have already run first analyses with MMseqs2 successfully), but I can also build my own database. EDIT: The level of contamination is quite high and the contamination is taxonomically quite diverse; otherwise this would be rather easy.

First question: do you have specific recommendations when dealing with a transcriptome as the query? For example taking into account the rather short contigs and splicing.

Second question: There are no genome assemblies from closely related species available. What are your thoughts regarding nucl-nucl search vs. translated_nucl-prot search in this case?

Third question: In case I do a translated_nucl-prot search, how do I deal with the fact that my queries are of both eukaryotic and prokaryotic origin? I mean regarding the translation table. Would you worry about this?

milot-mirdita commented 3 years ago

> First question: do you have specific recommendations when dealing with a transcriptome as the query? For example taking into account the rather short contigs and splicing.

I am not super familiar with transcriptomics datasets. Depending on the length of your queries, you might want to turn off the early ORF filter (--orf-filter 0), as it can sometimes reject too many ORFs if the sequences are short.
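A minimal sketch of how the suggested setting could be passed to a taxonomy run, assuming the query and target databases already exist. The database names (transcriptsDB, nrDB, taxResult) and the tmp directory are hypothetical placeholders; only the `--orf-filter 0` flag comes from the reply above. The snippet just assembles and prints the command line, it does not run MMseqs2.

```python
import shlex


def taxonomy_command(query_db, target_db, result_db, tmp_dir, orf_filter=0):
    """Build an `mmseqs taxonomy` argument list with the ORF filter setting.

    orf_filter=0 disables the early ORF filter, as suggested for short queries.
    All database names passed in are placeholders chosen by the caller.
    """
    return [
        "mmseqs", "taxonomy",
        query_db, target_db, result_db, tmp_dir,
        "--orf-filter", str(orf_filter),
    ]


# Hypothetical database names, for illustration only.
cmd = taxonomy_command("transcriptsDB", "nrDB", "taxResult", "tmp")
print(shlex.join(cmd))
```

To actually launch the run, the list could be handed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is on the PATH.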

We also have a different project that deals with contamination, conterminator (https://github.com/martin-steinegger/conterminator), though that tool currently only handles all-vs-all genome contamination annotation. Martin was planning to build a scan mode for one-vs-RefSeq/GenBank.

> Second question: There are no genome assemblies from closely related species available. What are your thoughts regarding nucl-nucl search vs. translated_nucl-prot search in this case?

Generally, our nucl-nucl searching capabilities are less developed than anything-prot. In the taxonomy assignment, nucl-nucl still uses top-hit taxon assignment instead of LCA. We haven't gotten around to thoroughly benchmarking the nucl-nucl homology search and publishing it (it does work quite well in the preliminary tests). Nucl-nucl homology search is also limited to hits with roughly 80% sequence identity or more (which is usually more than enough for taxonomy, though). Also, building a taxonomy database for the NT is painful and the database might get extremely large (I haven't tried it in a while, though).
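To illustrate the difference between the two assignment strategies mentioned above, here is a small sketch with made-up hits. It is a plain lowest-common-ancestor over lineage lists, not MMseqs2's actual procedure; the bit scores and the hit list are toy values invented for the example.

```python
def top_hit_taxon(hits):
    """Return the lineage of the single best-scoring hit."""
    return max(hits, key=lambda h: h["bits"])["lineage"]


def lca(hits):
    """Return the longest lineage prefix shared by all hits."""
    lineages = [h["lineage"] for h in hits]
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        shared.append(ranks[0])
    return shared


# Toy hits for one contig (scores are invented).
hits = [
    {"bits": 210, "lineage": ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Escherichia"]},
    {"bits": 205, "lineage": ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Salmonella"]},
    {"bits": 190, "lineage": ["Bacteria", "Pseudomonadota", "Betaproteobacteria", "Neisseria"]},
]

print("top hit:", " > ".join(top_hit_taxon(hits)))
print("LCA:    ", " > ".join(lca(hits)))
```

Top-hit assignment commits to the genus of the best hit even when near-equal hits disagree, while the LCA backs off to the deepest rank all hits share. For flagging bacterial contaminants in a metazoan transcriptome, either would place this toy contig in Bacteria.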

I'd be quite interested in what you find out in your experiments with it though. We do plan to develop this part further.

> Third question: In case I do a translated_nucl-prot search, how do I deal with the fact that my queries are of both eukaryotic and prokaryotic origin? I mean regarding the translation table. Would you worry about this?

By default, we extract fragments from any codon to a stop codon and repeat the procedure. So the translation table matters a bit less (it only has an effect through alternative stop codons). And since we only do local alignments, the over-extended start will get chopped off by the alignment anyway.
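The stop-to-stop idea can be sketched as follows. This is an illustration of the principle only, not MMseqs2's ORF extraction code: it handles a single forward frame, ignores the reverse strand, and uses a toy sequence. The stop-codon sets are those of NCBI translation table 1 (standard) and table 4 (where TGA encodes tryptophan).

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}  # NCBI translation table 1
TABLE4_STOPS = {"TAA", "TAG"}           # NCBI table 4: TGA encodes Trp


def stop_to_stop_fragments(seq, stops, frame=0, min_codons=1):
    """Split one forward reading frame into codon runs delimited by stop codons."""
    fragments, current = [], []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        if codon in stops:
            if len(current) >= min_codons:
                fragments.append("".join(current))
            current = []  # start a new fragment after the stop
        else:
            current.append(codon)
    if len(current) >= min_codons:
        fragments.append("".join(current))  # trailing fragment without a stop
    return fragments


seq = "GCTGCATGAAAAGGGTAACCC"  # toy sequence: GCT GCA TGA AAA GGG TAA CCC
print(stop_to_stop_fragments(seq, STANDARD_STOPS))
print(stop_to_stop_fragments(seq, TABLE4_STOPS))
```

With the standard table the TGA codon splits the sequence into an extra fragment; with table 4 it is read through. No start codon is required in either case, which is why the choice of table only matters where the stop codons differ.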

ms-gx commented 3 years ago

Thanks for your detailed reply!

> I am not super familiar with transcriptomics datasets. Depending on the length of your queries, you might want to turn off the early ORF filter (--orf-filter 0), as it can sometimes reject too many ORFs if the sequences are short.

I'll try the --orf-filter 0 option as you suggested. Yes, of course in the case of a transcriptome the ORFs are often short if the assembler is not able to resolve all the isoforms properly.

> Generally, our nucl-nucl searching capabilities are less developed than anything-prot. In the taxonomy assignment, nucl-nucl still uses top-hit taxon assignment instead of LCA. We haven't gotten around to thoroughly benchmarking the nucl-nucl homology search and publishing it (it does work quite well in the preliminary tests).

I ran a nucl-nucl taxonomy assignment (against NT) and I think it performed well. I did not check systematically, but I did some cross-checks with blastn and it reported the exact same hits. Assuming BLAST to be the "gold standard", I was quite impressed by the performance of your tool, because it was so much faster.
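A cross-check like the one described could be made systematic with a small script. The sketch below assumes both tools wrote 12-column tab-separated hit tables with the query in column 1, the target in column 2 and the bit score in column 12 (BLAST's tabular format 6 and the default MMseqs2 m8-style output share this layout). The contig and reference names and all scores are toy values invented for the example.

```python
import csv
import io


def best_hits(handle):
    """Map each query to its highest-bit-score target in a 12-column hit table."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, target, bits = row[0], row[1], float(row[11])
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return {q: t for q, (t, _) in best.items()}


def agreement(a, b):
    """Return (queries with identical top hit, queries present in both tables)."""
    shared = a.keys() & b.keys()
    same = sum(1 for q in shared if a[q] == b[q])
    return same, len(shared)


def row(query, target, bits):
    """Build one toy 12-column line; the middle columns are dummy values."""
    return "\t".join([query, target, "98.0", "300", "6", "0",
                      "1", "300", "11", "310", "1e-150", str(bits)])


# Toy stand-ins for the two result files.
blast_tsv = "\n".join([row("contig1", "refA", 520), row("contig1", "refB", 480),
                       row("contig2", "refC", 300)])
mmseqs_tsv = "\n".join([row("contig1", "refA", 515), row("contig2", "refD", 290),
                        row("contig3", "refE", 200)])

blast_best = best_hits(io.StringIO(blast_tsv))
mmseqs_best = best_hits(io.StringIO(mmseqs_tsv))
same, total = agreement(blast_best, mmseqs_best)
print(f"{same} of {total} shared queries have the same top hit")
```

For real data, the `io.StringIO` objects would be replaced by open file handles for the two result tables.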

> Also, building a taxonomy database for the NT is painful and the database might get extremely large (I haven't tried it in a while, though).

I just did that, following your instructions in the Wiki, and it worked.

I'll let you know in case I have something interesting to report.