Use Plass for euk metagenomics data

liuxianghui commented 4 years ago

I want to extract eukaryotic genes/proteins from metagenomic data and build a gene/protein catalog for eukaryotes. It seems that MetaEuk is a reference-guided approach (based on MMseqs2) and Plass is a de novo approach (not relying on reference protein sequences). I don't understand the statement in your paper about Plass and eukaryotic protein assembly:

"Our chief limitation is that, unlike nucleotide assemblers, Plass cannot place the assembled protein sequences into genomic context. Furthermore, it cannot assemble intron-containing eukaryotic proteins, although, as shown, it can assemble eukaryotic proteins from transcriptome data. Another drawback is its inability to resolve homologous proteins from closely related strains or species with sequence identities above ~95%. However, the impact on the accuracy of predicted functions is low (Fig. 2) and bacterial phenotypes are determined more by the complement of horizontally acquired accessory genes than by minor variations in protein sequences."

I understand that the methods behind MMseqs2 and Plass are different, but MMseqs2 should be able to handle the "intron-containing eukaryotic proteins". In any case, could you kindly suggest a good way to identify these eukaryotic proteins? (Predicting eukaryotic genes from binned eukaryotic genomes is very troublesome.)

martin-steinegger commented 4 years ago

@liuxianghui Plass extracts all open reading frames from short reads and extends them through overlap detection. This works well for proteins that are encoded contiguously. However, eukaryotic genes contain introns, so reads cannot simply be overlapped to recover the full protein. MetaEuk takes metagenomic assemblies as input, searches their six-frame translations against a set of reference protein sequences, and predicts proteins from the exons.
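To make the intron problem concrete, here is a minimal Python sketch of six-frame translation. It is only an illustration, not how Plass or MetaEuk is implemented; the function names and the toy exon/intron sequences are invented for this example. It shows that translating a spliced transcript yields the full protein in one frame, while translating the genomic sequence (with the intron still present) leaves the two exon-encoded pieces in different frames, which is why a reference-guided exon search is needed.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a nucleotide sequence in all three frames of both strands."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


# Toy sequences (made up for illustration).
exon1 = "ATGGCTAAA"   # encodes M A K
intron = "GTAAGTAG"   # 8 nt, so it shifts the reading frame of exon 2
exon2 = "GGTTTCTGA"   # encodes G F stop

spliced = exon1 + exon2            # what a transcript would look like
genomic = exon1 + intron + exon2   # what a metagenomic contig would contain

print(six_frame_translations(spliced)["+1"])   # MAKGF*
print(six_frame_translations(genomic)["+1"])   # MAKVSRVS
print(six_frame_translations(genomic)["+3"])   # G*SK*GF*
```

In the genomic sequence the first exon is readable in frame +1 and the second in frame +3, so no single translated frame contains the complete protein.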

What makes the detection of eukaryotic genes hard? The fragmentation of the genomes?

liuxianghui commented 4 years ago

For bacteria, the usual approach is to assemble the reads into contigs and then use Prodigal to predict the genes. However, this does not work for eukaryotes: we have to bin the contigs into genomes, find the eukaryotic genomes, and try to assign taxonomy. Then we use tools such as GeneMark-ES to predict genes for each genome. (There is no tool that works on eukaryotic contigs the way Prodigal does for bacteria.) GeneMark-ES uses a self-trained model based on each genome to make predictions. Augustus has a limited set of models and only applies to specific eukaryotic genomes.

So I turned to MetaEuk and Plass. I expected that they would help me identify all the eukaryotic genes/proteins without binning the metagenome into genomes and running GeneMark-ES. However, I am not sure how well MetaEuk and Plass can do this. I saw that you did a lot of work on marine and gut samples. Please kindly share your opinion on them.

MetaEuk is described as a sensitive, high-throughput tool for gene discovery and annotation in large-scale eukaryotic metagenomics. So if my metagenome data does not contain novel eukaryotic species, most eukaryotic genes could be found by such a search using my assembled contigs, right?

Also, Plass seems to be a de novo approach; could it help identify eukaryotic proteins? Your Plass nucleotide assembly seems not to work as well as the Plass protein assembly. However, I could use MMseqs2 to search my contigs against the Plass proteins to identify the nucleotide genes. Does this make sense?
