soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0
132 stars, 14 forks

Use Plass for euk metagenomics data #28

Open liuxianghui opened 4 years ago

liuxianghui commented 4 years ago

I want to extract eukaryotic genes/proteins from metagenomic data and build a eukaryotic gene/protein catalog. It seems that MetaEuk is a reference-guided approach (based on MMseqs2), while Plass is a de novo approach (it does not rely on reference protein sequences). I don't understand the statement in your paper about Plass and eukaryotic protein assembly: "Our chief limitation is that, unlike nucleotide assemblers, Plass cannot place the assembled protein sequences into genomic context. Furthermore, it cannot assemble intron-containing eukaryotic proteins, although, as shown, it can assemble eukaryotic proteins from transcriptome data. Another drawback is its inability to resolve homologous proteins from closely related strains or species with sequence identities above ~95%. However, the impact on the accuracy of predicted functions is low (Fig. 2) and bacterial phenotypes are determined more by the complement of horizontally acquired accessory genes than by minor variations in protein sequences." I understand that the methods behind MMseqs2 and Plass are different, but MMseqs2 should be able to handle 'intron-containing eukaryotic proteins'. Could you kindly suggest a good way to identify these eukaryotic proteins? (Predicting eukaryotic genes from binned eukaryotic genomes is very troublesome.)

martin-steinegger commented 4 years ago

@liuxianghui Plass extracts all open reading frames from short reads and extends them through overlap detection. This works well for proteins that are encoded contiguously. However, eukaryotic genes contain introns, so the reads cannot be overlapped to recover the full proteins. MetaEuk takes metagenome assemblies as input, searches these assemblies in six-frame translation against a set of reference protein sequences, and predicts the proteins from the exons.

What makes the detection of eukaryotic genes hard? The fragmentation of the genomes?
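
To make the intron problem concrete, here is a minimal toy sketch in Python. It is not Plass or MetaEuk code; the exon and intron sequences and all function names are made up for illustration, and it assumes the standard genetic code. It translates a spliced transcript and the corresponding intron-containing genomic sequence in all six frames.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate three offsets on the forward strand and three on the reverse."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


# Hypothetical toy gene: two exons separated by a short GT...AG intron.
exon1 = "ATGGCTAAAGGT"   # M A K G
intron = "GTGAGTTAG"     # carries an in-frame stop codon (TAG)
exon2 = "GAACTGTTCTGA"   # E L F stop

transcript = exon1 + exon2        # spliced, as in transcriptome data
genomic = exon1 + intron + exon2  # unspliced, as in metagenomic reads

print(translate(transcript))  # MAKGELF*
print(translate(genomic))     # MAKGVS*ELF*

# The full protein is not present in any of the six genomic frames.
found = any("MAKGELF" in frame for frame in six_frame_translations(genomic))
print("full protein found in genomic six-frame translation:", found)
```

In the spliced transcript the protein is one uninterrupted open reading frame, which is the situation where overlap-based extension can work. In the genomic sequence the intron interrupts the frame, so each exon only yields a fragment, and those fragments have to be matched to a reference protein and joined, which is roughly the role the reference search plays in the description above.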

liuxianghui commented 4 years ago

For bacteria, the usual approach is to assemble the reads into contigs and then use Prodigal to predict the genes. However, this does not work for eukaryotes: we have to bin the genomes, find the eukaryotic bins, and attempt a taxonomic assignment, then run a tool such as GeneMark-ES to predict the genes for each genome. (There is no tool that works on eukaryotic contigs the way Prodigal does for bacteria.) GeneMark-ES uses a self-training model built on each genome to make its predictions. Augustus has a limited set of models and only applies to specific eukaryotic genomes. So I turned to MetaEuk and Plass. I hoped they would help me identify all the eukaryotic genes/proteins without binning the metagenome into genomes and running GeneMark-ES. However, I am not sure how well MetaEuk and Plass would do. I saw that you did a lot of work on marine and gut samples. Please kindly share your opinion on them. MetaEuk is described as sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics. So if my metagenomic data does not contain novel eukaryotic species, most eukaryotic genes could be found by such a search using my assembled contigs, right? Also, Plass seems to be a de novo approach; could it help to identify eukaryotic proteins? Your Plass nucleotide assembly seems not to work as well as the Plass protein assembly. However, I could use MMseqs2 to search my contigs against the proteins assembled by Plass to identify the corresponding nucleotide genes. Does this make sense?