[Feature Request] Assembly-level Eukaryotic Metatranscriptomic Functional Annotation & Read-count Mapping

erfanshekarriz commented 1 year ago

Hi Josh,

I wanted to update you that the previous paper I was working on has been accepted and should be published in the following weeks. We made the decision to redact the metagenomics and eukaryotic binning analysis and focus on transcriptomic (for practical reasons, but we hope to do the eukaryotic binning soon for another study).

I still used VEBA for analyzing metatranscriptomic data. I know that for eukaryotic genomes VEBA uses annotations on the bin level, but realistically this would limit its usage for metatranscriptomic applications (you can find the poor performance of eukaryotic metatranscriptomic binning described here: https://taylorreiter.github.io/2017-10-02-Binning-metatranscriptomes/ ).

Metatranscriptomics is still the best way to analyze eukaryotic genes in the environment, and I think since VEBA's selling point is eukaryotic (+viral) annotation, a pipeline for counting their genes would be really helpful. Personally, I don't have a pipeline for this, and instead, use some form of alignment (e.g. BLAST, HMMs) to extract the genes from the assembly first and then perform counting on it.

I think this would be a good feature to look into to really cater to those wanting to do eukaryotic omics analysis.

Best,

Erfan

jolespin commented 1 year ago

Congratulations on the publication! Looking forward to checking it out once it’s released.

Yea I agree that binning transcripts isn’t a good idea and that VEBA should be able to handle these. I’ll add a new module to adapt what I did in our coral study: https://www.science.org/doi/10.1126/sciadv.abg3088

Basically, my thought for this will be to use transcripts from rnaSPAdes (or TRINITY) as input, then run transDecoder. The annotations/classifications will be tricky because the microeuk database would be more comprehensive but it won’t capture any proks. I’ll need to think about this for a little bit. I also need to adapt the classify euk module so it can classify from genomes and not just the binning results.

erfanshekarriz commented 1 year ago

Hi Josh,

That sounds great. I also need to think about this because we're currently sequencing poly-A transcriptomes of a bunch of deep-sea samples.

In that case most genes should be eukaryotic and simplify things.

The type of metatranscriptomics that captures all community members, including prokaryotes, should be slightly trickier.

Anyhow, this development sounds great and if I run into any good solutions I will let you know!

Best

Erfan

jolespin commented 1 year ago

Adding this to the to-do list in the DEVELOPMENT LOG. I've been thinking about it and I can't really find a good term to use for a module name. I want to keep the transcriptome assembly separate as it is already achieved with assembly.py using rnaspades.py (though I might add TRINITY as an alternative option now, since I added MEGAHIT as an alternative for metaG). Instead of making an entirely new module, I'll probably just add a wrapper script.

Basically what it's going to do is the following: https://github.com/TransDecoder/TransDecoder/wiki

1: extract the long open reading frames

TransDecoder.LongOrfs -t target_transcripts.fasta

2: check for homology against the Pfam db

3: predict the likely coding regions

TransDecoder.Predict -t target_transcripts.fasta --retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6
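For illustration, the three steps can be sketched in Python as a function that only builds the command lines and does not run anything. The `TransDecoder.LongOrfs` and `TransDecoder.Predict` flags are the ones quoted above. The `hmmsearch` invocation, the `Pfam-A.hmm` path, and the `<transcripts>.transdecoder_dir/longest_orfs.pep` location are assumptions based on the TransDecoder wiki, and `build_transdecoder_commands` is a hypothetical name, not part of VEBA.

```python
from pathlib import Path


def build_transdecoder_commands(transcripts, pfam_hmm="Pfam-A.hmm", threads=1):
    """Return the three command lines as lists of strings (nothing is executed)."""
    transcripts = Path(transcripts)
    # Assumption: LongOrfs writes peptides to <transcripts name>.transdecoder_dir/
    longest_orfs = Path(f"{transcripts.name}.transdecoder_dir") / "longest_orfs.pep"
    return [
        # 1: extract the long open reading frames
        ["TransDecoder.LongOrfs", "-t", str(transcripts)],
        # 2: check for homology against the Pfam db (assumed hmmsearch call)
        ["hmmsearch", "--cpu", str(threads), "--domtblout", "pfam.domtblout",
         str(pfam_hmm), str(longest_orfs)],
        # 3: predict the likely coding regions, retaining ORFs with Pfam hits
        ["TransDecoder.Predict", "-t", str(transcripts),
         "--retain_pfam_hits", "pfam.domtblout"],
    ]


if __name__ == "__main__":
    for cmd in build_transdecoder_commands("target_transcripts.fasta", threads=4):
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` once the tools are installed. The blastp/diamond homology search is left out here to keep the sketch short.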

It's also going to use the gene to transcript table which it will generate automatically if rnaspades identifiers are provided.
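As a rough sketch of how such a table could be derived, the snippet below parses rnaSPAdes-style identifiers. It assumes transcript IDs end in `_g<gene>_i<isoform>` (for example `NODE_1_length_2000_cov_10.5_g0_i0`) and that the `g<N>` part is a usable gene ID. The function name and the gene-then-transcript column order are illustrative choices, not necessarily what VEBA does.

```python
import re

# Assumption: rnaSPAdes transcript IDs end in _g<gene>_i<isoform>
RNASPADES_ID = re.compile(r"_(g\d+)_i\d+$")


def gene_to_transcript_table(fasta_lines):
    """Yield (gene_id, transcript_id) pairs from the header lines of a FASTA."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip empty headers
        transcript_id = fields[0]
        match = RNASPADES_ID.search(transcript_id)
        if match is None:
            raise ValueError(f"Not an rnaSPAdes-style identifier: {transcript_id}")
        yield match.group(1), transcript_id


if __name__ == "__main__":
    fasta = [
        ">NODE_1_length_2000_cov_10.5_g0_i0", "ATGC",
        ">NODE_2_length_1500_cov_8.1_g0_i1", "ATGG",
        ">NODE_3_length_900_cov_3.0_g1_i0", "ATGA",
    ]
    for gene_id, transcript_id in gene_to_transcript_table(fasta):
        print(f"{gene_id}\t{transcript_id}")
```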

Lastly, it's going to output one ORF per transcript. This should work for your purpose. I've had to do this in a few coral and diatom studies so it'll be worth automating in a way that other people can use as well.
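One way to get down to a single ORF per transcript is sketched below. The thread doesn't say which selection criterion the wrapper uses, so picking the longest ORF is an assumption, as is the `<transcript_id>.p<N>` ORF naming. `longest_orf_per_transcript` is a hypothetical helper.

```python
def longest_orf_per_transcript(orfs):
    """Map each transcript ID to the ID of its longest ORF.

    `orfs` maps ORF IDs to protein sequences. Assumption: ORF IDs follow
    the <transcript_id>.p<N> pattern.
    """
    best = {}
    for orf_id, sequence in orfs.items():
        transcript_id, separator, _ = orf_id.rpartition(".p")
        if not separator:
            transcript_id = orf_id  # no .p<N> suffix, treat the ID as the transcript
        length = len(sequence.rstrip("*"))  # ignore a trailing stop symbol
        # Keep the first ORF seen when lengths tie
        if transcript_id not in best or length > best[transcript_id][1]:
            best[transcript_id] = (orf_id, length)
    return {tid: orf_id for tid, (orf_id, _) in best.items()}


if __name__ == "__main__":
    example = {"t1.p1": "MKV*", "t1.p2": "MKVLLA*", "t2.p1": "MA"}
    print(longest_orf_per_transcript(example))
```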

jolespin commented 1 year ago

Just wanted to give a heads up that I just wrote a wrapper around transdecoder which includes the diamond and hmmsearch options. I also modified the GFF file so it's more useful. This will be released in v1.1 which will be a HUGE release. More soon.

erfanshekarriz commented 1 year ago

Hi Josh.

Great news and look forward to using this in my new project! Good luck with the release and thanks for getting back to me.

Best,

Erfan

jolespin commented 1 year ago

Here is the script:

https://github.com/jolespin/veba/blob/main/src/scripts/transdecoder_wrapper.py

I've also updated the transcript assembly to output the gene to transcript table automatically. Also included with it is a modification on the GFF file creation to have more identifiers. Also made it compatible with the prodigal and MetaEuk gff modifications so they can all be used with featureCounts together.
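To show what counting against such a GFF might look like, here is a minimal sketch that builds a featureCounts command line without running it. The `CDS` feature type, the `gene_id` attribute, and the file names are assumptions for illustration. Check the actual GFF produced by VEBA for the right values.

```python
def build_featurecounts_command(gff, bam_files, output="counts.tsv",
                                feature_type="CDS", attribute="gene_id",
                                threads=1):
    """Return a featureCounts command as a list of strings (not executed)."""
    if not bam_files:
        raise ValueError("At least one BAM file is required")
    return [
        "featureCounts",
        "-a", str(gff),        # annotation file
        "-o", str(output),     # output count table
        "-F", "GTF",           # annotation format (GTF or compatible GFF)
        "-t", feature_type,    # feature type to count (assumed: CDS)
        "-g", attribute,       # attribute used to group features (assumed: gene_id)
        "-T", str(threads),
        *map(str, bam_files),
    ]


if __name__ == "__main__":
    print(" ".join(build_featurecounts_command("genes.gff", ["sample_1.bam"])))
```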

I hadn't realized that transDecoder forces the output into the current working directory despite the output directory parameter. As a result, to keep outputs from overwriting each other, either each transcript file must have a different name, or you need to run each sample individually, or run them in batch.
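One possible workaround, sketched below, is to give every sample its own working directory and launch the command from there. This is an assumption about how one might handle it, not what the wrapper does. `longorfs_per_sample` and the `transdecoder_output` directory name are hypothetical.

```python
import subprocess
from pathlib import Path


def longorfs_per_sample(transcripts, sample_id,
                        base_dir="transdecoder_output", dry_run=True):
    """Prepare (and optionally run) TransDecoder.LongOrfs in a per-sample directory."""
    workdir = Path(base_dir) / sample_id
    workdir.mkdir(parents=True, exist_ok=True)
    # Use an absolute input path because the command runs from a different directory
    cmd = ["TransDecoder.LongOrfs", "-t", str(Path(transcripts).resolve())]
    if not dry_run:
        # Outputs written to the working directory stay separated by sample
        subprocess.run(cmd, cwd=workdir, check=True)
    return cmd, workdir
```

With `dry_run=True` (the default) the function only creates the directory and returns the command, which makes it easy to inspect before anything is executed.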

Going to look for more stable options in the future but thought this would still be useful in the interim.

Here's the current list of updates for v1.1.0.

https://github.com/jolespin/veba/blob/main/DEVELOPMENT.md

I might add a few more tweaks to the annotate module (e.g., AMR genes) and consensus orthogroup annotation, but other than that it's pretty finalized.

jolespin / veba

[Feature Request] Assembly-level Eukaryotic Metatranscriptomic Functional Annotation & Read-count Mapping #19