pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Quantify the expression of transposable elements #380

Closed zhaotao1987 closed 1 year ago

zhaotao1987 commented 1 year ago

Hi,

We are interested in the expression of transposable elements (TEs). An intact TE contains LTR regions and a CDS region that encodes transposases, and its transcript behaves like a polycistronic mRNA. I was wondering whether I could use the CDS region of each TE as the reference to quantify TE expression with kallisto. Another question: should I use only the TE CDS sequences as the reference, or the TE CDS sequences plus all predicted genes? Our RNA-seq data is standard Illumina-based RNA-seq data.

Thanks very much! Best, Tao

maximilianh commented 1 year ago

I think you should use all sequences for the kallisto run, because TEs are very diverse and many sequences will not have a good alignment if you align to the consensus (this depends on the type of TE to some extent). Then, after alignment, you'd map the reads from the TE copy to a consensus, one per type.

Such a mapping for human is described here: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-020-00208-w (the accompanying website has the consensus sequences and a tool for the mapping, and the methods point to a script that can make them for other organisms). I am also aware of an R module for this task and can point you to it.

Let me know if that doesn't work.

best Max
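One way to picture the "quantify every copy, then collapse per type" idea is the minimal Python sketch below. It is a simplification, not the tool from the linked paper: it only sums kallisto's per-target estimates by family label and does not lift read positions onto consensus coordinates. The column names are those of kallisto's `abundance.tsv`; the `<family>|<locus>` target naming scheme, the `family_from_id` helper, and the toy numbers are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_by_family(abundance_tsv, family_of):
    """Sum kallisto est_counts and tpm over all targets sharing a family label."""
    totals = defaultdict(lambda: {"est_counts": 0.0, "tpm": 0.0})
    for row in csv.DictReader(abundance_tsv, delimiter="\t"):
        family = family_of(row["target_id"])
        totals[family]["est_counts"] += float(row["est_counts"])
        totals[family]["tpm"] += float(row["tpm"])
    return dict(totals)


def family_from_id(target_id):
    # Hypothetical naming scheme: "<family>|<copy locus>".
    # IDs without "|" are treated as ordinary genes and kept separate.
    if "|" in target_id:
        return target_id.split("|", 1)[0]
    return "gene:" + target_id


# Toy stand-in for a kallisto abundance.tsv file (values are made up).
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "Copia1|chr1:101-5100\t5000\t4800\t10\t2.0\n"
    "Copia1|chr3:501-5400\t4900\t4700\t5\t1.0\n"
    "L1-2|chr2:2001-2300\t300\t120\t3\t4.5\n"
    "geneA\t1500\t1300\t20\t6.0\n"
)
family_totals = sum_by_family(example, family_from_id)
```

With a real run, `example` would be replaced by an open handle on the `abundance.tsv` file in the kallisto output directory.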

zhaotao1987 commented 1 year ago

Thank you so much for the response, Max. So you suggest that we use all predicted TEs plus predicted genes for kallisto quantification, and that for the predicted TEs we use only the coding regions (excluding the flanking LTR regions). I am not quite sure about the purpose of the step you proposed after that: do we then further classify the mapped reads (those mapped onto TE CDS) into families, using a consensus sequence for each TE family, in order to get an idea of, for example, which clade of TEs is highly expressed? Thanks very much! Best, Tao

maximilianh commented 1 year ago

If by CDS you mean "coding sequence", I don't know why you would restrict yourself to that. I'd use the entire sequence of all TEs.

And yes, I would first align to the TE genome sequences and then map from that to the consensus, not align to the consensus with kallisto.
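A minimal sketch of assembling such a reference, assuming it combines the predicted genes with the full-length sequence of every TE copy as discussed in this thread. The record names and sequences are toy values, and the `read_fasta` and `combine_references` helpers are hypothetical, not part of kallisto. The duplicate check is there because each target in the reference needs its own ID for the per-copy estimates to be interpretable.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the target ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def combine_references(gene_lines, te_lines):
    """Merge gene and TE records into one list, refusing duplicate IDs."""
    seen, merged = set(), []
    for source in (gene_lines, te_lines):
        for name, seq in read_fasta(source):
            if name in seen:
                raise ValueError(f"duplicate target ID: {name}")
            seen.add(name)
            merged.append((name, seq))
    return merged


# Toy inputs; real inputs would be open file handles on the two FASTA files.
genes = [">geneA some description", "ATGGCC", "TTAA"]
tes = [">Copia1|chr1:100-112", "TGTTGGAATACA"]
combined = combine_references(genes, tes)
fasta_text = "".join(f">{name}\n{seq}\n" for name, seq in combined)
```

Written to a file, `fasta_text` is the kind of input that `kallisto index` takes, after which `kallisto quant` produces per-target estimates.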

zhaotao1987 commented 1 year ago

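I see, thanks very much. I ask because I think the entire TE contains both the LTR regions and the coding regions, similar to what is shown in this figure. In theory, the LTR regions are not transcribed? [image] https://user-images.githubusercontent.com/16197676/229766823-d7fa054a-0f63-4f79-940a-790fd9cdf766.png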

maximilianh commented 1 year ago

Oh, LTRs probably not, but I imagine many other elements have sequences that are neither LTR nor coding, and I'd keep them. Yes, it's probably a detail, but it is also easier to take the entire sequence from the RepeatMasker output than to start filtering on CDS annotations in there.
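Taking the entire annotated element from the RepeatMasker output could look roughly like the sketch below, which turns rows of a RepeatMasker `.out` table into BED-style intervals. The column positions follow the standard `.out` layout (query sequence, begin, end, strand, matching repeat); the `repeatmasker_to_bed` helper, the `<repeat>|<locus>` naming, and the two example rows are assumptions for illustration.

```python
def repeatmasker_to_bed(lines):
    """Convert RepeatMasker .out rows to BED6-style tuples, one per annotated element."""
    records = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer Smith-Waterman score; header and blank rows do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        chrom, start, end = fields[4], int(fields[5]), int(fields[6])
        # RepeatMasker marks the reverse strand with "C" (complement).
        strand = "-" if fields[8] == "C" else "+"
        # Hypothetical ID scheme "<repeat>|<locus>", with the locus in 1-based coordinates.
        name = f"{fields[9]}|{chrom}:{start}-{end}"
        # .out coordinates are 1-based inclusive; BED is 0-based half-open.
        records.append((chrom, start - 1, end, name, 0, strand))
    return records


# Toy excerpt in .out layout (values are made up).
example_out = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left)  ID

  1234  10.5  0.0  1.2  chr1      101   5100  (94900) + Copia1    LTR/Copia    1     5000 (0)    1
   567  20.1  2.0  0.0  chr2      2001  2300  (5000)  C L1-2      LINE/L1      (100) 300  1      2
""".splitlines()

bed_records = repeatmasker_to_bed(example_out)
```

The resulting intervals can be written out tab-separated and passed to a sequence extraction tool such as `bedtools getfasta` to obtain one FASTA record per TE copy.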

zhaotao1987 commented 1 year ago

I see, thanks!