Understanding count matrix discrepancies and exploring transcript-level differential expression with TEcount/TEtranscripts

ptranvan commented 2 months ago

Hi,

1) I would like to know how TEcount/TEtranscripts produce the count matrix. I have aligned my paired-end reads with STAR using the recommended parameters (-outFilterMultimapNmax 100 --winAnchorMultimapNmax 100) and extracted the coverage with bedtools coverage. Here are the coverage for 2 samples:

CHR START   STOP    GENE_ID COV_SAMPLE1 COV_SAMPLE2
4   55169952    55171433    L1HS    2   .
4   55180241    55186271    L1HS    6   9
16  35880225    35884717    L1HS    10  8
17  27415684    27421690    L1HS    4   6
X   60593348    60599398    L1HS    12  24

And here is the count in TEtranscripts_out.cntTable

    SAMPLE1 SAMPLE2
L1HS:L1:LINE    13  9

Why are there inconsistencies?

2) I am interested in the differential expression of transcripts instead of genes. For instance, is it possible to analyze each L1HS on each chromosome individually (instead of summing)?

olivertam commented 2 months ago

Hi,

1) The reason is that a tool like bedtools treats each alignment as an independent entry, whereas that read could have aligned to multiple location (and thus represent only one read). Our software handles this by splitting the reads to all possible genomic locations (i.e. all L1HS copies that the read mapped to), and then determine the most likely location based on EM. It then aggregates based on subfamily (gene_id). Also, if a read overlaps both a gene and a TE, it is assigned differently depending if the read was uniquely/unambiguously mapped (i.e. one genomic location) or mapped to multiple location. For the former, our software will assign the read to the gene, while the latter would go to the TE. This is based on the assumption that a uniquely aligned read is more likely to come from a gene (which tend to be more "unique") than from a TE (which tends to have very similar sequences in multiple genomic locations)

2) if you're interested in locus-level quantification, you can try TElocal.

Thanks

ptranvan commented 2 months ago

Thank you for your response @olivertam . I'm still not completely clear on the following:

If a read maps to two 100% identical transposable elements (TEs), for example, one from chromosome 1 and another from chromosome 5, how do you determine its actual location? Would it be 1 or 2 counts ?

Regarding counts for paired-end reads: Do both reads need to map to the same feature (and within a proper insert size range) to be counted as a single occurrence?

olivertam commented 2 months ago

Hi,

Sorry if I didn't make things clear enough.

Here is a very brief summary of how the quantification is done: 1) Gene and TE annotations are identified for each alignment (for paired reads, each end of the alignment is annotated independently) 2) If the read is uniquely aligned (i.e. single genomic alignment for the single/paired end), then the read is evenly split into the possible gene annotations (e.g. if there's only 1 gene annotation, then that gene gets 1 count, if there are 2, each gene gets 0.5). If there are no gene annotation overlaps, then the TE gets the read. 3) If the read maps to multiple locations, then the read is evenly split into the possible genomic locations (so if there are five genomic alignments, each alignment gets 1/5 of the read). It is then further split based on the number of TE annotations at each location (so if an alignment hits two TE, it will get 1/5 * 1/2 = 1/10 of the read), or gene annotations if there are no overlapping TE. This will continue for the rest of BAM file, until we go through all the alignments 4) We then take the TE reads that were split (from step 3), and run them through an EM to determine the result that best reflect how the reads should be distributed (you can read the paper to get the exact method). 5) Once the EM redistribution is done, the counts are added back to those that were not redistributed (i.e. where there was no ambiguity that the read belonged to a particular TE copy), and then aggregated by subfamily (gene_id). We then round to integers and output the count table.

For your first scenario, the very short answer is that it's 1 count. The longer answer is that the read is initially split into two, so each copy (chr1 and chr5) gets 0.5. Once all the TE have been distributed this way, then we will look at the overall proportion of reads assigned to the chr1 and chr5 copies, and determine if the even split is valid, or if it should be redistributed differently.

This is a simplified and not mathematically precise example: Read A mapped to your two TE (chr1 and chr5), and will be initially given 50-50 to each copy:

chr1-TE: 0.5
chr5-TE: 0.5

Once all the TEs are initially distributed, then we find that the two TE have these counts (we might have reads that match the chr5 copy and something else, but not the chr1 copy):

chr1-TE: 5
chr5-TE: 10

So we go back to the read A and say that we want to split this 1/3 to chr1 and 2/3 to chr5 (based on the ratio between chr1-TE and chr5-TE). We do this for all reads that had multiple TE annotations, and see how the counts changed:

chr1-TE: 3
chr5-TE: 12

So we iterate again and split read A 1/4 to chr1 and 3/4 to chr5 (based on the ratio between chr1-TE and chr5-TE). We do this for all TE that had multiple TE copies, and we get this:

chr1-TE:3
chr5-TE:12

Since we're not seeing major difference from the previous iteration, we would say that we're arrived as the most likely scenario, and would therefore say that in this library/BAM file, chr1-TE had 3 reads and chr5-TE had 12 reads. Note that we don't keep the distribution for each read after the EM step, so we can't easily trace back the exact distribution of read A.

To directly answer your second question, both ends of the read can have different annotations, and will be distributed depending on whether it is unique/multi and whether it's a gene or TE.

Let me know if things are still unclear.

Thanks.

ptranvan commented 2 months ago

Thanks it's clear !

A general questions about RNA-seq and TEs:

Does standard protocol for RNA-seq like Illumina Stranded mRNA Prep (with polyA capture) is able to capture transcripts from TEs ?

Or what kind of protocol do you recommend ?

olivertam commented 2 months ago

Hi,

Yes, you are able to capture many TE with polyA capture, but there was a study that suggested that total RNA might capture a broader range.

We typically recommend total RNA to maximize breadth of capture, but we haven't vigorously tested the difference for ourselves.

Thanks.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

mhammell-laboratory / TEtranscripts

Understanding count matrix discrepancies and exploring transcript-level differential expression with TEcount/TEtranscripts #186