Please help me to understand why we need alleyoop collapse.

realzhang commented 9 months ago

Please allow me to quote the help text of alleyoop collapse first: "This tool allows you to collapse all 3’UTR entries of a tcount file into one single entry per 3’UTR (similar to the exons->gene relationship in a gtf-file). All entries with identical 3’UTR IDs will be merged." I have trouble understanding the following:

How or why are multiple entries generated for one 3'UTR?
If we need to collapse them for DEG calling, will overlapping reads be counted twice during the collapse?

Thank you very much.

isaacvock commented 9 months ago

I'll let the developer give a more definitive answer, but this topic was partially discussed in #132, so you might find some answers there. My sense is that, while the wording in the documentation is a bit confusing, you are effectively combining all of the alternative UTRs annotated for a given gene into one "UTR-region" associated with one gene ID. This has the benefit of increasing the statistical power of downstream analyses by aggregating reads from the same gene.

I am not sure about the question of double counting reads, though I suspect that slamdunk handles this properly and does not double count. My understanding is that reads are uniquely assigned to a single UTR based on which annotated UTR is the best match. So there are no instances of one read mapped to multiple UTRs in the uncollapsed tcount file.

isaacvock commented 9 months ago

From the SLAMDUNK documentation:

" We have settled for a very conservative multimapper reassignment strategy:

Since the QuantSeq technology specifically enriches for 3’ UTRs, we only consider alignments to annotated 3’ UTRs supplied to slamdunk as relevant.

Therefore, any multimappers with alignments to a single 3’ UTR and non-3’UTR regions (i.e. not annotated in the supplied reference) will be unequivocally assigned to the single 3’UTR. If there are multiple alignments to this single 3’UTR, one will be chosen at random.

For all other cases, where a read maps to several 3’ UTRs, we are unable to reassign the read uniquely to a given 3’UTR and thus discard it from the analysis. "
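To make the quoted rule concrete, here is a minimal Python sketch of the decision as I read it. This is not slamdunk's actual code: the function name `assign_multimapper` and the `(utr_id, position)` tuple layout are made up for illustration, and returning `None` for reads with no 3' UTR alignment at all is my assumption.

```python
import random

def assign_multimapper(alignments, rng=random):
    """Pick one alignment for a multimapping read, or None to discard it.

    alignments: list of (utr_id, position) tuples (hypothetical layout);
    utr_id is None when the alignment lies outside every annotated 3' UTR.
    """
    # Only alignments inside annotated 3' UTRs are considered relevant.
    utr_hits = [a for a in alignments if a[0] is not None]
    utr_ids = {a[0] for a in utr_hits}

    # Hits in several different 3' UTRs cannot be resolved: discard.
    # No 3' UTR hit at all is also treated as "discard" (assumption).
    if len(utr_ids) != 1:
        return None

    # Exactly one 3' UTR: choose one of its alignments at random.
    return rng.choice(utr_hits)
```

Under this reading, a read that survives is tied to exactly one 3' UTR, so it contributes to exactly one entry of the uncollapsed tcount file.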

t-neumann commented 9 months ago

Hi - sorry I was on retreat this week, so here's my answer:

What biologists often want in any DEG analysis is one value per gene. To this end, even in a classical DEG analysis, all reads from all isoforms of a gene are aggregated into one number. The same is true for QuantSeq, where we sequence 3' ends. Isoforms that share the same 3' end will be covered by a single value anyway, but alternative isoforms ending in different 3' ends will have different values. To get one value per gene covering all possible isoforms, it is necessary to aggregate the read counts for all 3' ends of a gene, analogous to a regular RNA-seq analysis. This is the use case of alleyoop collapse, which aggregates the numbers for all 3' ends recorded for a given gene.
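As a rough illustration of the aggregation described above, the following Python sketch sums counts over all entries that share a name. It is not the alleyoop implementation: the row layout is a simplified, hypothetical stand-in for a tcount file (real files have more columns), and the gene names and numbers are invented.

```python
from collections import defaultdict

# Hypothetical, simplified tcount-like rows: one entry per annotated 3' end.
rows = [
    {"name": "GeneA", "start": 100, "end": 400, "read_count": 30, "tc_read_count": 6},
    {"name": "GeneA", "start": 900, "end": 1300, "read_count": 50, "tc_read_count": 4},
    {"name": "GeneB", "start": 200, "end": 700, "read_count": 20, "tc_read_count": 5},
]

def collapse_by_name(rows):
    """Sum the count columns over all entries that share the same name."""
    totals = defaultdict(lambda: {"read_count": 0, "tc_read_count": 0})
    for row in rows:
        totals[row["name"]]["read_count"] += row["read_count"]
        totals[row["name"]]["tc_read_count"] += row["tc_read_count"]
    return dict(totals)

collapsed = collapse_by_name(rows)
for name, counts in collapsed.items():
    print(name, counts["read_count"], counts["tc_read_count"])
```

Here the two GeneA entries end up as a single row, which is the one-value-per-gene shape that DEG tools expect.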

Typically one would make sure beforehand (as we do) that overlapping 3' ends are merged before counting, so that you have only disjoint 3' end intervals. So in that sense, yes, reads would be counted twice if they map to several intervals, but we usually take care of that by producing a 3' end annotation where this is prevented in the first place.
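For anyone preparing their own annotation, merging overlapping 3' end intervals can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build our annotation: it assumes half-open `(start, end)` intervals that already belong to the same gene, chromosome and strand, and the function name `merge_intervals` is made up.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two overlapping 3' ends and one separate 3' end (invented coordinates).
disjoint = merge_intervals([(100, 400), (350, 600), (900, 1300)])
print(disjoint)
```

Once the intervals are disjoint, a read can fall into at most one of them, so summing per gene afterwards cannot count it twice.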

t-neumann / slamdunk

Please help me to understand why we need alleyoop collapse. #141