liulab-dfci / MAESTRO

Single-cell Transcriptome and Regulome Analysis Pipeline
GNU General Public License v3.0
277 stars 77 forks source link

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

Open mengxiao opened 3 years ago

mengxiao commented 3 years ago

Based on the MAESTRO paper, my understanding is that a peak that is in the exons or promoter region of more than one gene is scored as 0 for all genes:

In an “enhanced RP model,” [...] if the peaki is located in the promoter or exon regions of any nearby genes, then Wij is set to 0 (i.e., the peak is excluded).

If I've understood correctly, Wij is calculated by RP_AddExonRemovePromoter, which keeps track of which peaks are in exons/promoter regions. However, it doesn't seem to track which ones are in more than one gene. Is it possible that an additional step to filter such peaks has been left out?

taoliu commented 3 years ago

@mengxiao According to https://github.com/liulab-dfci/MAESTRO/blob/28e2927b938242960180a82688c955bd7c427c22/MAESTRO/scATAC_Genescore.py#L233-L237

All the peaks will be classified as either the peaks in any 'gene-bodies' (=exon+promoter)', or the peaks outside of any 'gene-bodies'. If a peak_i is in the first class, it will be excluded from later calculation (L238 and so one) while applying a decay function. The definition of a peak_i 'located in the promoter or exon regions of any nearby genes' doesn't exclude the peak overlapping with the exon/promoter of the given gene_j itself. So my understanding of the current implementation is that if the peak_i is overlapping with the exon of the given gene, it will have a weight of 1 no matter whether it's overlapping with other genes; if the peak_i is not overlapping with the given gene, but it's overlapping with any nearby gene exons, it will have a weight of 0. For me, the confusion is on how the isoforms of a gene model are considered? In the current codes, each isoform is independent while calculating the RPs and they will inevitably influence each other. In brief, we are still looking into this issue.

mengxiao commented 3 years ago

Ah interesting. May I ask a bit about the rationale behind that approach? Based on the supplement, it sounded like the enhanced model tries to avoid misattributing peaks to unexpressed neighboring genes. If that's the goal, wouldn't it make sense to exclude any peaks in the 'gene bodies' of multiple genes?

The issue of isoforms might be being addressed here https://github.com/liulab-dfci/MAESTRO/blob/28e2927b938242960180a82688c955bd7c427c22/MAESTRO/scATAC_Genescore.py#L370-L383 I think this chooses the isoform with highest total RP across all cells as the one to represent a gene. A possible wrinkle to that comes in parsing the gene info, where I think if there are multiple isoforms for a gene that share the first and last exonic bases (but possibly differ e.g. in middle exons?), only the first one in the annotation is kept: https://github.com/liulab-dfci/MAESTRO/blob/28e2927b938242960180a82688c955bd7c427c22/MAESTRO/scATAC_Genescore.py#L87

mengxiao commented 3 years ago

I was thinking a bit more about this- is the reason for handling overlaps this way that one doesn't expect many instances of peaks that are in promoter/exons of multiple genes? Dropping the promoter/exon regions is explicitly done for the exponential decay component of the model because that is much more likely to involve overlap between genes?