t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

Regions with overlapping 3'UTRs annotated #132

Closed os306 closed 7 months ago

os306 commented 1 year ago

Hello,

Apologies if this is a silly question, but I am trying to figure out what to do about regions in my tcount files that have overlapping 3'UTRs associated with them. For example, part of my non-collapsed tcount table looks like this:

Chromosome Start End Name Length Strand
1 19600884 19600971 MINOS1-NBL1 87 +
1 19626401 19629821 MINOS1,MINOS1-NBL1 3420 +
1 19634436 19634524 MINOS1-NBL1 88 +
1 19640345 19640469 MINOS1-NBL1 124 +
1 19640583 19640729 MINOS1-NBL1 146 +
1 19655011 19655200 MINOS1-NBL1 189 +
1 19655323 19655435 MINOS1-NBL1 112 +
1 19656865 19656927 MINOS1-NBL1 62 +
1 19657129 19658456 MINOS1-NBL1,NBL1 1327 +

When I then look at the corresponding genes in my collapsed tcount table I see rows that look like this:

gene_name
MINOS1,MINOS1-NBL1
MINOS1-NBL1
MINOS1-NBL1,NBL1

I haven't really been able to find a good explanation for what to do about these overlapping regions and how to account for them in the analysis. Would it be acceptable to collapse all three of these rows? I would be grateful for any advice/guidance.

Thank you!

isaacvock commented 1 year ago

The answer may depend on the specific analyses you are planning to perform, and the biology that you are studying. If what follows isn't a satisfying answer, feel free to follow up with more details about your specific experiments and what you are hoping to learn.

The simplest idea would be to keep them as is (i.e., separate features). You could end up with a nice situation where MINOS1,MINOS1-NBL1 comes up as a hit in whatever downstream analysis you are performing, but MINOS1-NBL1 and MINOS1-NBL1,NBL1 don't. In this hypothetical case, you can conclude that the MINOS1 transcript(s) must be responsible for whatever interesting behavior you observed at that site. This is just a particular example, but there are a number of combinations of results that could in principle allow you to dissect what transcripts are behaving interestingly, despite the overlapping UTRs. Obviously, it might not be this clean, but my intuition is to say that keeping all of these overlapping annotations shouldn't mess with any analysis you plan to perform.

os306 commented 1 year ago

Thank you for your valuable input!

I am treating a cancer cell line with an epigenetic inhibitor that is expected to downregulate transcription. I wasn't expecting the compound to have differential effects on different transcripts of the same gene (e.g. MINOS1), although you never know, I suppose.

Similarly, I have encountered another set of rows in my collapsed tcount table that look like this:

gene_name
UGT1A1,UGT1A10,UGT1A3,UGT1A4,UGT1A5,UGT1A6,UGT1A7,UGT1A8,UGT1A9
UGT1A1,UGT1A10,UGT1A4,UGT1A6
UGT1A6
UGT1A7
UGT2A1,UGT2A2

Not really sure what to do about those either. My plan is to use DESeq2 to perform differential gene expression analysis, and I am wondering if these overlapping 3'UTRs will affect the results in some way? For what it is worth, when I have downloaded tcount files from other published SLAM-seq datasets, I haven't seen rows with overlapping 3'UTRs to this extent. Obviously these are completely different experiments from my own, but I was wondering if the authors have somehow merged the overlapping 3'UTRs?

Thanks again for your input.

isaacvock commented 1 year ago

The existence of these in your SLAMDUNK output should not impact differential gene expression analysis. The examples you have shown are all well-documented cases of overlapping gene transcription and readthrough. The lack of such complex annotations in published tables may stem from them getting filtered out, because it is difficult to dissect which gene is responsible for any observed effects. If you are concerned though, I would suggest performing differential gene expression analysis with and without these overlapping UTRs, and confirming that the two analyses agree.
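A rough sketch of how the "with and without" comparison could be set up: split the collapsed table's labels into those naming a single gene and those naming several, then run the same differential expression analysis on the full set and on the single-gene subset. It assumes that a comma in the label marks an overlapping annotation, as in the tables posted above. The helper name `is_ambiguous` is made up for illustration.

```python
# Labels copied from the collapsed tcount tables shown in this thread.
labels = [
    "MINOS1,MINOS1-NBL1",
    "MINOS1-NBL1",
    "MINOS1-NBL1,NBL1",
    "UGT1A6",
    "UGT2A1,UGT2A2",
]

def is_ambiguous(label):
    """Assumption: a comma means the row is assigned to more than one gene."""
    return "," in label

# Rows to keep for the "without overlapping UTRs" run.
unique_only = [label for label in labels if not is_ambiguous(label)]
# Rows that are dropped in that run but kept in the "with" run.
ambiguous = [label for label in labels if is_ambiguous(label)]

print(unique_only)
print(ambiguous)
```

Genes that are called significant in one run but not the other would be the ones to look at more closely.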

My point in my first post was that in theory, if a subset of these overlapping genes are responding to the epigenetic inhibitor, you may be able to dissect which genes in these sets are responding and which are not. Therefore, while the interpretation of which genes are affected by the inhibitor is made more difficult in these cases due to their overlapping nature, including them in your differential expression analysis will not impact the accuracy of its output, and you may even be able to cleverly dissect any signal coming from these complicated loci. Keeping them will also ensure that you don't miss any potentially interesting biology going on at these sites.

os306 commented 1 year ago

Thanks again, I will run my differential gene expression analysis with and without these overlapping UTRs. A potential third analysis option is to merge any overlapping UTRs together - do you foresee any issue with this?

isaacvock commented 1 year ago

No problem! Can you clarify what you mean by merging the UTRs? Would you combine the data for all of the UTRs from genes for which at least one UTR exists that is assigned to multiple genes? This wouldn't have any broad detrimental impact on the differential expression analysis, but it's a bit unnecessary for cases where some of the genes that would get merged have UTRs that are unique to them. For example, if how I described it is how you plan to merge them, UGT1A6, which currently has at least one UTR unique to it, would get grouped in with all of the genes with UTRs that are ambiguously assigned to multiple genes. So you'd risk dampening any UGT1A6-specific signal in your data if you did this merging.
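To see which genes would lose a gene-specific row under that kind of merging, one could list the genes that have at least one label of their own. This is only a sketch over the labels posted above. The function name `genes_with_unique_row` is hypothetical, and it assumes that a single-name label means at least one UTR is assigned to that gene alone.

```python
# Labels copied from the UGT rows of the collapsed tcount table in this thread.
labels = [
    "UGT1A1,UGT1A10,UGT1A3,UGT1A4,UGT1A5,UGT1A6,UGT1A7,UGT1A8,UGT1A9",
    "UGT1A1,UGT1A10,UGT1A4,UGT1A6",
    "UGT1A6",
    "UGT1A7",
    "UGT2A1,UGT2A2",
]

def genes_with_unique_row(labels):
    """Split genes into those with their own row and those seen only in shared rows."""
    all_genes = set()
    unique = set()
    for label in labels:
        genes = label.split(",")
        all_genes.update(genes)
        if len(genes) == 1:
            # A single-name label: this gene has a row to itself.
            unique.add(genes[0])
    return sorted(unique), sorted(all_genes - unique)

with_unique, ambiguous_only = genes_with_unique_row(labels)
print(with_unique)
print(ambiguous_only)
```

Here UGT1A6 and UGT1A7 come out as having their own rows, so these are the genes whose specific signal a broad merge could dilute.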

os306 commented 1 year ago

Yes, that is along the lines of what I was thinking. For example, the three MINOS1-NBL1 rows above would be merged into 1 row (and the numeric values in each of the columns from these three rows would be summed up). Similarly, the UGT1A6 row would be merged with the two rows above it (because both rows also contain "UGT1A6"). I can see how that may dampen a UGT1A6-specific signal however...
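A minimal sketch of this merging idea, assuming rows are grouped whenever they share any gene name, directly or through a chain of shared names, and that a single count column is summed per group. The counts are invented for illustration and only the labels come from this thread. `merge_overlapping` is a hypothetical helper, not part of SLAMDUNK.

```python
from collections import defaultdict

# (label, count) pairs; the counts are made-up placeholder values.
rows = [
    ("MINOS1,MINOS1-NBL1", 10),
    ("MINOS1-NBL1", 5),
    ("MINOS1-NBL1,NBL1", 7),
    ("UGT1A1,UGT1A10,UGT1A3,UGT1A4,UGT1A5,UGT1A6,UGT1A7,UGT1A8,UGT1A9", 6),
    ("UGT1A1,UGT1A10,UGT1A4,UGT1A6", 3),
    ("UGT1A6", 4),
    ("UGT1A7", 2),
    ("UGT2A1,UGT2A2", 1),
]

def merge_overlapping(rows):
    """Sum counts over rows linked by shared gene names (union-find)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link all genes that appear together in one label.
    for label, _ in rows:
        genes = label.split(",")
        find(genes[0])
        for gene in genes[1:]:
            union(genes[0], gene)

    # Collect the member genes and summed counts for each linked group.
    members = defaultdict(set)
    totals = defaultdict(int)
    for label, count in rows:
        genes = label.split(",")
        root = find(genes[0])
        members[root].update(genes)
        totals[root] += count

    return {",".join(sorted(members[root])): totals[root] for root in members}

merged = merge_overlapping(rows)
for label, total in merged.items():
    print(label, total)
```

With these labels, the three MINOS1-NBL1 rows collapse into one row, and the UGT1A6 and UGT1A7 rows are both absorbed into the large UGT1A group, which is the dampening of gene-specific signal discussed above.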