Correlation half-life calculation and DESeq2?

quarkaffi commented 2 years ago

Hi,

I was using the slam-seq pipeline and everything ran very smoothly. I did a pulse-chase experiment (24 labelling followed by a chase with 0h, 2h, 4h and 8h timpoints). I then did a differential gene expression analysis using DESeq2 comparing 8h vs 0h after the onset of the chase (using the readCount column for the sizefactor). I then went on to calculate the half-lives of all genes using the model described in another issue here. While my calculated half-lives correlate reasonably with other published datasets with pearson's correlations of 0.5-0.6 (I use another cell line so I assumed that this is reasonable), the correlation of my DESeq2 analysis and the half-lives is extremely poor (0.126).

Shouldn't be the genes that are most negatively affected in the DESeq2 analysis also the ones that show the lowest half-life and vice versa, or am I missing something? Would it help if I use a more stringent counting cutoff (currently it is at the standard input of 1)? Or am I getting something completely wrong here and they shouldn't correlate? I am extremely puzzled by that outcome.

The read-length is 120, on average 50M reads per replicate (done in triplicates).

Best Martin

isaacvock commented 2 years ago

I wouldn't expect a correlation between the half-lives and the L2FC comparing the pre-chase and 8 hour post chase RNA-seq data. It is reasonable to expect the half-life to correlate with the RPKM normalized read counts, since greater half-lives should yield (on average) more abundant transcripts. Comparing read counts pre-chase and 8 hour post chase is a bit strange though as unless s4U is messing with the cells (and it very well could be, which is why I prefer pulse-label to pulse-chase experimental designs since pulse-labeling reduces the amount of time cells are exposed to s4U), I wouldn't expect there to be differences in read counts between the samples you are comparing. Thus, you might expect any apparent L2FCs to be the result of experimental noise rather than real biological differences. Because of this, it is not surprising that your L2FC vs. half-life plot looks kind of like a rotated MA plot, with the spread of the L2FCs decreasing with higher half-lives, just as the L2FC spread in an MA plot typically decreases with greater read coverage (since half-life is likely correlated with read coverage).

If you had done your pulse-chase experiment in two different conditions (e.g., wildtype cells and cells that are perturbed in some way) and you were to calculate L2FC(half-lives) and L2FC(read counts) between the two different conditions, then I would expect L2FC(half-lives) to be at least somewhat correlated with L2FC(read counts), as long as the perturbation is something that is known to affect RNA stability. Still though, I wouldn't usually expect changes in RNA abundance to correlate with the half-lives themselves.

t-neumann commented 2 years ago

Excellent summary from @isaacvock

You would not expect any genes to be changing by adding 4SU and watching them over time - that was the whole point why 4SU labelling was introduced, since it should not interfere with any cellular processes and the cell should remain in its default physiological state. So any log2FC you are reading out are either 4SU effects or anyhing other unwanted biases you introduce experimentally.

quarkaffi commented 2 years ago

Thank you very much for your replies and help. I think there is a misunderstanding here. I did not perform a normal DESeq2 analysis on the total read count, but rather on the TC-conversion column (with the total readcount only used to define the sizefactors). The reason why I did it is because I saw something similar was performed in another publication and thought it would be a good quality control check (see Figure D) (https://doi.org/10.1093/nar/gky556). So in the DESeq2 analysis using the TC-conversion column, you wouldnt expect that fast turnovers should be on the far negative site of the plot?

isaacvock commented 2 years ago

Ah, my apologies for the misunderstanding. I still don't think that your L2FC vs. log(half-life) plot deviates too strongly from my expectations though. Regardless of half-life, there should be (on average) less reads with T-to-C conversions 8 hours post chase than prior to the chase. Therefore, the fact that the L2FCs are predominantly negative makes a lot of sense. Furthermore, I would expect the magnitude of the L2FCs to be the lowest for the longest half-life transcripts, as the fraction of labeled transcripts remaining after the chase will be higher for more stable transcripts; shorter half-lives would cause the labeled transcripts to get cleared out more quickly during the chase. Therefore, the fact that the largest negative L2FCs are for the least stable transcripts is promising. The significant amount of scatter in L2FCs at low half-lives could be related to the point I made in my first post, that less stable transcripts are less abundant and thus will yield noisier L2FCs.

In regards to the paper that inspired your analysis, they were analyzing a sufficiently different type of data (and I should know as I am a PhD student in the lab that published that paper!), which is why your L2FC correlation plot differs so starkly from theirs. In that paper, they pulled down s4U containing RNA with biotin-streptavidin after either a 0.5 hour or 18 hour chase. While all transcripts will be less s4U labeled after a 0.5 hour chase than an 18 hour chase, those that are more stable will have more labeled RNA remaining (and thus more RNA that can be pulled down) than those that are less stable. Thus, when you sequence the enriched RNA and compare read counts in the 0.5 hour and 18 hour chases, you see a correlation between L2FCs (log2 of 0.5 hour chase/18 hour chase read count) and half-lives. Unstable transcripts take up less of the total enriched RNA pool after a longer chase than more stable transcripts.

quarkaffi commented 2 years ago

Okay, I think I found what was off. The analysis just finished of my modified run. Yesterday I noted that there are quite some overlapping features in my 3UTR bedfile which presumable lead to regions being counted multiple times for different genes (and the same ones). The way that I collapsed these files was to sum up all the counts from the same transcript and went ahead with the downstream analysis (except for the conversion where I averaged it).

I got rid of all the ambiguous features and merged overlapping features (if they are coming from the same gene) into one and reran the analysis. Now everything correlates with almost the same Pearson's as in the paper of your lab!

Thank you again for all the help and it was good to have that little sanity check as I was losing my mind in the last days.

isaacvock commented 2 years ago

I guess I'm still confused as to why the correlation would be identical, as I would expect nothing but negative L2FCs in your case whereas in that paper there is an even spread of positive and negative L2FCs (unless you are in fact enriching for s4U RNA, like TT-SLAM-seq? Or I am missing something about the normalization strategy. Though I guess you could get a similar looking correlation plot just with the x-axis shifted). Anyways, I'm glad you were able to figure everything out, sorry for all of my misunderstandings!

quarkaffi commented 2 years ago

Yes, it is simply shifted. And you are right, the shape is still a bit more like a cone, however much better behaved.

No worries, it was my fault for not being clearer. Really appreciate all your input.

t-neumann / slamdunk

Correlation half-life calculation and DESeq2? #119