t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0
39 stars 23 forks source link

About calculating T>C reads CPM #146

Closed bioinformatica closed 7 months ago

bioinformatica commented 7 months ago

Dear Neumann:

I tried to repeat Figure 3 and calculate the CPM value of T>C, which is about 100 times larger than the figure in the paper. I use R script for tcount.tsv, the calculation formula is TcReadCount_CPM = TcReadCount/sum(ReadCount)*1000000. Is there anything wrong?

image

please kindly help.

t-neumann commented 7 months ago

Hi,

the axis ranges look roughly comparable to me, where do you see the 100-fold increase?

bioinformatica commented 7 months ago

The minimum value of t>c cpm I drew is only 10^-1, but in the article it can reach 10^-3.

t-neumann commented 7 months ago

Did you download the original count files from the paper and repeat the plots using those tables, as sanity check?

bioinformatica commented 7 months ago

I just tried running tsv on GSE. The following is my code and results. Can you help me figure out what's wrong?I

setwd("/home/mpiage/deseq/slamseq") count_table<-read.table("./Slamdunk_45/count/GSM2666849_47760_mESC_45min_pulse_Rep_1.tsv",sep = "\t",header=1) count_table<-count_table[count_table$TcReadCount != 0,] count_table<-count_table[count_table$ReadCount >= 100,]

TcReadCount<-count_table$TcReadCount ReadCount<-count_table$ReadCount sum(ReadCount) TcReadCount_CPM = TcReadCount/sum(ReadCount)1000000 ReadCount_CPM =ReadCount/sum(ReadCount)1000000 data <- data.frame(SteadyState_CPM = (ReadCount_CPM), TcReads_CPM = (TcReadCount_CPM),Name=count_table$Name )

p <- ggplot(data, aes(x = SteadyState_CPM, y = TcReads_CPM)) + theme_minimal()+ geom_point(alpha = 0.04) +
labs(x = "Steady state (cpm)",y = "T > C reads (cpm)") + scale_x_log10() + # Set x-axis to log scale with base 10 scale_y_log10() + # Set y-axis to log scale with base 10 theme_minimal() #

print(p)

image

t-neumann commented 7 months ago

I see you use only one replicate instead of all three. Maybe directly use the data from Supplementary Table S1 from the paper?

bioinformatica commented 7 months ago

Thank you so much. I want to repeat it from the beginning. How to calculate T>C and all reads cpm of 3 tsv files?

bioinformatica commented 7 months ago

I added them up and divided them by 3, and still the cpm did not reach 10^-3 image

count_table_1<-read.table("./Slamdunk_45/count/GSM2666849_47760_mESC_45min_pulse_Rep_1.tsv",sep = "\t",header=1) count_table_2<-read.table("./Slamdunk_45/count/GSM2666850_47761_mESC_45min_pulse_Rep_2.tsv",sep = "\t",header=1) count_table_3<-read.table("./Slamdunk_45/count/GSM2666851_47762_mESC_45min_pulse_Rep_3.tsv",sep = "\t",header=1)

ReadCount<-(count_table_1$ReadCount +count_table_2$ReadCount + count_table_3$ReadCount)/3 TcReadCount<-(count_table_1$TcReadCount + count_table_2$TcReadCount +count_table_3$TcReadCount )/3

total<- sum(ReadCount)

ReadCount<-ReadCount[TcReadCount != 0] TcReadCount<-TcReadCount[TcReadCount != 0]

TcReadCount<-TcReadCount[ReadCount >= 100] ReadCount<-ReadCount[ReadCount >= 100]

TcReadCount_CPM = TcReadCount/total1000000 ReadCount_CPM =ReadCount/total1000000

data <- data.frame(SteadyState_CPM = (ReadCount_CPM), TcReads_CPM = (TcReadCount_CPM) )

p <- ggplot(data, aes(x = SteadyState_CPM, y = TcReads_CPM)) + geom_point(alpha = 0.04) +
labs(x = "Steady state (cpm)",y = "T > C reads (cpm)") + scale_x_log10() + # Set x-axis to log scale with base 10 scale_y_log10() + # Set y-axis to log scale with base 10 theme_minimal() #

print(p)

bioinformatica commented 7 months ago

count_table<-read.table("./Slamdunk_45/count/GSM2666849_47760_mESC_45min_pulse_Rep_1.tsv",sep = "\t",header=1) ReadCount<-count_table$ReadCount TcReadCount<-count_table$TcReadCount TcReadCount_CPM_1 = TcReadCount/sum(ReadCount)1000000 ReadCount_CPM_1 =ReadCount/sum(ReadCount)1000000 index_1 <- ReadCount>100

count_table<-read.table("./Slamdunk_45/count/GSM2666850_47761_mESC_45min_pulse_Rep_2.tsv",sep = "\t",header=1) ReadCount<-count_table$ReadCount TcReadCount<-count_table$TcReadCount TcReadCount_CPM_2 = TcReadCount/sum(ReadCount)1000000 ReadCount_CPM_2 =ReadCount/sum(ReadCount)1000000 index_2 <- ReadCount>100

count_table<-read.table("./Slamdunk_45/count/GSM2666851_47762_mESC_45min_pulse_Rep_3.tsv",sep = "\t",header=1) ReadCount<-count_table$ReadCount TcReadCount<-count_table$TcReadCount TcReadCount_CPM_3 = TcReadCount/sum(ReadCount)1000000 ReadCount_CPM_3 =ReadCount/sum(ReadCount)1000000 index_3 <- ReadCount>100

ReadCount_CPM = (ReadCount_CPM_1+ReadCount_CPM_2+ReadCount_CPM_3)/3 TcReadCount_CPM = (TcReadCount_CPM_1+TcReadCount_CPM_2+TcReadCount_CPM_3)/3

ReadCount_CPM<-ReadCount_CPM[TcReadCount_CPM != 0] TcReadCount_CPM<-TcReadCount_CPM[TcReadCount_CPM != 0]

TcReadCount_CPM<-TcReadCount_CPM[ReadCount_CPM > 10^(0.5)] ReadCount_CPM<-ReadCount_CPM[ReadCount_CPM > 10^(0.5)]

data <- data.frame(SteadyState_CPM = (ReadCount_CPM), TcReads_CPM = (TcReadCount_CPM) )

p <- ggplot(data, aes(x = SteadyState_CPM, y = TcReads_CPM)) + geom_point(alpha = 0.04) +
labs(x = "Steady state (cpm)",y = "T > C reads (cpm)") + scale_x_log10() + # Set x-axis to log scale with base 10 scale_y_log10() + # Set y-axis to log scale with base 10 theme_minimal() #

print(p) image

t-neumann commented 7 months ago

Did you compare the x and y axis with the ones from the supplemental table? I have the suspicion that you do not correct for no4SU treated samples as stated in there (Mean of all T>C Reads (CPM) (45min4SU - 0h 4SU)):

image
bioinformatica commented 7 months ago

Thank you.