Closed parcbioinfo closed 4 years ago
Hello PDK:
Thanks for writing about this issue. Yes, I can definitely look into this problem. Would you be able to paste the code you used to create this plot and the structure of your data_st
and dataMetrics
objects (i.e. the output of running str(data_st) and str(dataMetrics))? I think you provided that output earlier in Issue 6 but it seems to have been deleted (hopefully you resolved that problem on your own otherwise I can try to help with that too). Thanks again!
Sincerely, Lindsay
Thanks for your response. Here it is:
logcpm.df<-as.data.frame(cbind.data.frame(ID=rownames(logcpm),logcpm))
names(logcpm.df)
names(logcpm.df)<-c("ID","I.S77", "E.S55", "C.S49", "E.S54", "I.S60", "C.S75", "E.S56", "I.S58", "C.S47", "C.S76", "C.S46", "E.S52", "E.S53", "E.S51", "C.S50", "C.S48", "I.S57", "I.S59")
names(logcpm.df)
#Ordering samples columns so that data_st and dataMetrics order matches
logcpm.df <- logcpm.df[c(1,4,7,10,11,12,16,17,3,5,8,13,14,15,2,6,9,18,19)]
logcpm.df$ID<-as.character(logcpm.df$ID)
names(logcpm.df)
logcpm.df <- logcpm.df[logcpm.df$ID %in% rownames(y$counts),]
Group = factor(c("C","C","C","C","C","C","C","E","E","E","E","E","E","I","I","I","I","I"))
design <- model.matrix(~0+Group, data=logcpm.df[,2:19])
colnames(design) <- levels(Group)
d <- estimateDisp(logcpm.df[,2:19], design)
fit <- glmFit(y, design)
dataMetrics <- list()
for (i in 1:(ncol(fit)-1)){
for (j in (i+1):ncol(fit)){
contrast=rep(0,ncol(fit))
contrast[i]=1
contrast[j]=-1
lrt <- glmLRT(fit, contrast=contrast)
lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
setDT(lrt, keep.rownames = TRUE)[]
colnames(lrt)[1] = "ID"
lrt <- as.data.frame(lrt)
dataMetrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
}
}
str(dataMetrics, strict.width = "wrap")
List of 3
$ C_E:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000015308" "ENSMMUG00000004347" "ENSMMUG00000007338" "ENSMMUG00000050593" ... ..
..$ logFC : num [1:16689] -2.343 -1.451 -0.708 2.712 -1.236 ... ..
..$ logCPM: num [1:16689] -0.328 0.736 6.448 -1.045 3.786 ... ..
..$ LR : num [1:16689] 24.3 19.3 18 18 17.1 ... ..
..$ PValue: num [1:16689] 8.09e-07 1.12e-05 2.16e-05 2.21e-05 3.62e-05 ... ..
..$ FDR : num [1:16689] 0.0135 0.0921 0.0921 0.0921 0.114 ...
$ C_I:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000051566" "ENSMMUG00000008039" "ENSMMUG00000006643" "ENSMMUG00000052273" ... ..
..$ logFC : num [1:16689] -2.88 -2.1 -1.67 -2.07 -1.77 ... ..
..$ logCPM: num [1:16689] 5.33 3.88 6.19 7.39 5.56 ... ..
..$ LR : num [1:16689] 74.3 58.3 57.8 56.2 56.1 ... ..
..$ PValue: num [1:16689] 6.86e-18 2.24e-14 2.88e-14 6.67e-14 6.90e-14 ... ..
..$ FDR : num [1:16689] 1.14e-13 1.60e-10 1.60e-10 2.30e-10 2.30e-10 ...
$ E_I:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000051566" "ENSMMUG00000050147" "ENSMMUG00000008039" "ENSMMUG00000006643" ... ..
..$ logFC : num [1:16689] -2.89 -2.11 -2.22 -1.67 -1.78 ... ..
..$ logCPM: num [1:16689] 5.33 5.37 3.88 6.19 5.96 ... ..
..$ LR : num [1:16689] 66.7 59.6 58.3 52.5 51.5 ... ..
..$ PValue: num [1:16689] 3.11e-16 1.16e-14 2.28e-14 4.30e-13 7.19e-13 ... ..
..$ FDR : num [1:16689] 5.19e-12 9.70e-11 1.27e-10 1.79e-09 2.21e-09 ...
data_st <- as.data.frame(t(apply(as.matrix(logcpm.df[,-1]), 1, scale)))
data_st$ID <- as.character(logcpm.df$ID)
data_st <- data_st[,c(length(data_st), 1:length(data_st)-1)]
colnames(data_st) <- colnames(logcpm.df)
nID <- which(is.nan(data_st[,2]))
data_st[nID,2:length(data_st)] <- 0
colList = c("#00A600FF", rainbow(5)[c(1,4,5,6)])
str(data_st)
'data.frame': 16689 obs. of 19 variables:
$ ID : chr "ENSMMUG00000064595" "ENSMMUG00000064596" "ENSMMUG00000052583" "ENSMMUG00000052581" ...
$ C.S49: num 0.529 -1.188 -0.383 -0.987 -0.408 ...
$ C.S75: num -1.713 1.658 -0.86 0.64 -0.892 ...
$ C.S47: num 0.9118 -0.9486 -0.916 0.0537 1.0305 ...
$ C.S76: num -0.9971 0.0278 -0.6915 -1.2873 0.0941 ...
$ C.S46: num 0.301 0.569 -0.249 0.496 -1.432 ...
$ C.S50: num 0.335 -0.219 1.689 1.342 -2.201 ...
$ C.S48: num 0.171 -0.94 -0.509 -1.139 1.209 ...
$ E.S55: num -1.5166 0.5469 -0.5164 0.0405 0.1081 ...
$ E.S54: num 1.054 0.832 0.296 0.876 0.815 ...
$ E.S56: num -0.318 1.608 -1.062 -0.686 0.943 ...
$ E.S52: num 1.33 -1.018 -0.565 0.525 1.083 ...
$ E.S53: num 1.25 -0.866 -0.727 -0.639 0.153 ...
$ E.S51: num 0.78 1.066 0.192 -1.216 0.97 ...
$ I.S77: num 1.14 0.47 -1.103 0.205 0.395 ...
$ I.S60: num -0.9896 0.0411 1.3277 0.9305 0.5659 ...
$ I.S58: num -1.058 0.871 0.666 -1.725 -1.174 ...
$ I.S57: num -0.336 -1.415 1.806 1.179 -0.27 ...
$ I.S59: num -0.875 -1.095 1.603 1.392 -0.992 ...
ret1 <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4,
colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE,yAxisLabel = "Count (log CPM)", vxAxis = TRUE)
names(ret1)
tempdir()
print.ggExtraPlot1 <- function() {
grid.arrange(ret1[["C_E_4"]],top="Estradiol x Control")
grid.newpage()
grid.arrange(ret1[["C_I_4"]],top="Intact Ovary x Control")
grid.newpage()
grid.arrange(ret1[["E_I_4"]],top="Estradiol x Intact Ovary")
}
print.ggExtraPlot1()
Hello PDK:
Thanks for pointing out this issue. I updated the plotClusters()
function to better allow for both functions you suggested in the case of data with more than 2 treatment groups. You should now be able to plot the boxplot/parallel-coordinate-plots only the given pair of treatment groups at a time or for all treatment groups at once. I added a section on the website here with four reproducible examples. Feel free to let me know if any aspect of those four examples do not make sense.
In terms of the specific code you provided, below is a minimal working example with simulated data similar to yours creating similar four examples.
################## Download latest changes from GitHub ##################
install_github("lindsayrutter/bigPint")
################## Make data ##################
set.seed(1)
logcpm.df = data.frame(ID=paste0(replicate(2000, "ID"), 1:2000), C.S49 = round(100*runif(2000)), C.S75 = round(100*runif(2000)), C.S47 = round(100*runif(2000)), C.S76 = round(100*runif(2000)), C.S46 = round(100*runif(2000)), C.S50 = round(100*runif(2000)), C.S48 = round(100*runif(2000)), E.S55 = round(100*runif(2000)), E.S54 = round(100*runif(2000)), E.S56 = round(100*runif(2000)), E.S52 = round(100*runif(2000)), E.S53 = round(100*runif(2000)), E.S51 = round(100*runif(2000)), I.S77 = round(100*runif(2000)), I.S60 = round(100*runif(2000)), I.S58 = round(100*runif(2000)), I.S57 = round(100*runif(2000)), I.S59 = round(100*runif(2000)))
logcpm.df$ID = as.character(logcpm.df$ID)
rownames(logcpm.df) = logcpm.df$ID
################## Make dataMetrics ##################
y = DGEList(counts=logcpm.df[,-1])
group = c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
y = DGEList(counts=y, group=group)
Group = factor(c("C","C","C","C","C","C","C","E","E","E","E","E","E","I","I","I","I","I"))
design <- model.matrix(~0+Group, data=y$samples)
colnames(design) <- levels(Group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design) # changed y to d
dataMetrics <- list()
for (i in 1:(ncol(fit)-1)){
for (j in (i+1):ncol(fit)){
contrast=rep(0,ncol(fit))
contrast[i]=1
contrast[j]=-1
lrt <- glmLRT(fit, contrast=contrast)
lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
setDT(lrt, keep.rownames = TRUE)[]
colnames(lrt)[1] = "ID"
lrt <- as.data.frame(lrt)
dataMetrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
}
}
################## Make data_st ##################
data_st <- as.data.frame(t(apply(as.matrix(logcpm.df[,-1]), 1, scale)))
data_st$ID <- as.character(logcpm.df$ID)
data_st <- data_st[,c(length(data_st), 1:length(data_st)-1)]
colnames(data_st) <- colnames(logcpm.df)
nID <- which(is.nan(data_st[,2]))
data_st[nID,2:length(data_st)] <- 0
colList = c("#00A600FF", rainbow(5)[c(1,4,5,6)])
################## Make four example plots ##################
retTF <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE, showPairs = FALSE, vxAxis = TRUE)
grid.draw(retTF[["C_E_4"]])
retTT <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE, showPairs = TRUE, vxAxis = TRUE)
grid.draw(retTT[["C_E_4"]])
retFF <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = FALSE, showPairs = FALSE, vxAxis = TRUE)
grid.draw(retFF[["C_E_4"]])
retFT <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = FALSE, showPairs = TRUE, vxAxis = TRUE)
grid.draw(retFT[["C_E_4"]])
I updated the development version of bigPint
on Bioconductor from version 1.5.0 to version 1.5.1 to add these new features. The new version may take several days to be available for download on the website (which, as of now, is still version 1.5.0), so if you want to run the updated code right now, you may want to temporarily use the latest version available on the bigPint
GitHub page (by running install_github("lindsayrutter/bigPint"
) (see the first line in the example code above).
After testing it out, feel free to let me know if you have any questions and/or if the updated code works for your task. Thanks again.
Sincerely, Lindsay
Thanks, Lindsay. I really appreciate it. I will test it out and let you know.
It worked like a charm. Thanks, Lindsay.
Wonderful. Thanks, PDK.
Hi Lindsay, I am generating the cluster plots for my data. What I get is, for example: As you can see, the background boxplots for one of the groups are not displayed. The only ones displayed are those for the comparison in question. I would like to better understand why, and if we can change the plot to either not show the group that is not represented or to display the box plots for all groups. What am I missing? Thanks PDK