lindsayrutter / bigPint

Bioconductor package that makes BIG data pint-sized.
https://lindsayrutter.github.io/bigPint/
20 stars 7 forks source link

Parallel Coordinate Cluster Plots for more than 2 groups #8

Closed parcbioinfo closed 4 years ago

parcbioinfo commented 4 years ago

Hi Lindsay, I am generating the cluster plots for my data. What I get is, for example: image As you can see, the background boxplots for one of the groups are not displayed. The only ones displayed are those for the comparison in question. I would like to better understand why, and if we can change the plot to either not show the group that is not represented or to display the box plots for all groups. What am I missing? Thanks PDK

lindsayrutter commented 4 years ago

Hello PDK:

Thanks for writing about this issue. Yes, I can definitely look into this problem. Would you be able to paste the code you used to create this plot and the structure of your data_st and dataMetrics objects (i.e. the output of running str(data_st) and str(dataMetrics))? I think you provided that output earlier in Issue 6 but it seems to have been deleted (hopefully you resolved that problem on your own otherwise I can try to help with that too). Thanks again!

Sincerely, Lindsay

parcbioinfo commented 4 years ago

Thanks for your response. Here it is:

logcpm.df<-as.data.frame(cbind.data.frame(ID=rownames(logcpm),logcpm))
names(logcpm.df)
names(logcpm.df)<-c("ID","I.S77", "E.S55", "C.S49", "E.S54", "I.S60", "C.S75", "E.S56", "I.S58", "C.S47", "C.S76", "C.S46", "E.S52", "E.S53", "E.S51", "C.S50", "C.S48", "I.S57", "I.S59")
names(logcpm.df)

#Ordering samples columns so that data_st and dataMetrics order matches
logcpm.df <- logcpm.df[c(1,4,7,10,11,12,16,17,3,5,8,13,14,15,2,6,9,18,19)]
logcpm.df$ID<-as.character(logcpm.df$ID)
names(logcpm.df)
logcpm.df <- logcpm.df[logcpm.df$ID %in% rownames(y$counts),]

Group = factor(c("C","C","C","C","C","C","C","E","E","E","E","E","E","I","I","I","I","I"))
design <- model.matrix(~0+Group, data=logcpm.df[,2:19])
colnames(design) <- levels(Group)
d <- estimateDisp(logcpm.df[,2:19], design)
fit <- glmFit(y, design)

dataMetrics <- list()
for (i in 1:(ncol(fit)-1)){
    for (j in (i+1):ncol(fit)){
        contrast=rep(0,ncol(fit))
        contrast[i]=1
        contrast[j]=-1
        lrt <- glmLRT(fit, contrast=contrast)
        lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
        setDT(lrt, keep.rownames = TRUE)[]
        colnames(lrt)[1] = "ID"
        lrt <- as.data.frame(lrt)
        dataMetrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
    }
}

str(dataMetrics, strict.width = "wrap")
List of 3
$ C_E:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000015308" "ENSMMUG00000004347" "ENSMMUG00000007338" "ENSMMUG00000050593" ... ..
..$ logFC : num [1:16689] -2.343 -1.451 -0.708 2.712 -1.236 ... ..
..$ logCPM: num [1:16689] -0.328 0.736 6.448 -1.045 3.786 ... ..
..$ LR : num [1:16689] 24.3 19.3 18 18 17.1 ... ..
..$ PValue: num [1:16689] 8.09e-07 1.12e-05 2.16e-05 2.21e-05 3.62e-05 ... ..
..$ FDR : num [1:16689] 0.0135 0.0921 0.0921 0.0921 0.114 ...
$ C_I:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000051566" "ENSMMUG00000008039" "ENSMMUG00000006643" "ENSMMUG00000052273" ... ..
..$ logFC : num [1:16689] -2.88 -2.1 -1.67 -2.07 -1.77 ... ..
..$ logCPM: num [1:16689] 5.33 3.88 6.19 7.39 5.56 ... ..
..$ LR : num [1:16689] 74.3 58.3 57.8 56.2 56.1 ... ..
..$ PValue: num [1:16689] 6.86e-18 2.24e-14 2.88e-14 6.67e-14 6.90e-14 ... ..
..$ FDR : num [1:16689] 1.14e-13 1.60e-10 1.60e-10 2.30e-10 2.30e-10 ...
$ E_I:'data.frame': 16689 obs. of 6 variables: ..
..$ ID : chr [1:16689] "ENSMMUG00000051566" "ENSMMUG00000050147" "ENSMMUG00000008039" "ENSMMUG00000006643" ... ..
..$ logFC : num [1:16689] -2.89 -2.11 -2.22 -1.67 -1.78 ... ..
..$ logCPM: num [1:16689] 5.33 5.37 3.88 6.19 5.96 ... ..
..$ LR : num [1:16689] 66.7 59.6 58.3 52.5 51.5 ... ..
..$ PValue: num [1:16689] 3.11e-16 1.16e-14 2.28e-14 4.30e-13 7.19e-13 ... ..
..$ FDR : num [1:16689] 5.19e-12 9.70e-11 1.27e-10 1.79e-09 2.21e-09 ...

data_st <- as.data.frame(t(apply(as.matrix(logcpm.df[,-1]), 1, scale)))
data_st$ID <- as.character(logcpm.df$ID)
data_st <- data_st[,c(length(data_st), 1:length(data_st)-1)]
colnames(data_st) <- colnames(logcpm.df)
nID <- which(is.nan(data_st[,2]))
data_st[nID,2:length(data_st)] <- 0
colList = c("#00A600FF", rainbow(5)[c(1,4,5,6)])

str(data_st)
'data.frame': 16689 obs. of 19 variables:
$ ID : chr "ENSMMUG00000064595" "ENSMMUG00000064596" "ENSMMUG00000052583" "ENSMMUG00000052581" ...
$ C.S49: num 0.529 -1.188 -0.383 -0.987 -0.408 ...
$ C.S75: num -1.713 1.658 -0.86 0.64 -0.892 ...
$ C.S47: num 0.9118 -0.9486 -0.916 0.0537 1.0305 ...
$ C.S76: num -0.9971 0.0278 -0.6915 -1.2873 0.0941 ...
$ C.S46: num 0.301 0.569 -0.249 0.496 -1.432 ...
$ C.S50: num 0.335 -0.219 1.689 1.342 -2.201 ...
$ C.S48: num 0.171 -0.94 -0.509 -1.139 1.209 ...
$ E.S55: num -1.5166 0.5469 -0.5164 0.0405 0.1081 ...
$ E.S54: num 1.054 0.832 0.296 0.876 0.815 ...
$ E.S56: num -0.318 1.608 -1.062 -0.686 0.943 ...
$ E.S52: num 1.33 -1.018 -0.565 0.525 1.083 ...
$ E.S53: num 1.25 -0.866 -0.727 -0.639 0.153 ...
$ E.S51: num 0.78 1.066 0.192 -1.216 0.97 ...
$ I.S77: num 1.14 0.47 -1.103 0.205 0.395 ...
$ I.S60: num -0.9896 0.0411 1.3277 0.9305 0.5659 ...
$ I.S58: num -1.058 0.871 0.666 -1.725 -1.174 ...
$ I.S57: num -0.336 -1.415 1.806 1.179 -0.27 ...
$ I.S59: num -0.875 -1.095 1.603 1.392 -0.992 ...

ret1 <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4,
  colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE,yAxisLabel = "Count (log CPM)", vxAxis = TRUE)
names(ret1)
tempdir()
print.ggExtraPlot1 <- function() { 
  grid.arrange(ret1[["C_E_4"]],top="Estradiol x Control")
  grid.newpage()
  grid.arrange(ret1[["C_I_4"]],top="Intact Ovary x Control")
  grid.newpage()
  grid.arrange(ret1[["E_I_4"]],top="Estradiol x Intact Ovary")
} 

print.ggExtraPlot1()
lindsayrutter commented 4 years ago

Hello PDK:

Thanks for pointing out this issue. I updated the plotClusters() function to better allow for both functions you suggested in the case of data with more than 2 treatment groups. You should now be able to plot the boxplot/parallel-coordinate-plots only the given pair of treatment groups at a time or for all treatment groups at once. I added a section on the website here with four reproducible examples. Feel free to let me know if any aspect of those four examples do not make sense.

In terms of the specific code you provided, below is a minimal working example with simulated data similar to yours creating similar four examples.

################## Download latest changes from GitHub ################## 
install_github("lindsayrutter/bigPint")

################## Make data ################## 
set.seed(1)
logcpm.df = data.frame(ID=paste0(replicate(2000, "ID"), 1:2000), C.S49 = round(100*runif(2000)), C.S75 = round(100*runif(2000)), C.S47 = round(100*runif(2000)), C.S76 = round(100*runif(2000)), C.S46 = round(100*runif(2000)), C.S50 = round(100*runif(2000)), C.S48 = round(100*runif(2000)), E.S55 = round(100*runif(2000)), E.S54 = round(100*runif(2000)), E.S56 = round(100*runif(2000)), E.S52 = round(100*runif(2000)), E.S53 = round(100*runif(2000)), E.S51 = round(100*runif(2000)), I.S77 = round(100*runif(2000)), I.S60 = round(100*runif(2000)), I.S58 = round(100*runif(2000)), I.S57 = round(100*runif(2000)), I.S59 = round(100*runif(2000)))
logcpm.df$ID = as.character(logcpm.df$ID)
rownames(logcpm.df) = logcpm.df$ID

################## Make dataMetrics ################## 
y = DGEList(counts=logcpm.df[,-1])
group = c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
y = DGEList(counts=y, group=group)
Group = factor(c("C","C","C","C","C","C","C","E","E","E","E","E","E","I","I","I","I","I"))
design <- model.matrix(~0+Group, data=y$samples)
colnames(design) <- levels(Group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design) # changed y to d

dataMetrics <- list()
for (i in 1:(ncol(fit)-1)){
    for (j in (i+1):ncol(fit)){
        contrast=rep(0,ncol(fit))
        contrast[i]=1
        contrast[j]=-1
        lrt <- glmLRT(fit, contrast=contrast)
        lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
        setDT(lrt, keep.rownames = TRUE)[]
        colnames(lrt)[1] = "ID"
        lrt <- as.data.frame(lrt)
        dataMetrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
    }
}

################## Make data_st ##################
data_st <- as.data.frame(t(apply(as.matrix(logcpm.df[,-1]), 1, scale)))
data_st$ID <- as.character(logcpm.df$ID)
data_st <- data_st[,c(length(data_st), 1:length(data_st)-1)]
colnames(data_st) <- colnames(logcpm.df)
nID <- which(is.nan(data_st[,2]))
data_st[nID,2:length(data_st)] <- 0
colList = c("#00A600FF", rainbow(5)[c(1,4,5,6)])

################## Make four example plots ##################
retTF <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE, showPairs = FALSE, vxAxis = TRUE)

grid.draw(retTF[["C_E_4"]])

retTT <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = TRUE, showPairs = TRUE, vxAxis = TRUE)

grid.draw(retTT[["C_E_4"]])

retFF <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = FALSE, showPairs = FALSE, vxAxis = TRUE)

grid.draw(retFF[["C_E_4"]])

retFT <- plotClusters(data_st, dataMetrics, threshVar="PValue" ,threshVal = 0.05, nC = 4, colList = colList, lineSize = 0.5, verbose = TRUE, clusterAllData = FALSE, showPairs = TRUE, vxAxis = TRUE)

grid.draw(retFT[["C_E_4"]])

I updated the development version of bigPint on Bioconductor from version 1.5.0 to version 1.5.1 to add these new features. The new version may take several days to be available for download on the website (which, as of now, is still version 1.5.0), so if you want to run the updated code right now, you may want to temporarily use the latest version available on the bigPint GitHub page (by running install_github("lindsayrutter/bigPint") (see the first line in the example code above).

After testing it out, feel free to let me know if you have any questions and/or if the updated code works for your task. Thanks again.

Sincerely, Lindsay

parcbioinfo commented 4 years ago

Thanks, Lindsay. I really appreciate it. I will test it out and let you know.

parcbioinfo commented 4 years ago

It worked like a charm. Thanks, Lindsay.

lindsayrutter commented 4 years ago

Wonderful. Thanks, PDK.