lgatto / pRoloc

A unifying bioinformatics framework for organelle proteomics
http://lgatto.github.io/pRoloc/
15 stars 13 forks source link

Function to plot average marker profile as heatmap #130

Closed TomSmithCGAT closed 4 years ago

TomSmithCGAT commented 5 years ago

I'm currently comparing sedimentation-based LOPIT and density-based LOPIT approaches and wanted some easy to understand visualisations that will help to see how the organelles/localisations separate differently by these two physiochemical properties. Normally, I would use linear profile plots (e.g plotDist) to make such a comparison. But this gets very messy when plotting all marker sets, even if just presenting the average profile (see Figure 1 below).


dlopit_marker_profile_linear

Figure 1. Linear representation of average marker profile for dLOPIT


I've settled on a heatmap representation instead as I find this much easier to compare the profiles between organelles and also between different LOPITs (See Figure 2 & 3 below). As pointed out by @lmsimp, these are akin to Western Blots, except that the profile is derived from all markers rather than a single protein.

If this is something that can't already be done within pRoloc, would you consider adding it? I've included the code used to make these plots below but I'm sure there are internal pRoloc functions I should have used. The plotting is done via ggplot but I guess this would better done with base R for pRoloc. Also, I wanted to order the marker sets according to a simple heirachical clustering. I couldn't see how to return the hlust object using mrkHClust so within the plot function there is a clustering with hlcust which would be best removed and replaced with a optional argument to order the markers by a user defined order.

If you think we could add this to pRoloc, I'm happy to issue a pull request. I would appreciate some pointers on where to modify the code to keep it in line with current pRoloc functions though.


slopit_marker_profile_heatmap

Figure 2. Marker profiles heatmap: sLOPIT


dlopit_marker_profile_heatmap Figure 3. Marker profiles heatmap: dLOPIT


plotAverageMarkerProfilesHeatmap <- function(obj, fcol="markers"){
  exprs_df <- exprs(obj)
  long_exprs_df <- melt(exprs_df)

  colnames(long_exprs_df) <- c("feature", "fraction", "intensity")

  # nrows(long_exprs_df) == nrows(fData(obj))*ncols(exprs_df) so this is OK
  long_exprs_df$markers <- getMarkers(obj, fcol = fcol, verbose = FALSE) 

  long_exprs_df <- long_exprs_df %>%
    filter(markers!="unknown") %>% # pointless averaging unknown proteins in this representation
    group_by(markers, fraction) %>%
    summarise(intensity=mean(intensity))

  # Get marker set ordering according to hierarchical clustering
  # As currently implemented, requires reshape to wide-format matrix for dist
  # Would ideally use mrkHClust for this if could return hclust object from mrkHClust
  average_profile_matrix <- long_exprs_df %>% spread(key=fraction, value=intensity) %>% data.frame()
  rownames(average_profile_matrix) <- average_profile_matrix$markers

  average_profile_matrix <- average_profile_matrix %>% dplyr::select(-markers)

  clust <- hclust(dist(as.matrix(average_profile_matrix)))
  my_order <- rownames(average_profile_matrix)[clust$order]

  # Relevel factor by new marker set order
  long_exprs_df$markers <- factor(long_exprs_df$markers, my_order)

  # plot
  p <- ggplot(long_exprs_df, aes(factor(fraction), markers, fill=intensity)) +
    geom_tile() +
    scale_fill_continuous(low="white", high="#56B4E9", limits=c(0,NA), name="Normalised\nabundance") +
    theme_bw() +
    xlab("Fraction") +
    ylab("") +
    theme(text=element_text(size=12),
          axis.text.x=element_text(angle=45, vjust=1, hjust=1),
          aspect.ratio=1)

  return(p)
}
lgatto commented 5 years ago

What you suggest is essentially the following:

> library("pRoloc")
> library("pRolocdata")
> data(dunkley2006)
> x <- mrkConsProfiles(dunkley2006)
> lattice::levelplot(t(x))

using ggplot2. I would be happy to accept a pull request for a function to nicely visualise consensus profiles, but it is important to not used non-standard evaluation in functions.

As for I couldn't see how to return the hlust object using mrkHClust, you just need to store the return value of the function, as documented in the man page:


> x <- mrkHClust(dunkley2006)
> x
                   M1F1A     M1F4A     M1F7A    M1F11A      M1F2B     M1F5B
ER lumen      0.34790193 0.2778745 0.2000145 0.1743091 0.49310871 0.2030229
ER membrane   0.26954542 0.3094276 0.2201106 0.2010234 0.37832533 0.2433542
Golgi         0.10673314 0.2228814 0.3508332 0.3196275 0.12041797 0.2295954
Mitochondrion 0.09008695 0.1864223 0.2975309 0.4259333 0.09515331 0.1942079
Plastid       0.08200481 0.1499930 0.2870436 0.4809921 0.08937813 0.1255232
PM            0.18493326 0.2926176 0.2368301 0.2856681 0.28288228 0.2230173
                  M1F8B    M1F11B      M2F1A      M2F4A     M2F7A    M2F11A
ER lumen      0.1632424 0.1405430 0.38528500 0.30223829 0.1643254 0.1480823
ER membrane   0.1990741 0.1793234 0.24821447 0.34158018 0.2421851 0.1680087
Golgi         0.3546319 0.2952720 0.06979120 0.24087721 0.4075620 0.2818049
Mitochondrion 0.3043015 0.4063777 0.05297090 0.15590611 0.3271555 0.4639317
Plastid       0.3327774 0.4523463 0.05931388 0.09338429 0.2421152 0.6051806
PM            0.2244869 0.2695825 0.14423628 0.25059815 0.2784118 0.3266328
                   M2F2B      M2F5B     M2F8B    M2F11B
ER lumen      0.41181264 0.28204521 0.1614619 0.1446546
ER membrane   0.29584151 0.30213218 0.2223314 0.1797237
Golgi         0.09211270 0.25888696 0.3764906 0.2726121
Mitochondrion 0.05592373 0.18116613 0.3156597 0.4472919
Plastid       0.06704418 0.09906525 0.3149460 0.5190335
PM            0.17748167 0.22124237 0.2801925 0.3210563
 [ reached getOption("max.print") -- omitted 3 rows ]
TomSmithCGAT commented 5 years ago

Thanks @lgatto. I had completely missed mrkConsProfiles. This is exactly what I was after!

I should have been clearer about mrkHClust though. I looked at the source code and ascertained that it was returning the fmat object containing the average profiles, whereas I was after the hc object that holds the dendrogram. Looking at the two functions together, it seems to me there's some code repetition. The top of mrkHClust:

object <- markerMSnSet(object)
mm <- getMarkers(object, fcol = fcol, verbose = FALSE)
umm <- levels(factor(mm))
fmat <- matrix(NA, nrow = length(umm), ncol = ncol(object))
rownames(fmat) <- umm
colnames(fmat) <- sampleNames(object)
for (m in umm) fmat[m, ] <- colMeans(exprs(object[mm == m, 
    ]), na.rm = TRUE)

is essentially the same as mrkConsProfiles unless I'm mistaken? (With the exception that mrkConsProfiles allows non mean averaging)

<environment: namespace:pRoloc>
function (object, fcol = "markers", method = mean) 
{
    object <- markerMSnSet(object, fcol)
    cl <- getMarkerClasses(object, fcol)
    ind <- lapply(cl, function(x) which(fData(object)[, fcol] == 
        x))
    names(ind) <- cl
    profs <- lapply(ind, function(x) exprs(object)[x, , drop = FALSE])
    mm <- t(sapply(profs, function(z) apply(z, 2, method)))
    return(mm)
}

To my mind, it would be more logical for mrkHClust to just perform the clustering and to return the dendrogram object rather than the average profile. I can see this change would introduce backward compatibility issues though. The following would solve that (at the cost of introducing a new function)

library("pRoloc")
library("pRolocdata")
data(tan2009r1)

getMrkHClust <- function(fmat, distargs, hclustargs, plot = TRUE, ...){
    if (missing(distargs)) 
        distargs <- list(x = fmat)
    else distargs <- c(list(x = fmat), distargs)
    d <- do.call(dist, distargs)
    if (missing(hclustargs)) 
        hclustargs <- list(d = d)
    else hclustargs <- c(list(d = d), hclustargs)
    hc <- do.call(hclust, hclustargs)
    hc <- as.dendrogram(hc)
    i <- match(labels(hc), rownames(fmat))
    hc <- set(hc, "labels_colors", getStockcol()[i])
    invisible(hc)
}

mrkHClust <- function(object, fcol="markers", distargs, hclustargs, plot = TRUE, ...){
    fmat <- mrkConsProfiles(object, fcol, method="mean")
    hc <- getMrkHClust(fmat, distargs, hclustargs)
    if (plot) 
        plot(hc, ...)
    invisible(fmat)
}

mrkHClust(tan2009r1)

This might seem to be getting off topic but it would help for me when it comes to making the above plots which would now be as simple as


plotAverageMarkerProfilesHeatmap <- function(fmat, order=NULL){

  fmat <- melt(fmat) # uses reshape2. Is there a simple base R equivalent to melt?
  colnames(fmat) <- c("feature", "fraction", "intensity") # replace generic names given by melt

  if(!missing(order)){
    fmat$feature <- factor(fmat$feature, levels)
  }

  p <- ggplot(fmat, aes(factor(fraction), feature, fill=intensity)) +
    geom_tile() +
    scale_fill_continuous(low="white", high="#56B4E9", limits=c(0,NA), name="Normalised\nabundance") +
    theme_bw() +
    xlab("Fraction") +
    ylab("") +
    theme(text=element_text(size=12),
          axis.text.x=element_text(angle=45, vjust=1, hjust=1),
          aspect.ratio=1)
  print(p)  
}

fmat <- mrkConsProfiles(tan2009r1, fcol="markers")
dc <- getMrkHClust(fmat) # Can now return dendrogram

# extract order from dendrogram. Arguably this would be better done within the plotting
# function and the dendrogram plotted on the side of the heatmap
mm <- getMarkers(tan2009r1, fcol="markers", verbose = FALSE)
umm <- levels(factor(mm))
order <- umm[order.dendrogram(dc)]

plotAverageMarkerProfilesHeatmap(fmat)

When we have reached agreement on the correct approach, I'm happy to make a PR for this plot function.

lgatto commented 5 years ago

The latest version available on github has been modified so that mrkHClust now returns the hclust dendrogram object. Regarding reshape2, see some examples here for some basic R code to emulate melt.