markrobinsonuzh / cytofWorkflow

MIT License

plotClusterExprs memory and data extraction #23

antoine4ucsd closed 3 years ago

antoine4ucsd commented 3 years ago

Hello, I am using the CyTOF workflow to process a large dataset of flow data. Everything works fine apart from the cluster expression density plot: RStudio runs out of memory. Is there a way to avoid that (increasing the memory limit, or down-sampling)? I was able to run it once and got the attached results. Is there also a way to extract the marker 'density' data for all clusters from the sce input?

Thank you.

[attached image: _output]

markrobinsonuzh commented 3 years ago

Dear @antoine4ucsd

Just to be clear .. you are running plotClusterExprs(), right?

If you look at the code ..

> plotClusterExprs
function (x, k = "meta20", features = "type") 
{
    .check_sce(x, TRUE)
    k <- .check_k(x, k)
    x$cluster_id <- cluster_ids(x, k)
    features <- .get_features(x, features)
    ms <- t(.agg(x[features, ], "cluster_id", "median"))
    d <- dist(ms, method = "euclidean")
    o <- hclust(d, method = "average")$order
    cd <- colData(x)
    es <- assay(x[features, ], "exprs")
    df <- data.frame(t(es), cd, check.names = FALSE)
    df <- melt(df, id.vars = names(cd), variable.name = "antigen", 
        value.name = "expression")
    df$avg <- "no"
    avg <- df
    avg$cluster_id <- "avg"
    avg$avg <- "yes"
    df <- rbind(df, avg)
    fq <- tabulate(x$cluster_id)/ncol(x)
    fq <- round(fq * 100, 2)
    names(fq) <- levels(x$cluster_id)
    df$cluster_id <- factor(df$cluster_id, levels = rev(c("avg", 
        levels(x$cluster_id)[o])), labels = rev(c("average", 
        paste0(names(fq), " (", fq, "%)")[o])))
    ggplot(df, aes_string(x = "expression", y = "cluster_id", 
        col = "avg", fill = "avg")) + facet_wrap(~antigen, scales = "free_x", 
        nrow = 2) + geom_density_ridges(alpha = 0.2) + theme_ridges() + 
        theme(legend.position = "none", strip.background = element_blank(), 
            strip.text = element_text(face = "bold"))
}
<bytecode: 0x7fb66d6f4ba8>
<environment: namespace:CATALYST>

.. the density calculation is actually done by ggridges::geom_density_ridges() and so is external to CATALYST.

But I suppose the calculation that geom_density_ridges() is doing could be made less memory-intensive by manually looping through each marker and cluster and calculating the densities with the density() function, or something like that.
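A minimal sketch of that looping idea, which would also give the per-marker, per-cluster density values asked about in the question. It works on a plain markers-by-cells matrix and a cluster factor, one marker and one cluster at a time. The helper name cluster_densities() and the simulated inputs are assumptions made for illustration; they are not part of CATALYST.

```r
# Toy stand-ins for the real inputs. Assumption: with a real object these
# would be taken as in the function body above, e.g.
#   es <- assay(x[features, ], "exprs"); cl <- cluster_ids(x, "meta20")
set.seed(1)
es <- matrix(rnorm(3 * 1000), nrow = 3,
             dimnames = list(c("CD3", "CD4", "CD8"), NULL))
cl <- factor(sample(paste0("c", 1:4), 1000, replace = TRUE))

# Hypothetical helper (not a CATALYST function): one kernel density
# estimate per marker and cluster, evaluated on n grid points.
cluster_densities <- function(es, cl, n = 256) {
  out <- list()
  for (m in rownames(es)) {
    # Use one grid per marker so curves are comparable across clusters.
    rng <- range(es[m, ])
    for (k in levels(cl)) {
      v <- es[m, cl == k]
      # density() cannot pick a bandwidth from fewer than 2 points.
      if (length(v) < 2) next
      d <- density(v, n = n, from = rng[1], to = rng[2])
      out[[paste(m, k)]] <- data.frame(
        antigen = m, cluster_id = k,
        expression = d$x, density = d$y
      )
    }
  }
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res
}

dens <- cluster_densities(es, cl)
head(dens)
```

The result holds only n rows per marker and cluster, regardless of the number of cells, so it can be saved or plotted (for example with geom_line() or ggridges' stat = "identity" ridgelines) without passing every cell to ggplot.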

As you mention, down-sampling should also work, and especially well for the bigger clusters, so you might want to down-sample in a cluster-wise fashion.
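Cluster-wise down-sampling could be sketched as follows: cap every cluster at a fixed number of cells and keep small clusters whole. The helper name downsample_by_cluster(), the cap of 500 cells, and the simulated cluster labels are assumptions for illustration only.

```r
# Toy cluster labels with very unequal sizes. Assumption: with a real
# object this would be cl <- cluster_ids(x, "meta20").
set.seed(1)
cl <- factor(sample(c("a", "b", "c"), 10000, replace = TRUE,
                    prob = c(0.90, 0.09, 0.01)))

# Hypothetical helper (not a CATALYST function): returns the indices of
# the cells to keep, with at most max_cells cells per cluster.
downsample_by_cluster <- function(cl, max_cells = 500) {
  idx <- split(seq_along(cl), cl)
  keep <- lapply(idx, function(i) {
    # Index via sample.int() so a cluster with a single cell is handled
    # correctly (sample(i, ...) would treat a length-1 i as 1:i).
    if (length(i) > max_cells) i[sample.int(length(i), max_cells)] else i
  })
  sort(unlist(keep, use.names = FALSE))
}

keep <- downsample_by_cluster(cl, max_cells = 500)
table(cl[keep])

# Assumption: the SingleCellExperiment is then subset by column before
# plotting, e.g.
#   x_sub <- x[, keep]
#   plotClusterExprs(x_sub, k = "meta20", features = "type")
```

One thing to keep in mind: in the function body above, the percentages in the cluster labels (fq) are computed from the object passed in, so after down-sampling they would describe the subsampled cells rather than the original cluster frequencies.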

Best, Mark

antoine4ucsd commented 3 years ago

Thank you for your detailed answer, it is really helpful. I will look into my SingleCellExperiment object and try down-sampling.

Best,