nolanlab / spade

SPADE: Spanning Tree Progression of Density Normalized Events

Single Cell Cluster Deletion #129

Closed by gdreiman1 7 years ago

gdreiman1 commented 7 years ago

In the 2016 Nature Protocols paper, the troubleshooting section explains that SPADE drops any cluster containing a single cell from the graph:

When the number of input cells is close to the value of k (for example, when we used SPADE to visualize single-cell RNA-seq data with close to single-cell resolution), fewer clusters will be produced than were expected on the basis of the input k-value. This occurs because many clusters contain only one cell, and SPADE drops any clusters that contain only one cell. The ‘SPADE.cluster’ function can be easily modified to avoid this behavior.

I've been trying to find and modify this function so that I can run SPADE on a small (~100 cell) dataset. However, I can't find a function named SPADE.cluster in the package I downloaded. I did find a file named cluster.R (on the GitHub page) that seems to check cluster size on lines 15-28:

# Invalid clusters have assgn == 0
    centers = c()
    is.na(clust$assgn) <- which(clust$assgn == 0)
    for (i in c(1:max(clust$assgn, na.rm=TRUE))) {  
        obs <- which(clust$assgn == i)
        if (length(obs) > 1) {
            centers <- rbind(centers,colMeans(tbl[obs,,drop=FALSE]))
            clust$assgn[obs] <- nrow(centers)
        } else {
            is.na(clust$assgn) <- obs
        }
    }
    return(list(centers=centers,assign=clust$assgn,hclust=cluster))
}

Is this the code that I need to modify? And if so how would I modify it and then incorporate the modification into my downloaded SPADE package?

My initial instinct is to change if (length(obs) > 1) to if (length(obs) >= 1), but I'm not sure if colMeans() will work if obs has a single entry.

zbjornson commented 7 years ago

That's the correct file, and I think that's a good change to try!
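To see what the proposed change does, here is a minimal sketch of the same loop on toy data. It is not the package's code: `compute_centers` and its `min_size` parameter are hypothetical names introduced for illustration, and base R's `hclust` stands in for SPADE's own clustering call. With `min_size = 2` the loop behaves like the original `> 1` test; with `min_size = 1` it behaves like the proposed `>= 1`.

```r
# Hypothetical helper mirroring the loop quoted above, with the size
# threshold exposed as `min_size` (an assumption, not part of SPADE).
compute_centers <- function(tbl, assgn, min_size = 2) {
  centers <- c()
  # Invalid clusters have assgn == 0
  is.na(assgn) <- which(assgn == 0)
  for (i in seq_len(max(assgn, na.rm = TRUE))) {
    obs <- which(assgn == i)
    if (length(obs) >= min_size) {
      # drop = FALSE keeps a one-row selection as a matrix for colMeans()
      centers <- rbind(centers, colMeans(tbl[obs, , drop = FALSE]))
      assgn[obs] <- nrow(centers)
    } else {
      # Clusters that are too small are marked NA and get no center
      is.na(assgn) <- obs
    }
  }
  list(centers = centers, assign = assgn)
}

# Five toy cells: two tight pairs and one isolated cell
tbl <- matrix(c(0, 0,
                0, 1,
                10, 10,
                20, 20,
                20, 21),
              ncol = 2, byrow = TRUE, dimnames = list(NULL, c("m1", "m2")))

# Base hclust used here only as a stand-in clustering step
assgn <- cutree(hclust(dist(tbl)), k = 3)

dropped <- compute_centers(tbl, assgn, min_size = 2)  # like `> 1`
kept    <- compute_centers(tbl, assgn, min_size = 1)  # like `>= 1`

print(nrow(dropped$centers))
print(nrow(kept$centers))
```

With the original threshold the isolated cell ends up with an `NA` assignment and one fewer center is returned; with the relaxed threshold every requested cluster gets a center.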

SamGG commented 7 years ago

Hi, thanks for pointing out the article. You did find the right function and code: the function name is given on the 2nd line of the cluster.R file, https://github.com/nolanlab/spade/blob/master/R/cluster.R#L2. Your proposal to change > to >= is correct. Because of drop=FALSE, a single observation is kept as a one-row matrix, so colMeans will work.

If you are used to working with packages in RStudio (or wish to start), I think you should try the official way, i.e. rebuilding the package from source. The quick and dirty way is to modify cluster.R directly in your installed SPADE library; you can find the directory with path.package("spade") (after installing the SPADE package).

I am not a fan of using SPADE for small single-cell RNA-seq data analysis. See the comments at https://github.com/nolanlab/spade/issues/122#issuecomment-198071225 and https://github.com/nolanlab/spade/issues/128#issuecomment-237693735 for alternatives. HTH
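The drop=FALSE point is easy to check on a toy matrix. The sketch below uses made-up data (the matrix and its column names are assumptions for illustration) and compares indexing a single row with and without drop=FALSE.

```r
# Toy 3 x 3 table standing in for the expression matrix `tbl`
tbl <- matrix(c(1, 2, 3,
                4, 5, 6,
                7, 8, 9),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

obs <- 2L  # a "cluster" containing a single observation

# With drop = FALSE the selection stays a 1 x 3 matrix, so colMeans() works
single <- tbl[obs, , drop = FALSE]
centers <- colMeans(single)

# Without drop = FALSE the selection collapses to a plain vector,
# which colMeans() rejects because it needs at least two dimensions
failed <- inherits(try(colMeans(tbl[obs, ]), silent = TRUE), "try-error")

print(dim(single))
print(centers)
print(failed)
```

For a single-row selection the column means are simply that row's values, so a one-cell cluster becomes its own center.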

gdreiman1 commented 7 years ago

Thanks for the quick replies! @SamGG Can you expand on the "quick and dirty way"? The issue I'm having is that I can't find any cluster.R in my SPADE library. I'm not sure why this is. Is there a way to download the cluster.R file from github to use in my library?

SamGG commented 7 years ago

Oops, sorry, I was completely wrong: the quick and dirty way does not work. The proper way is to build a new package from the modified source.
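The detailed steps from this comment are not preserved in this copy of the thread. As a rough sketch only: an installed R package normally keeps its code in a lazy-load database rather than as .R files, which is most likely why cluster.R is missing from the library directory. One way to proceed, assuming a local copy of the repository source in a hypothetical directory named `spade`, is to patch R/cluster.R there and install from that directory. `patch_cluster_file` is a hypothetical helper, not part of SPADE.

```r
# Hypothetical helper: relax the cluster-size test in a local copy of the
# package source. `src_dir` is assumed to be the top-level source directory.
patch_cluster_file <- function(src_dir) {
  f <- file.path(src_dir, "R", "cluster.R")
  code <- readLines(f)
  hits <- grep("if (length(obs) > 1)", code, fixed = TRUE)
  if (length(hits) != 1) {
    stop("expected exactly one matching line in ", f)
  }
  code[hits] <- sub("> 1", ">= 1", code[hits], fixed = TRUE)
  writeLines(code, f)
  invisible(hits)
}

# Assumed location of the downloaded source; adjust to your own path.
src_dir <- "spade"

if (dir.exists(src_dir)) {
  patch_cluster_file(src_dir)
  # Install the modified source over the existing installation.
  # A compiler toolchain may be needed if the package has compiled code.
  install.packages(src_dir, repos = NULL, type = "source")
}
```

After installing, restart R and reload the package so that the modified function is the one in use.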

Happy coding, HTH

gdreiman1 commented 7 years ago

Got it working! Thanks for such a thorough explanation, I really appreciate it.

SamGG commented 7 years ago

Although it is outside the scope of SPADE, have a look at CIDR (paper: http://www.biorxiv.org/content/early/2016/08/10/068775, code: https://github.com/VCCRI/CIDR), which challenges state-of-the-art methods, including tSNE.