Single Cell Cluster Deletion

gdreiman1 commented 7 years ago

In the 2016 Nature Protocols paper, the troubleshooting section explains that SPADE drops any cluster containing a single cell from the graph:

When the number of input cells is close to the value of k (for example, when we used SPADE to visualize single-cell RNA-seq data with close to single-cell resolution), fewer clusters will be produced than were expected on the basis of the input k-value. This occurs because many clusters contain only one cell, and SPADE drops any clusters that contain only one cell. The ‘SPADE.cluster’ function can be easily modified to avoid this behavior.

I've been trying to find and modify this function so that I can run SPADE on a small (~100 cell) dataset. However, I can't find a function named SPADE.cluster in the package I downloaded. I did find a file named cluster.R (on the GitHub page) that seems to have a section checking cluster size in lines 15-28:

    # Invalid clusters have assgn == 0
    centers = c()
    is.na(clust$assgn) <- which(clust$assgn == 0)
    for (i in c(1:max(clust$assgn, na.rm=TRUE))) {  
        obs <- which(clust$assgn == i)
        if (length(obs) > 1) {
            centers <- rbind(centers,colMeans(tbl[obs,,drop=FALSE]))
            clust$assgn[obs] <- nrow(centers)
        } else {
            is.na(clust$assgn) <- obs
        }
    }
    return(list(centers=centers,assign=clust$assgn,hclust=cluster))
}

Is this the code that I need to modify? And if so how would I modify it and then incorporate the modification into my downloaded SPADE package?

My initial instinct is to change if (length(obs) > 1) to if (length(obs) >= 1), but I'm not sure if colMeans() will work if obs has a single entry.

zbjornson commented 7 years ago

That's the correct file, and I think that's a good change to try!

SamGG commented 7 years ago

Hi, thanks for pointing out the article. You did indeed find the right function and code. The name of the function is given on the 2nd line of the cluster.R file: https://github.com/nolanlab/spade/blob/master/R/cluster.R#L2

Your proposal of changing > to >= is correct. Because of drop=FALSE, a single observation is kept as a one-row matrix, so colMeans will work.

If you are used to working with packages in RStudio (or wish to), I think you should try this official way. The quick and dirty way consists of directly modifying the code of the cluster.R file in your installed SPADE library. You can find the right directory using path.package("spade") (after you have installed the SPADE package).

I am not a fan of using SPADE for small single-cell RNA-seq data analysis. See the comments at https://github.com/nolanlab/spade/issues/122#issuecomment-198071225 and https://github.com/nolanlab/spade/issues/128#issuecomment-237693735 for alternatives. HTH
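The drop=FALSE point can be checked on toy data. The following is only a sketch: the matrix values are made up, and recompute_centers and its min_size argument are hypothetical names wrapping the loop quoted from cluster.R, not part of the SPADE package. With min_size = 2 it mimics the original `> 1` test; with min_size = 1 it mimics the proposed `>= 1` test.

```r
# Toy data: 5 cells x 2 markers (made-up values, not SPADE output).
tbl <- matrix(c(1, 2, 3, 4, 5,
                10, 20, 30, 40, 50),
              ncol = 2, dimnames = list(NULL, c("m1", "m2")))

# With drop = FALSE a single row stays a 1 x 2 matrix, so colMeans works.
single <- colMeans(tbl[3, , drop = FALSE])

# Without drop = FALSE the row collapses to a vector and colMeans fails.
dropped <- tryCatch(colMeans(tbl[3, ]), error = function(e) "error")

# Hypothetical standalone version of the loop quoted from cluster.R.
# min_size = 2 reproduces `length(obs) > 1`; min_size = 1 keeps singletons.
recompute_centers <- function(tbl, assgn, min_size = 1) {
  centers <- c()
  is.na(assgn) <- which(assgn == 0)   # invalid clusters have assgn == 0
  for (i in seq_len(max(assgn, na.rm = TRUE))) {
    obs <- which(assgn == i)
    if (length(obs) >= min_size) {
      centers <- rbind(centers, colMeans(tbl[obs, , drop = FALSE]))
      assgn[obs] <- nrow(centers)     # renumber clusters consecutively
    } else {
      is.na(assgn) <- obs             # drop clusters that are too small
    }
  }
  list(centers = centers, assign = assgn)
}

assgn <- c(1, 1, 2, 3, 3)             # cluster 2 holds a single cell
kept  <- recompute_centers(tbl, assgn, min_size = 1)
orig  <- recompute_centers(tbl, assgn, min_size = 2)
```

Here kept retains all three clusters, while orig drops the single-cell cluster and marks that cell's assignment as NA.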

gdreiman1 commented 7 years ago

Thanks for the quick replies! @SamGG Can you expand on the "quick and dirty way"? The issue I'm having is that I can't find any cluster.R in my SPADE library. I'm not sure why this is. Is there a way to download the cluster.R file from github to use in my library?

SamGG commented 7 years ago

Ooops, sorry, I was completely wrong: the quick and dirty way does not work. So the proper way is to build the new package:

1. On GitHub, go to the spade repo https://github.com/nolanlab/spade and click the Fork button at the top right; you now have your own copy of spade.
2. Install RStudio if you have not already done so.
3. Create a new project, select Version Control in the wizard, then select Git, then enter the URL of your spade fork; you get this URL by clicking the green Clone button in the spade repo of your account at https://github.com/gdreiman1/spade. Alternatively, you can enter the git URL of the nolanlab/spade repo, but using your own repo will allow you to save and share your changes.
4. Finally, click Create. This creates the project and downloads the repo to your disk. You are nearly done. Notice that you now have two new tabs in the Environment pane: Build and Git. Git lets you commit your changes to your local repo and then push those commits to GitHub if you want. Build lets you build the package. This is what you are going to use.
5. You now have access to all the files of the spade package and can apply your change.
6. When you are ready to test your changes, select the Build tab and then either click Build & reload or, in the More menu, click Clean and rebuild.
7. If everything goes OK, you should see a message saying that the package has been installed. Be careful: if the package is already in use in another R/RStudio session, the package won't install, because some file (usually a .dll) is already loaded in memory. So if you want to test spade with your data while changing the code, it is best to do the testing in the spade project's own session.
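For those who prefer the R console to the Build tab, the last two steps might be sketched as below. This is an assumption-laden sketch, not the documented SPADE workflow: is_loaded and reinstall_from_source are hypothetical helper names, "~/spade" is a hypothetical path to a local checkout of the fork, and the check only sees the current R session, so it cannot detect a copy of the package loaded in another session.

```r
# TRUE if the package's namespace is loaded in THIS R session.
is_loaded <- function(pkg) pkg %in% loadedNamespaces()

# Hypothetical helper: refuse to reinstall while the package is loaded here,
# otherwise install from a local source directory using base R.
reinstall_from_source <- function(pkg, src_dir) {
  if (is_loaded(pkg)) {
    stop("Package '", pkg, "' is loaded; restart R before reinstalling.")
  }
  install.packages(src_dir, repos = NULL, type = "source")
}

# Example call (not run here; the path is an assumption):
# reinstall_from_source("spade", "~/spade")
```

Installing from source this way needs the same build tools that RStudio's Build tab relies on.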

Happy coding, HTH

gdreiman1 commented 7 years ago

Got it working! Thanks for such a thorough explanation, I really appreciate it.

SamGG commented 7 years ago

Although it is outside the scope of SPADE, have a look at CIDR (http://www.biorxiv.org/content/early/2016/08/10/068775, https://github.com/VCCRI/CIDR), which challenges state-of-the-art methods, including tSNE.

nolanlab / spade

Single Cell Cluster Deletion #129