hemberg-lab / SC3

A tool for the unsupervised clustering of cells from single cell RNA-Seq experiments
http://bioconductor.org/packages/SC3
GNU General Public License v3.0

Large number of NA's when run on large dataset. #68

Closed TChan92 closed 5 years ago

TChan92 commented 6 years ago

Hi, I'm trying to run SC3 on a SingleCellExperiment which is a combination of the following datasets:

- CD14+ Monocytes
- CD19+ B Cells
- CD34+ Cells
- CD4+ Helper T Cells
- CD4+/CD25+ Regulatory T Cells
- CD4+/CD45RA+/CD25- Naive T cells
- CD4+/CD45RO+ Memory T Cells
- CD56+ Natural Killer Cells
- CD8+ Cytotoxic T cells
- CD8+/CD45RA+ Naive Cytotoxic T Cells

from https://support.10xgenomics.com/single-cell-gene-expression/datasets.

After running SC3, I get a large number of NAs in the results. table(colData(sc3_result)$sc3_2_clusters, exclude = NULL) gives 3360 cells in cluster 1, 1640 cells in cluster 2, and 89655 NAs.

I've gotten SC3 to run successfully before on smaller datasets, such as the Frozen PBMCs (Donor A) from the above link.

I've tried running SC3 multiple times, and also get similar issues with other large datasets.

wikiselev commented 6 years ago

Hello, can you provide exact commands you use to run SC3? Also, exactly which dataset provided NAs?

TChan92 commented 6 years ago

The same issue can be reproduced on a smaller dataset, found here: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/frozen_pbmc_donor_c. To reproduce it, set data_dir to the directory containing the 3 files from that dataset and run the following code:

```r
library(SingleCellExperiment)
library(SC3)
library(scater)

# data_dir: directory containing the 3 files from the dataset above
pbmc_data = read10xResults(data_dir)

min_cluster = 2
max_cluster = 10
cores = 14

sce = SingleCellExperiment(
  assays = list(
    counts = as.matrix(assays(pbmc_data)$counts),
    logcounts = log2(as.matrix(assays(pbmc_data)$counts) + 1)
  ),
  colData = colData(pbmc_data)
)

rowData(sce)$feature_symbol = rownames(sce)
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]

system.time(sc_result <- sc3(sce, ks = min_cluster:max_cluster, biology = TRUE, n_cores = cores, rand_seed = 0))

print(table(colData(sc_result)$sc3_2_clusters, exclude = NULL))
```

pati-ni commented 6 years ago

Hi @TChan92, thanks for reporting the issue. To help us reproduce the error, it would be a great help if you could provide an rds file of the sce object, saved after your data has been loaded and the SingleCellExperiment has been generated. You can do that with saveRDS(sce, 'my_sce.rds')

You can send a URL to your object to nik.patik@gmail.com.

Thank you

TChan92 commented 6 years ago

@pati-ni I emailed you my rds file a few days ago, let me know if you have any issues with it. Thanks, Tim

pati-ni commented 6 years ago

@TChan92 thank you for your mail. I will test your case as soon as possible.

shabs24 commented 6 years ago

I am getting the same problem while running scmap on my data. Any suggestions or a fix?

pati-ni commented 6 years ago

Hi @TChan92, thanks for your patience. I will look into the dataset today. Probably some zeros or NaN do not play well with some part of the analysis.

pati-ni commented 6 years ago

@TChan92 which version of SC3 are you currently using? Is it from Bioconductor or GitHub?

pati-ni commented 6 years ago

So sorry @shabs24, I thought @TChan92 had replied to this thread. Can I ask which version of SC3 you are using?

pati-ni commented 6 years ago

@shabs24 if you do not get the error with SC3, can you open an issue on scmap instead? However, keep an eye on this issue because it is probably related.

TChan92 commented 6 years ago

I was using the latest version of SC3 from Bioconductor.

shabs24 commented 6 years ago

Sorry @pati-ni! The issue is not related to SC3. When I create an index for clusters using scmap, a lot of genes have a median of 0, and when I try to scale it I get NAs. It affects my downstream analysis. Any suggestions?

wikiselev commented 6 years ago

@shabs24 scmap-cluster should remove genes with all zeros from the index, and if your genes have at least 1 non-zero value, scaling shouldn't produce NaNs, I believe.
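To illustrate the point about scaling, here is a minimal base R sketch. It uses a made-up toy matrix and plain scale(), not scmap's internal code, so it only shows the general behavior: a gene that is constant across cells has zero variance and scales to NaN, while a gene with at least one non-zero value among zeros does not.

```r
# Toy expression matrix: genes in rows, cells in columns (made-up values).
expr <- rbind(
  gene_all_zero = c(0, 0, 0, 0),
  gene_one_hit  = c(0, 0, 0, 3),
  gene_varied   = c(1, 4, 2, 7)
)

# scale() works column-wise, so transpose to z-score each gene across cells.
scaled <- t(scale(t(expr)))

# A zero-variance gene gives 0 / 0, i.e. NaN, for every cell.
has_nan <- apply(scaled, 1, function(x) any(is.nan(x)))
print(has_nan)

# Dropping zero-variance genes before scaling avoids the NaNs.
keep <- apply(expr, 1, var) > 0
scaled_ok <- t(scale(t(expr[keep, , drop = FALSE])))
```

Checking which genes are flagged in has_nan is a quick way to see whether the NAs come from constant genes or from something else in the pipeline.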

shabs24 commented 6 years ago

Thanks @wikiselev! scmap-cluster removes the all-zero genes from the index, but sometimes that leaves very few genes to compare. Projecting the same dataset onto itself works fine, but projecting any other dataset leaves most of it unassigned. I might have to look for an alternative way to do the analysis.

wikiselev commented 6 years ago

@shabs24 you can either use a different feature selection method (not the default scmap one) to have more genes in the index, or reduce the default similarity threshold (the threshold parameter of the scmapCluster function) to something lower than 0.7.
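Both suggestions can be combined in a rough sketch like the one below. The helper name project_with_lower_threshold, the genes argument, the "cell_type1" column name, and the 0.5 threshold are assumptions for illustration only; the scmap calls (setFeatures, indexCluster, scmapCluster) are used as I understand them from the scmap documentation, so please check them against your installed version.

```r
# Hypothetical helper (not part of scmap): project `query` onto `reference`
# using a custom gene set and a lower similarity threshold.
project_with_lower_threshold <- function(reference, query, genes,
                                         cluster_col = "cell_type1",
                                         threshold = 0.5) {
  # Use a user-supplied gene list instead of scmap's default selection.
  reference <- scmap::setFeatures(reference, features = genes)
  # Build the cluster index from the chosen genes.
  reference <- scmap::indexCluster(reference, cluster_col = cluster_col)
  index <- S4Vectors::metadata(reference)$scmap_cluster_index
  # The scmap default threshold is 0.7; cells below it stay unassigned.
  scmap::scmapCluster(
    projection = query,
    index_list = list(reference = index),
    threshold = threshold
  )
}
```

Lowering the threshold trades fewer unassigned cells for less confident assignments, so it is worth inspecting the similarity values before settling on a value.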

azampvd commented 5 years ago

Hi, I'm trying to run SC3 1.12.0 on a large dataset. The result has a lot of cells labeled as NA. I have tried running SC3 multiple times, and the issue remains. Here is the code I used:

sce <- SingleCellExperiment(assays = list(counts = t(mat), logcounts = t(matlog)), colData = ann)
sce <- sc3_prepare(sce)
rowData(sce)$feature_symbol <- rownames(sce)
# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]
sce <- sc3_calc_dists(sce)
sce <- sc3_calc_transfs(sce)
sce <- sc3_kmeans(sce, ks = 3)
col_data <- colData(sce)
sce <- sc3_calc_consens(sce)
col_data <- colData(sce)

The command "sce <- sc3(sce, ks = 3, biology = FALSE, gene_filter = TRUE, rand_seed = 1)" also leaves cells labeled as NA.

wikiselev commented 5 years ago

@azampvd please read the instructions on how SC3 behaves when your dataset is bigger than 5000 cells: https://bioconductor.org/packages/release/bioc/vignettes/SC3/inst/doc/SC3.html#hybrid-svm-approach. You will need to run an additional command, sc3_run_svm, to predict the labels of the NA cells.
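As a rough sketch of that extra step, assuming an sce object prepared as in the code earlier in this thread: the helper name label_all_cells is hypothetical, while sc3 and sc3_run_svm are the SC3 functions described in the vignette linked above. Note that the 3360 + 1640 = 5000 labeled cells in the original report match the size of the default training subset.

```r
# Hypothetical helper (not part of SC3): cluster, then label the cells that
# SC3 left as NA because they were outside the SVM training subset.
label_all_cells <- function(sce, ks, seed = 1) {
  # Above 5000 cells (the default svm_max), sc3() clusters only a random
  # training subset and leaves the remaining cells as NA.
  sce <- SC3::sc3(sce, ks = ks, biology = FALSE, rand_seed = seed)
  # Train an SVM on the clustered cells and predict labels for the rest.
  sce <- SC3::sc3_run_svm(sce, ks = ks)
  sce
}

# Example call (requires SC3 and a prepared sce object):
# sce <- label_all_cells(sce, ks = 2:10)
# table(colData(sce)$sc3_2_clusters, exclude = NULL)
```

The sketch sets biology = FALSE because the vignette runs the biology step separately after the SVM prediction; see the linked section for the exact follow-up commands.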