igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data
87 stars 10 forks source link

Error running scSHC even with matrix dimension >3 #23

Closed sg3451 closed 1 month ago

sg3451 commented 1 month ago

I am been running scSHC successfully on a set of clustered scRNAseq data earlier, but keep getting an error in trying to do the same for a very similar dataset. The error seems to be related to matrix dimensions being <3, but I have enough cells in all 6 clusters (the smallest cluster has 320 cells). Code is below.

----------------------------------------------------------------------

data_matrix = scnorm_nocc_oldIL1Bss@assays$SCT@data dim(data_matrix) [1] 14065 4493

---------------------------------------------------------------------------

cluster_ids <- as.character(scnorm_nocc_oldIL1Bss$seurat_clusters) length(cluster_ids)
[1] 4493

---------------------------------------------------------------------------

table(scnorm_nocc_oldIL1Bss$seurat_clusters)

0 1 2 3 4 5 1042 997 905 629 600 320

----------------------------------------------------------------------------------

head(scnorm_nocc_oldIL1Bss) orig.ident nCount_RNA nFeature_RNA percent.mt percent.rb largest_count largest_index largest_gene percent_largest_gene nCount_SCT AAACCCAAGACATAAC-1 oldIL1B 15365 3334 9.547673 18.62675 1182 1880 Ins1 7.692808 17581 AAACCCAAGCTATCTG-1 oldIL1B 15123 3078 9.733518 19.13641 1703 1880 Ins1 11.260993 17623 AAACCCAAGTCTTCGA-1 oldIL1B 21731 4516 9.852285 16.38213 476 15144 Mt-atp6 2.190419 19071 AAACCCAAGTTGCTGT-1 oldIL1B 34572 5361 9.467199 18.20838 1486 1880 Ins1 4.298276 19305 AAACCCACACCCTCTA-1 oldIL1B 38000 5235 9.607895 17.55263 3086 1880 Ins1 8.121053 19066 AAACCCAGTATGGGAC-1 oldIL1B 34199 4833 9.330682 19.49472 2226 1880 Ins1 6.508962 19132 AAACCCAGTCACCGAC-1 oldIL1B 15442 3364 9.137417 21.41562 433 1513 AC134224.3 2.804041 17542 AAACCCAGTCATAGTC-1 oldIL1B 14258 3202 9.131716 23.18698 240 15144 Mt-atp6 1.683266 17402 AAACCCAGTTAGCGGA-1 oldIL1B 11104 2829 9.393012 21.93804 228 15142 Mt-co2 2.053314 17230 AAACCCATCAACTCTT-1 oldIL1B 22816 3734 8.419530 15.51543 4161 1880 Ins1 18.237202 18878 nFeature_SCT S.Score G2M.Score Phase pANN_0.25_0.11_187 DF.classifications_0.25_0.11_187 SCT_snn_res.0.4 seurat_clusters AAACCCAAGACATAAC-1 3297 -0.202145524 -0.4915970 G1 0.2755102 Singlet 0 0 AAACCCAAGCTATCTG-1 3049 -0.129014521 -0.4397100 G1 0.2725948 Singlet 0 0 AAACCCAAGTCTTCGA-1 4460 0.002762144 0.4801585 G2M 0.3309038 Singlet 4 4 AAACCCAAGTTGCTGT-1 4767 0.128803299 -0.1084374 S 0.4591837 Singlet 0 0 AAACCCACACCCTCTA-1 4228 -0.246844981 -0.4229193 G1 0.3600583 Singlet 0 0 AAACCCAGTATGGGAC-1 4209 0.041980890 0.2243246 G2M 0.4373178 Singlet 2 2 AAACCCAGTCACCGAC-1 3320 0.260306369 -0.2693131 S 0.3148688 Singlet 2 2 AAACCCAGTCATAGTC-1 3163 -0.144225057 -0.4338696 G1 0.1734694 Singlet 1 1 AAACCCAGTTAGCGGA-1 2881 -0.146266415 0.3050540 G2M 0.3556851 Singlet 5 5 AAACCCATCAACTCTT-1 3678 -0.165976851 0.4794807 G2M 0.4183673 Singlet 5 5

----------------------------------------------------------------------------------------------------------

new_clusters = testClusters(scnorm_nocc_oldIL1Bss@assays$SCT@data,

  • cluster_ids = as.character(scnorm_nocc_oldIL1Bss$seurat_clusters),
  • alpha = 0.05, #control of FWER, can be relaxed if needed
  • num_features = 2500, #default number of genes used
  • num_PCs = 30, #default number of PCs
  • parallel = TRUE, #can set to FALSE if desired
  • cores = 1) Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix", : dimension of 'A' must be at least 3
igrabski commented 1 month ago

Hi Sujay, it looks like you are running it on the data slot of your Seurat object, but I believe this is typically log-normalized data -- instead, you should put the raw counts as input. Please let me know if that doesn't resolve your error!

sg3451 commented 1 month ago

Thank you very much for the quick response. You are right - I am indeed using the log-normalized data from the 'data' slot. But a few questions still remain:

  1. I used the data slot for running scSHC with 5 other datasets without any issues this past week.
  2. The min and max of the log normalized counts for the current dataset are 0 and 7.2, respectively, so the matrix is non-negative.
  3. I am also unsure about using the 'count' slot because (a) it is not normalized data, and (b) it was not used for the external clustering tool that I used here.

I did a rowSums check and there are rows (genes) with all zeros in this dataset. Assuming Could these be messing up the calculations?

igrabski commented 1 month ago

Interesting, in that case you might be right that the genes with all 0s are causing issues -- try removing those and see if the issue is fixed? If that, and using the raw counts, doesn't address the problem, you are welcome to share your data with me at igrabski[at]nygenome[dot]org (if it is possible to share) and I am happy to take a look to debug.

As far as the input data, however, I definitely do recommend using the raw counts. The statistical model is intended to consider raw (completely unnormalized) counts and has its own considerations for handling variation in sequencing depth, etc. Although I have not tested this, you could also consider using specifically the counts slot of the SCT assay, which I believe represents depth-corrected counts. There will always be at least some level of difference between this approach and how external clustering methods identify clusters, but in practice testClusters should still yield reasonable results.

sg3451 commented 1 month ago

Okay, tried both ways today. Replacing @data with @counts worked! Removing the genes with all zeros but keeping @data did not work (gave the same error). So it is not a function of the genes being all zeros. Thank you for your recommendation to use the count slot - I think that is very important especially if the statistical tool is configured to model the counts.

igrabski commented 1 month ago

Great, I am glad to hear it works and happy to help!

sg3451 commented 1 month ago

Yes, I am very glad that I ran into this problem in the first place! Otherwise, I would have kept using 'data' instead of 'counts'! As an update, when I re-ran the previous analysis now with 'counts', for 2 studies SHC agreed with Seurat-based clusters for 4 studies but not for 2 other studies . Originally, however, SHC had agreed with the clustering of these 2 studies as well (attesting to the problems associated with using the 'data' slot).

igrabski commented 1 month ago

Thanks for the update and definitely glad to hear!!