igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data
87 stars 10 forks source link

scSHC gives different errors with and without batch. Example data runs fine #6

Closed mcsimenc closed 1 year ago

mcsimenc commented 1 year ago

Greetings!

I have been unable to run scSHC, getting different errors depending on whether I use the batch option. I have a Seurat analysis of snRNA-seq data. scSHC worked using the example sparse matrix counts from counts.rda.

The scSHC documentation says data should be a raw counts matrix or Matrix. I am trying normalized data but also tried using the matrix with values rounded up and saw the same error.

# Seurat analysis, three replicates, SCT normalization, Harmony batch correction
> so = readRDS("SeuratObject.rds")

# The batches
> head(so@meta.data$orig.ident)
[1] "Sb_root_nuclei_rep1" "Sb_root_nuclei_rep1" "Sb_root_nuclei_rep1" "Sb_root_nuclei_rep1" "Sb_root_nuclei_rep1" "Sb_root_nuclei_rep1"

# Data matrix
> soDat = GetAssayData(so, assay = "SCT")

# I tried as.matrix and Matrix::as.matrix too
> class(soDat)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

> dim(soDat)
[1] 19620 11146

# First few rows and columns of matrix
> soDat[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
                      root_nuc_rep1_AAACCCAAGATCCCAT-1 root_nuc_rep1_AAACCCAAGGTAGGCT-1 root_nuc_rep1_AAACCCAAGTGCCAGA-1 root_nuc_rep1_AAACCCACAAGTGATA-1
Sobic.001G000200.v3.2                        .                                        .                        0.6931472                                .
Sobic.001G000400.v3.2                        .                                        .                        .                                        .
Sobic.001G000700.v3.2                        .                                        .                        .                                        .
Sobic.001G000800.v3.2                        0.6931472                                .                        .                                        .
Sobic.001G001200.v3.2                        .                                        .                        .                                        .
                      root_nuc_rep1_AAACCCAGTACACGTT-1
Sobic.001G000200.v3.2                                .
Sobic.001G000400.v3.2                                .
Sobic.001G000700.v3.2                                .
Sobic.001G000800.v3.2                                .
Sobic.001G001200.v3.2                                .

# Call using batch
> soDatClustered = scSHC(soDat, batch = so@meta.data$orig.ident, alpha = 0.05, num_features = 2500, num_PCs = 30, parallel = T, cores = 14)
Error in do.call(.Call, args = dot_call_args) : 
  TridiagEigen: eigen decomposition failed

# Call not using batch
> soDatClustered = scSHC(soDat, alpha = 0.05, num_features = 2500, num_PCs = 30, parallel = T, cores = 14)
Error in eigs_real_sym(A, nrow(A), k, which, sigma, opts, mattype = "sym_matrix",  : 
  dimension of 'A' must be at least 3

Any idea what might be causing this?

Thanks!

igrabski commented 1 year ago

In the example you gave, the error is because scSHC expects a matrix of non-negative integer counts. In principle, rounding the normalized data shouldn't cause an error in the code (one quick check would be to make sure that after rounding, you truly do have all non-negative integer counts), but I wouldn't recommend this strategy regardless because our approach is intended to operate on the raw counts without any kind of normalization, integration, or other processing. Instead, in this case, you should just use the raw counts (e.g. in the Seurat object, this should be slot = counts).

mcsimenc commented 1 year ago

Thank you for clarifying @igrabski. Does your model account for differences in sequencing depth among barcodes? This, as well as multiplet abundance, affects results from other clustering procedures.

igrabski commented 1 year ago

Thanks for your question -- yes, our procedure accounts for sequencing depth, which is why it is not necessary to normalize beforehand.

mcsimenc commented 1 year ago

That's great. What are your thoughts on multiplet inclusion? It seems possible to me that any multiplets present could affect which genes are chosen as highly variable. Is there a way to have scSHC use specific genes?

igrabski commented 1 year ago

Yes, that's true, multiplet inclusion could skew your results. Currently, we use an automated gene selection procedure, but I would recommend removing columns correspond to suspected multiplets (as determined by any procedure) prior to running our approach.