lzygenomics / SONAR

SONAR is the algorithm of cell-type deconvolution for spatial transcriptomics
GNU General Public License v3.0
5 stars 2 forks source link

NaN issue in preclustering step #4

Open Aditya-Venkat opened 1 week ago

Aditya-Venkat commented 1 week ago

Dear author,

Thanks for resolving the previous issue we raised. We were trying to apply SONAR on another dataset and ran into an error while executing the R part of the code while running SONAR.preprocess(). Please find attached the error snapshot:

If you could help us figure out how to resolve this issue that would be greatly appreciated.

unnamed

lzygenomics commented 1 week ago

Hi there,

Thank you for using our method! It seems that the issue might be due to the input spatial count matrix containing cells with zero total counts or genes with zero total counts across all cells, which may not meet the requirements of SCTransform. Additionally, there might be NA, NaN, or infinite values present in your spatial data. We kindly suggest checking for any cells or genes with empty counts and filtering them out, as well as ensuring there are no NA, NaN, or infinite values before proceeding.

We hope this helps, and please feel free to reach out if you have any further questions!

Aditya-Venkat commented 1 week ago

Hey,

Thanks a lot for the prompt response. We are applying the following filtration steps before passing the data :

ref <- t(SC$X) row_ind_SC <- which(rowSums(ref) > 0) col_ind_SC <- which(colSums(ref) > 0) ref <- ref[row_ind_SC,] ref <- ref[,col_ind_SC]

annotations

cell_type <- SC$obs[col_ind_SC,]$cells cell_type <- factor(cell_type) CellID <- factor(colnames(ref)) cluster<- data.frame(cellname = CellID, celltype= cell_type) typeanno <- cluster$celltype names(typeanno) <- cluster$cellname typeanno <- as.factor(typeanno)

spots <-as.sparse(t(ST$X)) class(spots) <- StripAttr(class(spots)) spots <- round(spots)

row_ind <- which(rowSums(spots) > 0) col_ind <- which(colSums(spots) > 0) spots <- spots[row_ind,] spots <- spots[, col_ind]

pos <- ST$obsm$spatial[col_ind,]

A check for nan or inf values post-filtration also returns the absence of such entries in the data. Could you please let us know how to proceed from here ?

lzygenomics commented 1 week ago

Hi there,

Thank you for sharing the details! The operations you’ve conducted seem to be correct. However, to help us identify the issue more precisely, could you please provide the relevant spatial object details, including spots, nUMI_spot, and coords used the following command: SONAR.preprocess(sc_count = ref, sc_cell_type = typeanno, sc_nUMI = nUMI, sp_coords = coords, sp_count = spots, sp_nUMI = nUMI_spot, cores = 8) Alternatively, if possible, you could share a subset of a demo dataset that represents the problem. This would greatly help us in troubleshooting and providing more targeted assistance.

Thank you for your cooperation!

lzygenomics commented 6 days ago

I noticed that in your code you used spots <- round(spots). Just to share, our algorithm works with spatial data in the form of raw UMI counts, which are integers, so rounding isn't usually needed. Additionally, the sctransform normalization step is designed to be applied to non-normalized data. I hope this clarification is helpful~

Aditya-Venkat commented 6 days ago

Hey,

Below is a sample dataset we are facing the issue on.

ST data : ST.zip SC data is too big to attach but it can be found on this link : https://cells.ucsc.edu/?ds=lung-dev

Thanks for the clarifications, I have removed the rounding step. We aren't applying any normalization before passing it into SONAR.preprocess.

Thanks you !

lzygenomics commented 6 days ago

Thank you very much for providing the demo. If it's not too much trouble, could you share the code from the loading this h5ad file to the point where the bug occurs? It would be really helpful for us in identifying the issue.

Aditya-Venkat commented 5 days ago

This is the code we were using.

options(matlab.path="/home/ajita/Desktop") library(anndata) library(SONAR)

library(DescTools) library(scater) library(Matrix) library(data.table) library(Seurat) library(matlabr) library(R.matlab)

set.seed(101) code_path <- "/data/ajita/Spatial/benchmarking_spatial_deconvolution/SONAR/core-code/"

SC <- read_h5ad("/data/ajita/Spatial/Datasets/Spatial_Deconvolution/Developmental_Lung/SC/SC_new_annotation_50000.h5ad") ref <- t(SC$X) row_ind_SC <- which(rowSums(ref) > 0) col_ind_SC <- which(colSums(ref) > 0) ref <- ref[row_ind_SC,] ref <- ref[,col_ind_SC]

cell_type <- SC$obs$cell_type_newL2[col_ind_SC] cell_type <- factor(cell_type) CellID <- factor(colnames(ref)) cluster<- data.frame(cellname = CellID, celltype= cell_type) typeanno <- cluster$celltype names(typeanno) <- cluster$cellname typeanno <- as.factor(typeanno)

st_path <- "/data/ajita/Spatial/Datasets/Spatial_Deconvolution/Developmental_Lung/ST/ST.h5ad" ST <- read_h5ad(st_path) ST$var_names_make_unique() spots <-as.sparse(t(ST$X)) class(spots) <- StripAttr(class(spots))

row_ind <- which(rowSums(spots) > 0) col_ind <- which(colSums(spots) > 0) spots <- spots[row_ind,] spots <- spots[, col_ind]

pos <- ST$obsm$spatial[col_ind,]

colnames(pos) <- c('x','y') coords <- as.data.frame(pos) row.names(coords)<-colnames(spots)

get the overlap genes

overlap_gene <- intersect(rownames(spots), rownames(ref)) ref <- ref[overlap_gene,]

spots <- spots[overlap_gene,]

nUMI <- colSums(ref) names(nUMI) <- colnames(ref) nUMI_spot <- colSums(spots) names(nUMI_spot) <- colnames(spots)

class(ref) <- StripAttr(class(ref)) class(spots) <- StripAttr(class(spots)) processed_input<-SONAR.preprocess(sc_count=ref,sc_cell_type=typeanno,sc_nUMI=nUMI,sp_coords=coords,sp_count=spots,sp_nUMI=nUMI_spot,cores=8,type_min_cell = 0 , spot_min_UMI = 0)

Upon a bit of experimentation we have realized the method runs if we set spot_min_UMI to be it's default value of 100 rather than 0 as we did here, but we are trying to see if there's a way to get the output without any spot-loss.

Is there a way to achieve this ?

lzygenomics commented 5 days ago

Hey, Thank you so much for your detailed response, I’m sorry to say that there might not be a perfect solution here. Spots with extremely low nUMI counts often have sequencing quality issues and tend to contain a lot of noise, making analysis results based on these spots quite challenging. In most analyses, this kind of data is usually filtered out, especially since regression tasks require relatively high data quality. We chose the threshold of 100 based on similar algorithms, and typically, most spots can pass this threshold. I hope this helps clarify things a bit~