samuel-marsh / scCustomize

R package with collection of functions created and/or curated to aid in the visualization and analysis of single-cell data using R.
https://samuel-marsh.github.io/scCustomize/
GNU General Public License v3.0
220 stars 23 forks source link

Access underlying matrix used in Clustered_DotPlot to remove NA/NaN/Inf #178

Open nehawali21 opened 7 months ago

nehawali21 commented 7 months ago

Thank you so much for creating this package; it's been revolutionary for spicing up the standard Seurat figures! As a novice bioinformatician, I'm afraid I've run into a small issue that I hope I could kindly get some help with.

I'm trying to create a clustered dotplot of about 200 genes' expression across identities after Seurat analysis of a small published scRNAseq dataset.

However, inputting my genes of interest yields the following error, which I believe is due to scaling introducing issues in the matrix used to make the dotplot:

Error in hclust(get_dist(t(submat), distance), method = method) : 
  NA/NaN/Inf in foreign function call (arg 10)

Below are the parameters I'm using in the Clustered_DotPlot function:

Clustered_DotPlot(seurat, 
                  features = features,
                  exp_value_type = "scaled",
                  plot_padding = c(5,5,5,5), 
                  cluster_feature = TRUE,
                  cluster_ident = TRUE,
                  raster = FALSE,
                  seed = 1212,
                  x_lab_rotate = 90, 
                  row_label_size = 10,
                  column_label_size = 10,
                  legend_label_size = 10,
                  legend_title_size = 10,
                  plot_km_elbow = FALSE, 
                  flip = TRUE)

I'm not sure how to access the underlying matrix the Clustered_DotPlot function uses so that I could examine and exclude genes causing the issue. I'd be most grateful on what I can do to generate a graph.

Also, can you kindly clarify which direction the scaling occurs in, i.e. by row (originally a gene's expression across clusters) or by column (originally identities)? That is, will the flip parameter retain the original scaling (just now transposed numbers) as it only flips axes at the end to generate the plot?

samuel-marsh commented 7 months ago

Hi,

Does the underlying raw, normalize, or scale data have any NAs present? Scaling occurs within gene across identity used. flip parameter is simply visualization change.

Best, Sam

nehawali21 commented 7 months ago

I've tried Seurat::GetAssayData on the Seurat object to extract the counts, data, and scale.data matrices for my genes of interest. However, I'm not getting it to subset to my features properly and I'm only getting the larger sparse matrices for all features. Nevertheless, I imagine NAs are being introduced in scaling for these genes in Clustered_DotPlot, as I can plot these genes with Seurat::DotPlot and scale = TRUE. Is Clustered_DotPlot not scaling the same way as DotPlot perhaps?

Hence, I'm wondering how one could extract the underlying data used in Clustered_DotPlot to see which genes are playing spoilsport and remove them from the query.

samuel-marsh commented 7 months ago

So the weird thing is that I do have an NA check in the function so not sure how this is slipping past this. Also for unsplit ClusteredDotPlots the data actually gets pulled directly from Seurat's DotPlot function.

Can you run the following as a test and post the output?

data <- FetchData(OBJ_NAME, vars = GENE_LIST)
colnames(data)[ apply(data, 2, anyNA) ]

Thanks! Sam

samuel-marsh commented 7 months ago

Also can you confirm what version of scCustomize you have installed?

nehawali21 commented 7 months ago

Sure thing. I have scCustomize version 2.1.2 and Seurat version 5.0.3.

After putting in the suggested code:

data <- FetchData(seurat, vars = features)
colnames(data)[ apply(data, 2, anyNA) ]

this is the only output:

character(0)
samuel-marsh commented 7 months ago

Ok working through issues here.

did you scale the data yourself or was it scaled in the downloaded object? Does scaling it yourself fix issue?

Best, Sam

nehawali21 commented 7 months ago

Thank you so much for kindly continuing to look into this.

The object had raw reads so scaling is coming from me with the Seurat workflow.

samuel-marsh commented 7 months ago

What happens if you run the FetchData code block but specify the scale data

data <- FetchData(seurat, vars = features, layer = "scale.data")
colnames(data)[ apply(data, 2, anyNA) ]
nehawali21 commented 7 months ago

I get character(0) at the end of this too.

I also get warning messages that genes can't be found when using scale.data, probably because they aren't all within the 4000 HVGs I had specified for Seurat::FindVariableFeatures. Still, Seurat::DotPlot and scale = TRUE can generate a plot, so I'm not sure what's going on.

samuel-marsh commented 7 months ago

Ok we'll figure it out lol.

Can you post the full code you ran after downloading the object? I'll try and recreate on my end.

Best, Sam

nehawali21 commented 7 months ago

Sure. The things I read in were downloaded from these publicly available sources: raw_reads.rds and umap.rds are from https://github.com/ScialdoneLab/human-gastrula-shiny E-MTAB-9388.sdrf.txt is from https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9388

I've just put a generic [FILEPATH] here as it would need to change, and I've condensed the code for ease, including a smaller list of features which still yields issues.

# packages
if (!require("pacman", quietly = TRUE)) install.packages("pacman"); library(pacman)
pacman::p_load(Seurat, tidyverse, scCustomize, janitor)
pacman::p_load_gh("satijalab/seurat-wrappers")

# read in things
raw_reads_gastrula <- readRDS("[FILEPATH]\\raw_reads.rds") %>% t() %>% as.data.frame() 
umap_metadata_gastrula <- readRDS("[FILEPATH]\\umap.rds") 
arrayexpress_metadata_gastrula <- read.delim(file = "[FILEPATH]\\E-MTAB-9388.sdrf.txt") %>% clean_names() %>% dplyr::select(-c(scan_name, comment_submitted_file_name, comment_fastq_uri)) %>% distinct()

# merge metadata tables into one to use for seurat object, first clean up cell names in one to match
# ensure columns are equivalent so that either one can be used as cell name for merging
table(arrayexpress_metadata_gastrula$source_name == arrayexpress_metadata_gastrula$factor_value_single_cell_identifier) # should be all 1195 TRUE

# duplicate cell names and as new col and change naming to match cell naming in umap metadata
arrayexpress_metadata_gastrula <- arrayexpress_metadata_gastrula %>% mutate(cell_name_merge = factor_value_single_cell_identifier) %>% mutate(cell_name_merge = gsub("_", "\\.", cell_name_merge))

# merge arrayexpress and umap metadata into one df for seurat object
full_metadata_gastrula <- full_join(umap_metadata_gastrula, arrayexpress_metadata_gastrula, by = c("cell_name" = "cell_name_merge"))

# create seurat object for next analyses
seurat_gastrula <- CreateSeuratObject(counts = raw_reads_gastrula, meta.data = full_metadata_gastrula, project = "gastrula", min.cells = 0, min.features = 0) 

# set cell identities to already annotated clusters info in metadata
Idents(seurat_gastrula) <- "cluster_id"

# don't need to subset based on QC because cells already processed before, just go to normalize
# QC - mt genes, etc.
seurat_gastrula[["percent.mt"]] <- PercentageFeatureSet(seurat_gastrula, pattern = "^MT\\.|^MTATP|^MTCO|^MTCYBP") 

# seurat_gastrula <- subset(seurat_gastrula, 
#                         subset = nFeature_RNA > 2000 & 
#                           percent.mt < 2 
#                         ) 

# normalize
set.seed(1212)
seurat_gastrula <- NormalizeData(seurat_gastrula, normalization.method = "LogNormalize", scale.factor = 10000)

# find variable features
set.seed(1212)
seurat_gastrula <- FindVariableFeatures(seurat_gastrula, selection.method = "vst", nfeatures = 4000) 

# scale data
set.seed(1212)
seurat_gastrula <- ScaleData(seurat_gastrula)

# PCA
set.seed(1212)
seurat_gastrula <- RunPCA(seurat_gastrula, npcs = 50)

# cell neighbors and clusters
set.seed(1212)
seurat_gastrula <- FindNeighbors(seurat_gastrula, dims = 1:30, k.param = 50) 

set.seed(1212)
seurat_gastrula <- FindClusters(seurat_gastrula, dims.use = 1:30, k.param = 50, resolution = 0.75, algorithm = 1) 

# UMAP
set.seed(1212)
seurat_gastrula <- RunUMAP(seurat_gastrula, reduction = "pca", dims = 1:30) 

# set cell identities to already annotated clusters info in metadata
Idents(seurat_gastrula) <- "cluster_id"

# create features list
features <- c("SCN2A" ,  "PIDD1"   ,"RBM38"  , "FDXR"  ,  "DCP1B",   "EPHA2",   "AK3"  ,   "RNASE7",  "ZNF385A", "CPEB2")

And here is the Seurat::DotPlot code which generates a scaled plot:

set.seed(1212)
DotPlot(seurat_gastrula, 
        cluster.idents = TRUE, 
        features = features) + 
  scale_colour_gradient2(low = "red",
                         mid = "lightgrey", 
                         high = "blue") +
  xlab("Gene") +
  ylab("Cluster") +
  scale_y_discrete(limits = rev) +
  theme(axis.text = element_text(color = "black",
                                 size = 14),
        axis.text.x = element_text(hjust = 1,
                                   angle = 45,
                                   face = "italic"),
        axis.title = element_text(color = "black",
                                  face = "bold",
                                  size = 14),
        axis.title.x = element_text(margin = margin(10,0,0,0,"mm")),
        axis.title.y = element_text(margin = margin(0,10,0,0,"mm")),
        legend.title = element_text(color = "black",
                                    face = "bold",
                                    size = 14),
        legend.text = element_text(color = "black",
                                   size = 14))

The Clustered_DotPlot code is the same as before (just with seurat_gastrula as the Seurat object rather than generic seurat).

samuel-marsh commented 7 months ago

Ok, so I figured out the issue and will work on scCustomize'd based solution but for now there is easy solution that you can do on your end.

The reason for the error is because one of the features you are trying to plot has no expression in the object ("RNASE7"). Because you created the object with min.cells = 0 the object contains all features including genes that have no expression. So for now you can just change that to min.cells = 1 which will only include genes that are found in at least one cell. Then plotting has no issues.

Best, Sam

nehawali21 commented 7 months ago

I see, thank you so much for kindly figuring this out! I can make the Clustered_DotPlot now.

Also, is it possible to move the legends up to the top right and make them 1 column, or remove the identity color annotations on the heatmap and in the legend, in Clustered_DotPlot?