hbctraining / scRNA-seq_online

https://hbctraining.github.io/scRNA-seq_online/.
516 stars 180 forks source link

Getting DE lesson back on schedule page #39

Closed mistrm82 closed 2 years ago

mistrm82 commented 3 years ago

small issue with new function names

marypiper commented 3 years ago

It appears the function calculateQCMetrics is now defunct and can be replaced as described below. I also wanted to check regarding the code, much of it in the wrangling section is from the referenced tutorial, but not sure if we should make some additional changes - we should discuss. I will run through this code before I add it back to the schedule to make sure it all works again.

Usage calculateQCMetrics(...)

S4 method for signature 'SingleCellExperiment' normalize(object, ...)

centreSizeFactors(...) Arguments object, ... Ignored arguments. Details calculateQCMetrics is succeeded by perCellQCMetrics and perFeatureQCMetrics. normalize is succeeded by logNormCounts. centreSizeFactors has no replacement - the SingleCellExperiment is removing support for mul- tiple size factors, so this function is now trivial.

AlessiaCaramello commented 3 years ago

It appears the function calculateQCMetrics is now defunct and can be replaced as described below. I also wanted to check regarding the code, much of it in the wrangling section is from the referenced tutorial, but not sure if we should make some additional changes - we should discuss. I will run through this code before I add it back to the schedule to make sure it all works again.

Usage calculateQCMetrics(...)

S4 method for signature 'SingleCellExperiment' normalize(object, ...)

centreSizeFactors(...) Arguments object, ... Ignored arguments. Details calculateQCMetrics is succeeded by perCellQCMetrics and perFeatureQCMetrics. normalize is succeeded by logNormCounts. centreSizeFactors has no replacement - the SingleCellExperiment is removing support for mul- tiple size factors, so this function is now trivial.

Hey @marypiper Regarding the above, I tried to use perCellQCMetrics instead of calculateQCMetrics, but I'm stuck. When I run the code in the following step, which is:

sce$is_outlier <- isOutlier( metric = sce$total_features_by_counts, nmads = 2, type = "both", log = TRUE)

I'm getting the following error: Error in log2(metric) : non-numeric argument to mathematical function

When I check sce$total_features_by_counts, it returns NULL.

Do you know how to proceed instead? I'm very new to R, so sorry if this might be a silly question!

Thank you very much for your help!

marypiper commented 3 years ago

Hi @AlessiaCaramello , you don't actually need to run the QC metrics. If you are using your own data, just use the thresholds for QC that you had used to determine the cell type clusters. I have updated most of the code already to not use scater for QC, but I need time that I don't have right now to finish it. I used the following code to filter the data using our previous thresholds (generally I would bring in my filtered raw counts and not perform re-filtering). I plan to have the lesson finished and updated by the end of summer.

## Check the assays present
assays(sce)

## Explore the raw counts for the dataset
dim(counts(sce))

counts(sce)[1:6, 1:6]

## Explore the cellular metadata for the dataset
dim(colData(sce))

head(colData(sce))

# Named vector of cluster names
kids <- purrr::set_names(levels(sce$cluster_id))
kids

# Total number of clusters
nk <- length(kids)
nk

# Named vector of sample names
sids <- purrr::set_names(levels(sce$sample_id))

# Total number of samples 
ns <- length(sids)
ns

# Generate sample level metadata

## Determine the number of cells per sample
table(sce$sample_id)

## Turn named vector into a numeric vector of number of cells per sample
n_cells <- as.numeric(table(sce$sample_id))

## Determine how to reorder the samples (rows) of the metadata to match the order of sample names in sids vector
m <- match(sids, sce$sample_id)

## Create the sample level metadata by combining the reordered metadata with the number of cells corresponding to each sample.
ei <- data.frame(colData(sce)[m, ], 
                 n_cells, row.names = NULL) %>% 
  select(-"cluster_id")
ei

## Keep previously filtered cells
counts <- GetAssayData(object = filtered_seurat, slot = "counts")
rownames(counts) %>% head()
colnames(counts) <- colnames(counts) %>% str_replace_all(pattern = "ctrl_", replacement = "")
colnames(counts) <- colnames(counts) %>% str_replace_all(pattern = "stim_", replacement = "")
length(which(colnames(counts) %in% colnames(sce)))

length(rownames(counts))

filtered_sce <- sce[which(rownames(sce) %in% rownames(counts)), ]
filtered_sce <- filtered_sce[ , which(colnames(filtered_sce) %in% colnames(counts)), ]
dim(filtered_sce)

saveRDS(filtered_sce, "data/filtered_sce.rds")

sce <- filtered_sce
AlessiaCaramello commented 3 years ago

Hi @AlessiaCaramello , you don't actually need to run the QC metrics. If you are using your own data, just use the thresholds for QC that you had used to determine the cell type clusters. I have updated most of the code already to not use scater for QC, but I need time that I don't have right now to finish it. I used the following code to filter the data using our previous thresholds (generally I would bring in my filtered raw counts and not perform re-filtering). I plan to have the lesson finished and updated by the end of summer.

## Check the assays present
assays(sce)

## Explore the raw counts for the dataset
dim(counts(sce))

counts(sce)[1:6, 1:6]

## Explore the cellular metadata for the dataset
dim(colData(sce))

head(colData(sce))

# Named vector of cluster names
kids <- purrr::set_names(levels(sce$cluster_id))
kids

# Total number of clusters
nk <- length(kids)
nk

# Named vector of sample names
sids <- purrr::set_names(levels(sce$sample_id))

# Total number of samples 
ns <- length(sids)
ns

# Generate sample level metadata

## Determine the number of cells per sample
table(sce$sample_id)

## Turn named vector into a numeric vector of number of cells per sample
n_cells <- as.numeric(table(sce$sample_id))

## Determine how to reorder the samples (rows) of the metadata to match the order of sample names in sids vector
m <- match(sids, sce$sample_id)

## Create the sample level metadata by combining the reordered metadata with the number of cells corresponding to each sample.
ei <- data.frame(colData(sce)[m, ], 
                 n_cells, row.names = NULL) %>% 
  select(-"cluster_id")
ei

## Keep previously filtered cells
counts <- GetAssayData(object = filtered_seurat, slot = "counts")
rownames(counts) %>% head()
colnames(counts) <- colnames(counts) %>% str_replace_all(pattern = "ctrl_", replacement = "")
colnames(counts) <- colnames(counts) %>% str_replace_all(pattern = "stim_", replacement = "")
length(which(colnames(counts) %in% colnames(sce)))

length(rownames(counts))

filtered_sce <- sce[which(rownames(sce) %in% rownames(counts)), ]
filtered_sce <- filtered_sce[ , which(colnames(filtered_sce) %in% colnames(counts)), ]
dim(filtered_sce)

saveRDS(filtered_sce, "data/filtered_sce.rds")

sce <- filtered_sce

Hi @marypiper, thanks so much for your quick reply!

Just FYI, we actually solved it using addPerCellQCMetrics() but taking in consideration that sce$total_features_by_counts is being renamed to sce$detected.

# Calculate quality control (QC) metrics
sce <- addPerCellQCMetrics(sce)

# Get cells w/ few/many detected genes
sce$is_outlier <- isOutlier(
  metric = sce$detected,
  nmads = 2, type = "both", log = TRUE)

It's true QC is done before, although I can see the total number of cells is reduced by ~4k cells (from 29k to 25k) after removing outliers at this stage. Is it worth doing this second QC anyway?

marypiper commented 3 years ago

@AlessiaCaramello, thanks for sharing your code! In our practice, we do not generally remove 'outliers' using a function like this. We prefer to explore our data and choose thresholds appropriate for our dataset. That being said, we do often use minimum genes detected thresholds, to remove poor quality cells, but we don't usually use maximum genes detected thresholds, which are sometimes used to identify doublets; we find they do not accurately identify doublets and instead filter out quality cells.

Thanks again for your issue and response!

marypiper commented 2 years ago

I have updated the code to use previously filtered object. There is no longer any requirement for the scater package.