immunogenomics / symphony

Efficient and precise single-cell reference atlas mapping with Symphony
GNU General Public License v3.0

Labels in Memory T Cell Atlas missing #14

Closed mihem closed 2 years ago

mihem commented 2 years ago

Thanks for this great package that is really helpful in single cell analysis.

I struggle with the memory T cell atlas reference, though. Following the tutorial and downloading the annotations from GEO results in 260 missing labels.

library(readr)

ref_tbru <- readRDS("tbru_ref.rds")
ref_tbru$save_uwot_path <- "tbru_uwot_model"
ref_tbru$meta_data <- read_delim("GSE158769_meta_data.txt")

sum(is.na(ref_tbru$meta_data$cluster_ids))
sum(is.na(ref_tbru$meta_data$cluster_name))

> [1] 260
> [1] 260

The missing labels alone wouldn't be a problem in such a huge dataset, I think, but after mapQuery, knnPredict.Seurat (I use a Seurat query object) fails with

Error in class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"), : no missing values are allowed

Traceback:

3: stop("no missing values are allowed")
2: class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"),
       ref_obj$meta_data[[label_transfer]], k = k) at utils_seurat_symphony.R#452
1: knnPredict.Seurat(tcells, ref_tbru, "cluster_name")

@joycekang Can you maybe provide all labels (same lab, right? ;)) or suggest a workaround (e.g. filter out the cells with missing labels)?

Thank you, Mischko

joycekang commented 2 years ago

Hi @mihem,

Thanks for the question:

The reference object currently published on Zenodo was built before the publication of the Nathan et al. dataset. Therefore, it does not contain cell type labels or protein expression. These can now be obtained from GEO (GSE158769): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158769

You can update the reference metadata directly, such as ref$meta_data = new_meta_data. Let me know if that helps!
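Filled out as a minimal sketch (I'm assuming here that the metadata rows come in the same cell order as the reference embedding, so it's worth checking the dimensions before swapping):

```r
library(readr)

ref <- readRDS("tbru_ref.rds")
new_meta_data <- read_delim("GSE158769_meta_data.txt")

# The metadata must line up with the columns of the reference embedding;
# verify the counts match before replacing.
stopifnot(nrow(new_meta_data) == ncol(ref$Z_corr))
ref$meta_data <- new_meta_data
```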

mihem commented 2 years ago

Hi Joyce,

Sorry, I think you misunderstood me. What you said is already well explained in your vignette: I downloaded the reference object from Zenodo and the metadata from GEO and added them exactly as you described. However, there are 260 missing labels, so the knnPredict function fails.

So my question is not "where do I get the metadata?" but "could you please provide the labels without missing values, or help me subset the reference object so that it removes the few cells with missing values?"

Thank you

joycekang commented 2 years ago

Ah, apologies - I read it too fast. I think removing the NA cells is a good way to go.

To do this, here is some code (haven't tried running it myself but this should work):

idx_na = which(is.na(ref_tbru$meta_data$cluster_ids))
ref_tbru$meta_data = ref_tbru$meta_data[-idx_na, ]
ref_tbru$Z_corr = ref_tbru$Z_corr[, -idx_na]
ref_tbru$Z_orig = ref_tbru$Z_orig[, -idx_na]  # not essential
ref_tbru$R = ref_tbru$R[, -idx_na]

Can you give that a try? If that doesn't work I can take another look tonight. Thanks!

mihem commented 2 years ago

Hi Joyce,

that was an easy solution and worked well, thanks!

The class::knn part of the knnPredict function is a performance bottleneck, though: it took me ~5 minutes to run on this large dataset, whereas mapQuery only takes a few seconds. In general, Symphony/Harmony is super fast compared to Seurat's integration functions, so it would be nice if this step were equally fast.

joycekang commented 2 years ago

hi @mihem,

Glad that worked. For the k-NN function, I agree that it can be suboptimally slow for large references. What I will say is that Symphony embeddings are agnostic to the downstream inference function. That is, you can train any sort of model (multinomial logistic regression using the glmnet package, SVM, etc.) using the harmonized PCs of the reference cells as input and the cell type labels as output. In our paper, we found that k-NN (marginally) outperformed the other methods we tried, but those other methods would be a lot faster than k-NN once the models are trained.

As an example, multinomial LR would look something like this:

mod.glmnet <- glmnet::cv.glmnet(
    x = t(reference$Z_corr),
    y = as.factor(reference$meta_data$cell_type),
    family = "multinomial",
    alpha = 0,
    type.multinomial = "grouped"
)

predictions = mod.glmnet %>% predict(t(query$Z), type = 'class', s = 'lambda.min')

@ilyakorsunsky has implemented a slightly more sophisticated version of this for the Fibroblast Atlas from our lab, where each soft cluster learns its own model; for a given query cell, it then incorporates predictions from all soft clusters.
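A rough sketch of that idea, hedged heavily: this is a hypothetical reconstruction from the description above, not Ilya's actual implementation. It assumes `reference$R` holds soft cluster assignments (clusters x cells) and that `query$R` from mapQuery holds the query cells' soft memberships.

```r
library(glmnet)

K <- nrow(reference$R)  # number of soft clusters

# Fit one multinomial model per soft cluster, weighting
# reference cells by their membership in that cluster.
cluster_models <- lapply(seq_len(K), function(k) {
  glmnet::cv.glmnet(
    x = t(reference$Z_corr),
    y = as.factor(reference$meta_data$cell_type),
    weights = reference$R[k, ],
    family = "multinomial", alpha = 0, type.multinomial = "grouped"
  )
})

# For each query cell, average the per-cluster class probabilities,
# weighted by that cell's soft memberships.
probs <- Reduce(`+`, lapply(seq_len(K), function(k) {
  p <- predict(cluster_models[[k]], t(query$Z),
               type = "response", s = "lambda.min")[, , 1]
  p * query$R[k, ]  # scale each query cell's row by its membership
}))
predicted <- colnames(probs)[max.col(probs)]
```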

mihem commented 2 years ago

Thanks for the insights. Okay, but training the multinomial LR also took extremely long, and that's part of the pipeline too. I mean, it's not super important, but isn't there a faster C++-implemented k-NN library that you could use?

joycekang commented 2 years ago

Fair point! I have it on the to-do list. Do you happen to have a suggestion for a faster C++-implemented library?

mihem commented 2 years ago

Cool. Sorry, no real expertise on that. Some googling led me to:

- https://github.com/eddelbuettel/rcppannoy, a well-maintained package used by Seurat, based on Spotify's Python Annoy package
- https://github.com/jlmelville/rcpphnsw, also well maintained

Again, no expert, but I guess if those have similar accuracy in your benchmarking, they are a great choice because they should lead to an enormous (>100x?) speed increase.
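For illustration, here is what approximate-neighbor label transfer with RcppAnnoy could look like, just a sketch of the idea, not anything Symphony does today; the object and column names follow the snippets above, and k and the tree count are arbitrary:

```r
library(RcppAnnoy)

ref_emb <- t(ref_tbru$Z_corr)               # reference cells x harmonized PCs
query_emb <- Embeddings(tcells, "harmony")  # query cells in the same space
labels <- ref_tbru$meta_data$cluster_name
k <- 30

# Build an Annoy index over the reference embedding (items are 0-based).
index <- new(AnnoyEuclidean, ncol(ref_emb))
for (i in seq_len(nrow(ref_emb))) index$addItem(i - 1, ref_emb[i, ])
index$build(50)  # more trees = better accuracy, slower build

# For each query cell, majority vote over its k approximate neighbors.
predicted <- apply(query_emb, 1, function(v) {
  nn <- index$getNNsByVector(v, k) + 1  # back to 1-based indices
  names(which.max(table(labels[nn])))
})
```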