Closed mihem closed 2 years ago
Hi @mihem,
Thanks for the question:
The reference object currently published on Zenodo was built before the publication of the Nathan et al. dataset. Therefore, it does not contain cell type labels or protein expression. These can now be obtained from GEO (GSE158769): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158769
You can update the reference metadata directly, such as ref$meta_data = new_meta_data. Let me know if that helps!
Hi Joyce,
Sorry I think you misunderstood me. What you said is well explained in your vignette. I downloaded the ref object from zenodo and the meta data from geo and added them exactly as you explained. However there are several missing labels (260). Therefore the knn.Predict
function fails.
So my question is not: "where to get the meta data"... but "could you please provide the labels without missing values or help me subsetting the ref object so that it removes the few cells with missing values".
Thank you
Ah, apologies - I read it too fast. I think removing the NA cells is a good way to go.
To do this, here is some code (haven't tried running it myself but this should work):
idx_na = which(is.na(ref_tbru$meta_data$cluster_ids)) ref_tbru$meta_data = ref_tbru$meta_data[-idx_na, ] ref_tbru$Z_corr = ref_tbru$Z_corr[,-idx_na] ref_tbru$Z_orig = ref_tbru$Z_orig[,-idx_na] # not essential ref_tbru$R = ref_tbru$R[,-idx_na]
Can you give that a try? If that doesn't work I can take another look tonight. Thanks!
Hi Joyce,
that was an easy solution and worked well, thanks!
the class::knn
part of the knnPredict
function is a performance bottleneck though, took me ~5 min to run on this large dataset, in comparison mapQuery
only takes a few second. In general symphony/harmony is super fast compared to the integration function of Seurat so it would be nice if this function would be equally fast I think.
hi @mihem,
Glad that worked. For the k-NN function, I agree that it can be suboptimally slow for large reference. What I will say is that Symphony embeddings are agnostic to the downstream inference function. That is, you can train any sort of model (multinomial logistic regression using the glmnet
package, SVM, etc) using the Harmonized PCs from the reference cells as input, and cell type labels as output. In our paper, we found that k-NN (marginally) outperformed the other methods we tried, but these other methods would be a lot faster than k-NN once the models are trained.
As example, multinomial LR would be something like this:
mod.glmnet <- glmnet::cv.glmnet( x = t(reference$Z_corr), y = as.factor(reference$meta_data$cell_type), family = "multinomial", alpha = 0, type.multinomial = "grouped" )
predictions = mod.glmnet %>% predict(t(query$Z), type = 'class', s = 'lambda.min')
@ilyakorsunsky has implemented a slightly more sophisticated version of this for the Fibroblast Atlas from our lab, where each soft cluster learns it's own model, then for a given query cell it incorporates predictions from all soft clusters.
Thanks for the insights. Okay, but training the multinomial LR took extremely long and that's also part of the pipeline. I mean it's not super important, but isn't there a faster C++ implemented knn library that you can use?
Fair point! I have it on the todo list - do you happen to have a suggestion for a faster C++ implemented library?
Cool. Sorry, no real expertise on that. Some googing led me to: https://github.com/eddelbuettel/rcppannoy which is a well-maintained package and is used by Seurat, based on the python Annoy package from Spotify https://github.com/jlmelville/rcpphnsw also well maintained Again no expert, but I guess if those have a similar accuracy in your benchmarking, they are a great choice because they should lead to an enormous (>100x?) speed increase.
Thanks for this great package that is really helpful in single cell analysis.
I struggle with the memory T cell atlas reference, though. Following the tutorial and downloading the annotations from GEO results in 260 missing labels.
This wouldn't be a problem in this huge dataset I think, but after mapQuery, knnPredict.Seurat (I use a Seurat query object) fails with
Error in class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"), : no missing values are allowed
Traceback:
3: stop("no missing values are allowed") 2: class::knn(t(ref_obj$Z_corr), Embeddings(query_obj, "harmony"), ref_obj$meta_data[[label_transfer]], k = k) at utils_seurat_symphony.R#452 1: knnPredict.Seurat(tcells, ref_tbru, "cluster_name")
@joycekang Can you maybe provide all labels (same lab, right? ;)) or provide a workaround (filter out cells with missing label?).
Thank you, Mischko