scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

ingest after bbknn produces poor results #1833

Open yyoshiaki opened 3 years ago

yyoshiaki commented 3 years ago

Hi, I tried ingest with a reference built using BBKNN. As @ivirshup suggested, ingest ran once I added adata_ref.uns['neighbors']['params']['metric'] = 'euclidean'. However, the result was quite poor.


In contrast, merging all datasets (e.g., the references and the query) worked well. But when the goal is to carry over the reference embedding, I would rather use ingest than run bbknn again. Is there an option that supports this case? Or should I ask about this in the BBKNN repo?


This notebook reproduces the problem: https://nbviewer.jupyter.org/github/yyoshiaki/ingest_after_bbknn/blob/main/notebook.ipynb https://github.com/yyoshiaki/ingest_after_bbknn/blob/main/notebook.ipynb

Originally posted by @yyoshiaki in https://github.com/theislab/scanpy/issues/1122#issuecomment-838476193

ivirshup commented 3 years ago

I'm not completely sure it makes sense to run ingest while trying to use the bbknn neighbor graph. @Koncopd, do you have any thoughts here?

Koncopd commented 3 years ago

I am also not sure. You run bbknn for the reference, but then the standard knn for the query. I don't think it makes much sense.
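
To make the mismatch concrete, here is a minimal toy sketch in plain Python. It is not how bbknn or ingest are implemented; the helper names `knn` and `batch_balanced_knn`, the coordinates, and the batch labels are all made up for illustration. It only assumes the general idea that a batch-balanced graph takes the nearest neighbors from each batch separately, while a standard kNN search takes the globally nearest cells.

```python
import math

def knn(point, cells, k):
    """Indices of the k cells closest to `point` (plain Euclidean kNN)."""
    order = sorted(range(len(cells)), key=lambda i: math.dist(point, cells[i]))
    return order[:k]

def batch_balanced_knn(point, cells, batches, k_per_batch):
    """Indices of the k_per_batch closest cells, taken from each batch separately."""
    neighbors = []
    for batch in sorted(set(batches)):
        members = [i for i, b in enumerate(batches) if b == batch]
        members.sort(key=lambda i: math.dist(point, cells[i]))
        neighbors.extend(members[:k_per_batch])
    return neighbors

# Hypothetical 2-D coordinates: batch "b" is the same population as batch "a",
# shifted by +5 on the first axis to mimic a batch effect.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.2),
         (5.0, 0.0), (5.1, 0.0), (5.0, 0.1), (5.2, 0.2)]
batches = ["a", "a", "a", "a", "b", "b", "b", "b"]
query = (0.04, 0.01)

plain = knn(query, cells, k=4)                                   # only batch "a" cells
balanced = batch_balanced_knn(query, cells, batches, k_per_batch=2)  # two cells per batch
print("standard kNN:      ", plain)
print("batch-balanced kNN:", balanced)
```

In this toy case the standard search only ever sees batch "a", whereas the balanced search deliberately reaches into batch "b". An embedding built on the second kind of graph and a query mapped with the first kind are therefore not looking at the same neighborhoods.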

yyoshiaki commented 3 years ago

Thank you. For example, consider a situation where I first create a reference dataset using bbknn and then map bulk RNA-seq data onto that reference. I would be happy if I could classify the bulk RNA-seq dataset without rearranging the reference embedding. Is it possible to achieve this with good quality?

ivirshup commented 3 years ago

Some questions that you may want to consider: Does the bulk RNA-seq dataset actually contain "pure" cell types? Is it possible a deconvolution approach would make more sense? If you're looking for classification, is projection into the umap embedding important?

As a side note: sc.datasets.pbmc3k and sc.datasets.pbmc3k_processed are the same dataset.
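
If classification alone is the goal, the labels can in principle be transferred without touching the UMAP at all. Below is a minimal sketch of a k-nearest-neighbor majority vote in plain Python. It assumes the reference and query already share a coordinate space (for example, the same PCA), and the function name `transfer_labels`, the coordinates, and the labels are hypothetical. It is an illustration of the idea, not a description of what ingest does internally.

```python
import math
from collections import Counter

def transfer_labels(ref_coords, ref_labels, query_coords, k=3):
    """Give each query sample the majority label of its k nearest reference cells."""
    predicted = []
    for q in query_coords:
        # Rank reference cells by Euclidean distance to the query sample.
        nearest = sorted(range(len(ref_coords)),
                         key=lambda i: math.dist(q, ref_coords[i]))[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted

# Hypothetical reference cells in a shared low-dimensional space.
ref_coords = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
              (4.0, 4.0), (4.2, 3.9), (3.9, 4.3)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]

# Hypothetical query samples placed in the same space.
query_coords = [(0.3, 0.2), (3.8, 4.1)]

predicted = transfer_labels(ref_coords, ref_labels, query_coords, k=3)
print(predicted)
```

Whether this is meaningful for bulk samples still depends on the questions above, in particular on whether each bulk sample really corresponds to a single cell type.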

yyoshiaki commented 3 years ago

Yes, I'm assuming pure cell populations of a previously known cell type. As you suggest, deconvolution is a nice idea. It had not occurred to me to deconvolve the pure populations to check whether each pool is truly pure and which cells in a single-cell experiment it originates from. I thought it would be straightforward to project the bulk samples onto the UMAP. I also tried the same workflow to ingest public data into our dataset, which consists of several experiments batch-corrected with bbknn, but the ingestion was similarly poor, as in the notebook above. In these cases, I believe the projection is important for visualization and for acquiring the known cell labels. Is it possible to achieve these with ingest?

I'm sorry that the example may have made the question ambiguous. I also apologize for mixing up the datasets.