snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Need help: Data Preparation and Model Setup for Data Integration Using UCE #23

Closed Yafei611 closed 7 months ago

Yafei611 commented 7 months ago

Hi,

Great work! I really like the idea of building a universal cell embedding. I believe such an embedding will greatly facilitate downstream analysis and provide valuable insights by integrating single-cell data!!

I attempted to integrate four datasets by projecting them into the same embedding space using UCE. These datasets include:

- An internal PBMC CITEseq dataset (internal)
- PBMC scRNAseq data from an SLE study obtained from Cellxgene (cxgsle)
- PBMC scRNAseq data from a Covid study obtained from Cellxgene (cxgcvd)
- The UCE example PBMC10k dataset (pbmc10k)

To start, I randomly sampled 2000 cells from each dataset (raw counts, AnnData) to create four smaller subsets. Next, I ran the 33-layer UCE model with a batch size of 25 on each subset. I then concatenated the X_uce embeddings from the UCE output for each subset and generated a UMAP from the concatenated matrix.
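The subsample-and-concatenate step above can be sketched roughly as follows. This is a minimal illustration on synthetic arrays, not the actual pipeline used in this thread: the helper names `subsample_indices` and `stack_embeddings` are hypothetical, and the small random matrices only stand in for the real per-dataset X_uce embeddings.

```python
import numpy as np


def subsample_indices(n_cells, n_keep, seed=0):
    """Return sorted indices of up to n_keep cells drawn without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = min(n_keep, n_cells)
    return np.sort(rng.choice(n_cells, size=n_keep, replace=False))


def stack_embeddings(embeddings):
    """Stack per-dataset embedding matrices and record each row's dataset of origin."""
    names = list(embeddings)
    combined = np.vstack([embeddings[name] for name in names])
    labels = np.concatenate(
        [np.full(embeddings[name].shape[0], name) for name in names]
    )
    return combined, labels


# Synthetic stand-ins for the four embedding matrices (assumption: real ones
# would be read from each UCE output file).
rng = np.random.default_rng(0)
dim = 8  # placeholder width, not the real UCE embedding size
fake = {
    name: rng.normal(size=(50, dim))
    for name in ["internal", "cxgsle", "cxgcvd", "pbmc10k"]
}

idx = subsample_indices(n_cells=100, n_keep=20)
combined, dataset_labels = stack_embeddings(fake)
print(combined.shape, dataset_labels[:3])
```

Keeping the dataset label next to each row makes it possible to color the UMAP by dataset afterwards. With scanpy, the combined matrix could presumably be wrapped in an AnnData object and passed to `sc.pp.neighbors` with `use_rep` pointing at the embedding before calling `sc.tl.umap`.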

However, the four PBMC datasets appear clearly separated on the UMAP, which wasn't what I expected. Could you provide some insights or suggestions on how to better prepare the datasets and set up the model properly?

image

Yanay1 commented 7 months ago

From that UMAP it seems the cell type clusters might be matching?

UCE UMAPs from multiple datasets can commonly show dataset-specific effects. These could very well be real effects, especially if the datasets come from different disease states, as in this example.

Some things to look at would be UMAPs at coarser resolutions, as well as the actual similarity between cell clusters. You could check the latter using sc.pl.dendrogram, for example. You could also double-check how good the integration is by, for example, trying to transfer labels between datasets.
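One rough way to look at "actual similarity between cell clusters" without relying on the UMAP layout is to compare cluster centroids across datasets directly in embedding space. The sketch below does this with cosine similarity on synthetic data. It is not what sc.pl.dendrogram computes internally; the function names and the toy data are assumptions made for illustration only.

```python
import numpy as np


def cluster_centroids(X, labels):
    """Return sorted cluster names and the mean embedding of each cluster."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centroids = np.vstack([X[labels == name].mean(axis=0) for name in names])
    return names, centroids


def cosine_similarity_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T


# Toy data: two cell types that point in different directions, plus a small
# dataset-specific shift applied to the second dataset.
rng = np.random.default_rng(0)
dim = 6
direction = {"B cell": np.eye(dim)[0] * 5.0, "T cell": np.eye(dim)[1] * 5.0}


def make_dataset(offset, n=40):
    X = np.vstack(
        [direction[t] + offset + rng.normal(scale=0.1, size=(n, dim)) for t in direction]
    )
    labels = np.repeat(list(direction), n)
    return X, labels


shift = np.zeros(dim)
shift[2] = 0.5  # hypothetical dataset effect
X_a, y_a = make_dataset(np.zeros(dim))
X_b, y_b = make_dataset(shift)

names_a, cent_a = cluster_centroids(X_a, y_a)
names_b, cent_b = cluster_centroids(X_b, y_b)
sim = cosine_similarity_matrix(cent_a, cent_b)  # rows: dataset A, columns: dataset B
print(names_a, names_b)
print(np.round(sim, 3))
```

If matching cell types score higher with each other than with other cell types despite the dataset shift, the embedding is preserving cell identity even when the UMAP separates the datasets.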

Yafei611 commented 7 months ago

Thank you for your prompt response! The following UMAP shows the cell type labels. I was expecting similar cell types from different datasets to be grouped together.

image

I agree, datasets from different studies are often generated using varied pipelines or platforms, leading to real dataset-specific effects. But these real effects are not desired in the downstream analysis. Do you think it is a good idea to perform some sort of alignment or "batch effect" removal on the UCE embedding?

I tried to transfer labels between datasets using a 2-layer NN, and it yielded very good results. I will try UMAP with different resolutions and dendrogram. Thanks a lot!
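For readers who want a quick version of the label-transfer check, here is a minimal sketch that uses a k-nearest-neighbor majority vote in embedding space. This is a simpler stand-in, not the 2-layer NN mentioned above, and the function name, the value of `k`, and the synthetic data are all assumptions for illustration.

```python
from collections import Counter

import numpy as np


def knn_transfer_labels(ref_X, ref_labels, query_X, k=5):
    """Assign each query cell the majority label among its k nearest reference cells."""
    ref_labels = np.asarray(ref_labels)
    # Squared Euclidean distances via the expansion |q|^2 + |r|^2 - 2 q.r,
    # which avoids building an (n_query, n_ref, dim) array.
    d2 = (
        (query_X ** 2).sum(axis=1)[:, None]
        + (ref_X ** 2).sum(axis=1)[None, :]
        - 2.0 * query_X @ ref_X.T
    )
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = ref_labels[nearest]  # shape (n_query, k)
    return np.array([Counter(row.tolist()).most_common(1)[0][0] for row in votes])


# Toy reference and query datasets: the same two cell types, with a small
# dataset-specific offset added to the query.
rng = np.random.default_rng(0)
centers = {"T cell": np.zeros(4), "B cell": np.full(4, 10.0)}


def make_dataset(offset, n=30):
    X = np.vstack(
        [c + offset + rng.normal(scale=0.5, size=(n, 4)) for c in centers.values()]
    )
    y = np.repeat(list(centers), n)
    return X, y


ref_X, ref_y = make_dataset(0.0)
query_X, query_y = make_dataset(0.5)

pred = knn_transfer_labels(ref_X, ref_y, query_X, k=5)
accuracy = float((pred == query_y).mean())
print(f"label transfer accuracy: {accuracy:.2f}")
```

High transfer accuracy alongside visibly separated datasets on the UMAP would suggest that cell types still occupy consistent regions of the embedding, with the dataset effect acting as a smaller shift on top.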

Yanay1 commented 7 months ago

How many genes does each dataset have?

Yafei611 commented 7 months ago

Here is a Venn diagram showing the number of genes in each dataset and their overlaps. About 9k genes were left in each UCE-processed dataset.

image

I tried to filter the internal, cxgsle, and cxgcvd datasets (pbmc10k was excluded due to fewer genes) to retain only about 15k shared genes across the three datasets. Using the filtered datasets as input for UCE produced very similar results.
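The gene overlap described above can be computed with plain set operations. The sketch below uses short toy gene lists; in practice the lists would presumably come from each AnnData object's `var_names`. The function names are hypothetical.

```python
def shared_genes(gene_lists):
    """Return the set of genes present in every dataset."""
    sets = [set(genes) for genes in gene_lists.values()]
    return set.intersection(*sets)


def pairwise_overlap(gene_lists):
    """Return the number of shared genes for every pair of datasets."""
    names = list(gene_lists)
    return {
        (a, b): len(set(gene_lists[a]) & set(gene_lists[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy gene lists standing in for the real per-dataset gene names.
gene_lists = {
    "internal": ["CD3E", "CD19", "MS4A1", "NKG7"],
    "cxgsle": ["CD3E", "MS4A1", "NKG7", "LYZ"],
    "cxgcvd": ["CD3E", "MS4A1", "LYZ", "GNLY"],
}

shared = shared_genes(gene_lists)
overlaps = pairwise_overlap(gene_lists)
print(sorted(shared))
print(overlaps)
```

This is useful as a diagnostic for how different the gene sets are, independent of whether the datasets are actually filtered down to the shared genes before running UCE.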

Yanay1 commented 7 months ago

For UCE, you should not do any gene filtering (try to keep as many genes as possible in each dataset). This might change some of the embeddings, but I don't think it will change the dataset-specific clustering.

Yafei611 commented 7 months ago

Thank you for the insights! I tried both the filtered and unfiltered datasets and didn't notice significant differences in the UMAP. I'll continue to explore the datasets using UCE and will keep you posted:)