snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Understanding data integration #19

Closed maestriev closed 4 months ago

maestriev commented 5 months ago

Very nice work!

Let's say we embedded 5 scRNA-seq datasets separately with UCE's 33-layer model.

Do we need to subset so all 5 datasets have the exact same genes in their matrix prior to tokenizing so they will be directly comparable in the UCE space?

If our goal is to integrate datasets in the UCE space by creating a UMAP (e.g. a plot similar to your integrated Mega-scale Atlas, but with fewer datasets), could you please provide some of the code you used? (I know Tabula Sapiens v2 is not available yet; I'm just curious to understand the process.) Example: Fig. 1B/2B

Could you share how you generated the integrated UMAPs? Did you run each dataset through UCE separately to get the zero-shot embeddings, then concatenate all the .h5ad files now containing the "X_uce" slot, and then run scanpy's defaults? Would something like this be appropriate:

#merge the 5 datasets into a single adata

#then use the X_uce slot for the integrated umap
sc.pp.neighbors(adata, use_rep='X_uce') 
sc.tl.umap(adata)
Yanay1 commented 5 months ago

Thanks!

For UCE you should not do any gene subsetting. That includes not doing any highly variable gene selection.

After embedding the datasets separately, you can then merge them into one dataset using anndata.concat.

After that, the code you provided to calculate the UMAP would be correct (calculating neighbors on the 'X_uce' space).

Lee951108 commented 4 months ago

Hi, I ran into an issue during data integration. Following your guidance, I first embedded each h5ad file separately and then used ad.concat to merge them. However, the same gene obtained 1028 values separately in each embedding. How should I handle this during concatenation? The default is to retain the embedding values from the first file.

Yanay1 commented 4 months ago

How are you concatenating the anndatas?

Could you clarify what you mean by the same gene having 1028 different values?

To combine anndatas, I usually do something like this:

joint_ad = anndata.concat({"dataset_1":ad_1, "dataset_2":ad_2.....}, label="dataset")

Yanay1 commented 4 months ago

What is the exact command that you are running? What are the shapes of the individual anndatas and then the shape of the joined adata?

Yanay1 commented 4 months ago

Sorry, I am not sure I understand what the issue is.

When you concatenate anndatas they are stacked in the order of the list you created. So the first cell in the merged anndata should have that highlighted UCE value.

Lee951108 commented 4 months ago

OK, my bad, let me explain my issue. First, all three files have AT1G01080 in the first position, so after embedding I got three different embedded AT1G01080 vectors. How should I integrate the three embedded AT1G01080 vectors? Second, about the UMAPs: the individual UMAPs and the joint UMAP both look weird. Did I do something wrong, or are these single-cell data just not suitable for UCE embedding?

Yanay1 commented 3 months ago

I don't see three copies of that gene in the joint anndata in those screenshots. In adatas.var there should only be one copy of that gene.