theislab / scarches

Reference mapping for single-cell genomics
https://docs.scarches.org/en/latest/
BSD 3-Clause "New" or "Revised" License
331 stars · 51 forks

Problems integrating cell and nuclei datasets #33

Closed · ccruizm closed this 3 years ago

ccruizm commented 3 years ago

Good day,

I am excited to use this tool to build a reference for my study. However, I am having problems integrating single-cell and single-nuclei datasets. When I use Seurat's standard integration method I get the expected output, but scArches is not able to integrate the data coming from snRNA-seq. For both pipelines I used the different datasets as the batch key. See the attached picture.

What do you think the issue can be? thanks in advance!

(Screenshot attached: 2020-11-26 at 11:02:21)
M0hammadL commented 3 years ago

Hi Chris,

Could you please paste the architecture details of the model here? So you have a couple of datasets from regular RNA-seq and one from snRNA-seq? What do you use as the batch key — the datasets?

ccruizm commented 3 years ago

Thanks for the quick reply. This is how I built the reference:

import scanpy as sc
import scarches as sca

condition_key = "author"
adata.obs['author'] = adata.obs['author'].astype('category')

adata = sca.data.normalize_hvg(adata,batch_key=condition_key,n_top_genes=2000, logtrans_input = False)
# I normalized the data previously, so I set logtrans_input = False

Using 2 HVGs from full intersect set
Using 8 HVGs from n_batch-1 set
Using 52 HVGs from n_batch-2 set
Using 92 HVGs from n_batch-3 set
Using 141 HVGs from n_batch-4 set
Using 174 HVGs from n_batch-5 set
Using 221 HVGs from n_batch-6 set
Using 287 HVGs from n_batch-7 set
Using 421 HVGs from n_batch-8 set
Using 602 HVGs from n_batch-9 set
Using 2000 HVGs
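The log above reflects batch-aware HVG selection: genes are ranked by the number of batches in which they are highly variable, and the top `n_top_genes` are kept. A toy sketch of that ranking logic (the per-batch HVG lists here are hypothetical, and this is not the actual scArches implementation, which ranks genes by dispersion within each batch):

```python
from collections import Counter

# Hypothetical per-batch HVG lists; in practice these come from a
# per-batch dispersion ranking, not hard-coded names.
hvgs_per_batch = [
    ["ACTB", "MALAT1", "NEAT1"],
    ["ACTB", "MALAT1", "GAPDH"],
    ["ACTB", "CD3E", "NEAT1"],
]

# Count, for each gene, how many batches call it highly variable.
counts = Counter(g for batch in hvgs_per_batch for g in batch)

# Prefer genes that are highly variable in the most batches
# (ties broken alphabetically here for determinism).
ranked = sorted(counts, key=lambda g: (-counts[g], g))
n_top_genes = 4
selected = ranked[:n_top_genes]
print(selected)
```

Genes from the "full intersect set" (highly variable in every batch) are taken first, then the requirement is relaxed batch by batch until the target count is reached, which matches the `n_batch-1`, `n_batch-2`, ... lines in the log.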

network = sca.models.scArches(task_name='atlas',
                              x_dimension=adata.shape[1],
                              z_dimension=10,
                              architecture=[128, 128],
                              gene_names=adata.var_names.tolist(),
                              conditions=adata.obs[condition_key].unique().tolist(),
                              alpha=0.001,
                              loss_fn='nb',
                              model_path="./models/scArches/",
                              )

network.train(adata,
              condition_key=condition_key,
              n_epochs=100,
              batch_size=128,
              save=True,
              retrain=True)

latent_adata = network.get_latent(adata, condition_key)
sc.pp.neighbors(latent_adata)

With a lower alpha and more epochs I get more detailed sub-clustering, but the dataset generated from nuclei always stands out. So far only one dataset was processed from nuclei, but later on I will include others generated with the same technique, and I want a reference that can be used to query either cell or nuclei experiments.

M0hammadL commented 3 years ago

I would suggest making the model deeper ([128, 128, 128]) and also making the batch size smaller: 32, or at most 64. I suspect the NB loss does not fit nuclei experiments; try the ZINB loss and see if it works. Do all your experiments have the same genes? It is necessary to have the exact same gene set.

Finally, if that does not work, try switching to the SSE or MSE loss; that would probably solve the problem.

Let me know how it goes, please.
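Applied to the configuration posted earlier in the thread, these suggestions would look roughly like the sketch below (not a verified recipe; it reuses the same `adata` and `condition_key` from above and only changes the architecture, loss, and batch size):

```python
import scarches as sca

# Sketch of the earlier model call with the suggested changes applied:
# one extra hidden layer, ZINB loss, and a smaller training batch size.
network = sca.models.scArches(task_name='atlas',
                              x_dimension=adata.shape[1],
                              z_dimension=10,
                              architecture=[128, 128, 128],  # deeper than [128, 128]
                              gene_names=adata.var_names.tolist(),
                              conditions=adata.obs[condition_key].unique().tolist(),
                              alpha=0.001,
                              loss_fn='zinb',  # try ZINB; fall back to 'sse' or 'mse'
                              model_path="./models/scArches/",
                              )

network.train(adata,
              condition_key=condition_key,
              n_epochs=100,
              batch_size=32,  # 32, or at most 64
              save=True,
              retrain=True)
```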

Mo

ccruizm commented 3 years ago

Thanks for the suggestions, I will test them. When you mentioned that the experiments must have the exact same gene set, did you mean the intersection of genes shared among all datasets (I have seen this approach in scIB: https://github.com/theislab/scib/blob/master/notebooks/data_preprocessing/pancreas/01_collect_human_pancreas_studies.ipynb), or could it be the union of genes across all studies? I merged the matrices using Seurat, so not all genes are present in all samples initially, but after merging they all have the same features, and the genes that were not shared to begin with are filled with zeros.

I have been using the latter, since the reference I am building is for a tumor type (so I expect higher heterogeneity than in normal tissue), and if I intersect the features across all studies I end up with only ~5K common genes. What would be your recommendation in this case?
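To make the two options concrete, here is a toy comparison (hypothetical gene lists) of intersecting features versus the zero-filled union that a Seurat-style merge produces:

```python
# Hypothetical gene lists for a cells dataset and a nuclei dataset.
genes_cells = ["ACTB", "GAPDH", "CD3E", "MKI67"]
genes_nuclei = ["ACTB", "GAPDH", "NEAT1", "MALAT1"]

# Option 1: intersection -- only genes measured in every dataset survive.
shared = sorted(set(genes_cells) & set(genes_nuclei))

# Option 2: union -- a merge keeps all genes and fills the missing ones
# with zeros, which the model can mistake for biological signal.
union = sorted(set(genes_cells) | set(genes_nuclei))
zero_filled_in_nuclei = sorted(set(genes_cells) - set(genes_nuclei))

print(shared)                 # genes usable without artificial zeros
print(zero_filled_in_nuclei)  # genes that become all-zero columns for nuclei
```

With the zero-filled union, every gene absent from one protocol becomes a systematic zero column for that protocol, which is exactly the kind of technical difference an integration model may fail to remove.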

M0hammadL commented 3 years ago

Yes, I meant the intersection of genes. It would also be worth giving that a chance, since there may be many genes that are zero in the nuclei data but not in the other datasets, and vice versa, in which case there would not be a shared feature set. And even after HVG selection, there may be genes that are highly variable only in the nuclei dataset and not in the others. Alternatively, increase the number of HVGs to 5k or so to improve the chance of overlap.

Therefore I would suggest: