Closed ccruizm closed 3 years ago
Hi Chris,
Could you please paste the architecture details of your model here? So you have a couple of datasets from regular RNA-seq and one from snRNA-seq? What did you use as the batch key? The datasets?
Thanks for the quick reply. This is how I built the reference:
condition_key = "author"
adata.obs['author'] = adata.obs['author'].astype('category')
adata = sca.data.normalize_hvg(adata,batch_key=condition_key,n_top_genes=2000, logtrans_input = False)
# I normalized the data previously, so I set logtrans_input = False
Using 2 HVGs from full intersect set
Using 8 HVGs from n_batch-1 set
Using 52 HVGs from n_batch-2 set
Using 92 HVGs from n_batch-3 set
Using 141 HVGs from n_batch-4 set
Using 174 HVGs from n_batch-5 set
Using 221 HVGs from n_batch-6 set
Using 287 HVGs from n_batch-7 set
Using 421 HVGs from n_batch-8 set
Using 602 HVGs from n_batch-9 set
Using 2000 HVGs
network = sca.models.scArches(task_name='atlas',
x_dimension=adata.shape[1],
z_dimension=10,
architecture=[128, 128],
gene_names=adata.var_names.tolist(),
conditions=adata.obs[condition_key].unique().tolist(),
alpha=0.001,
loss_fn='nb',
model_path="./models/scArches/",
)
network.train(adata,
condition_key=condition_key,
n_epochs=100,
batch_size=128,
save=True,
retrain=True)
latent_adata = network.get_latent(adata, condition_key)
sc.pp.neighbors(latent_adata)
With a lower alpha and more epochs I get more detailed sub-clustering, but the dataset generated from nuclei always stands out. So far only one dataset was processed from nuclei, but later on I will include others generated with the same technique, and I want a reference that can be used to query either cell or nuclei experiments.
I would suggest making the model deeper ([128, 128, 128]) and also making the batch size smaller: 32, or 64 at most. I guess the nb loss does not fit nuclei experiments; try the zinb loss and see if it works.
Do all your experiments have the same genes? It is necessary to have the exact same gene set.
Finally, if that does not work, try switching to the sse or mse loss; that would probably solve the problem.
Let me know how it goes, please.
Mo
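For concreteness, the suggestions above can be captured as the keyword arguments that would change in the `sca.models.scArches(...)` and `network.train(...)` calls from the earlier snippet. This is only a sketch; the names mirror the code above, and the exact values are illustrative:

```python
# Keyword-argument changes suggested above, relative to the earlier snippet.
# Illustrative values, not a definitive configuration.
model_kwargs = {
    "architecture": [128, 128, 128],  # one extra hidden layer (deeper model)
    "loss_fn": "zinb",                # zero-inflated NB; try 'sse'/'mse' if it fails
}
train_kwargs = {
    "batch_size": 32,                 # smaller batches: 32, or 64 at most
    "n_epochs": 100,                  # unchanged from the original call
}

# These dicts would be splatted into the calls, e.g.:
#   network = sca.models.scArches(task_name='atlas', ..., **model_kwargs)
#   network.train(adata, condition_key=condition_key, **train_kwargs)
```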
Thanks for the suggestions, I will test them. When you mentioned that the experiments must have the exact same gene set, did you mean the intersected/shared genes among all datasets? (I have seen this approach when using scIB:
https://github.com/theislab/scib/blob/master/notebooks/data_preprocessing/pancreas/01_collect_human_pancreas_studies.ipynb) Or could it be the merged (union) gene set across all studies? I merged the matrices using Seurat,
so not all genes are present in all samples initially, but after merging they all have the same features, and the genes that were not shared in the beginning are filled with zeros.
I have been using the latter, since the reference I am building is for a tumor type (and I expect higher heterogeneity compared with normal tissue), and if I intersect the features across all the studies I end up with only ~5K common genes. What would be your recommendation in this case?
Yes, I meant the intersection of genes. It would also be worth giving it a chance, since there might be a lot of genes that are zero in the nuclei data but not in the other datasets, and vice versa; for those genes there is no truly shared feature set. And even after HVG selection, there might be genes that are HVGs only in the nuclei dataset and not in the others. Alternatively, increase the number of HVGs to 5k or so to improve the chance of overlap.
Therefore I would suggest:
increase the HVG set to 5000 or so and check again with different loss functions
try subsetting to the intersection when you concatenate the datasets, rerun the models, and see
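To make the intersection-vs-union distinction concrete, here is a small self-contained sketch (plain Python sets; the gene names are made up for illustration) of how a union merge with zero filling differs from subsetting to the shared genes before concatenating:

```python
# Toy gene sets for two studies (made-up names, for illustration only).
cells_genes = {"ACTB", "GAPDH", "MALAT1", "CD3E"}
nuclei_genes = {"ACTB", "GAPDH", "MALAT1", "NEAT1"}

# Union merge (what a Seurat-style merge effectively does): every gene is
# kept, and genes missing from a study are filled with zeros there.
union = sorted(cells_genes | nuclei_genes)
zero_filled_in_nuclei = sorted(set(union) - nuclei_genes)

# Intersection: only genes measured in every study survive, so no
# artificial zeros are introduced.
shared = sorted(cells_genes & nuclei_genes)

print(union)                  # ['ACTB', 'CD3E', 'GAPDH', 'MALAT1', 'NEAT1']
print(zero_filled_in_nuclei)  # ['CD3E'] -- all-zero in the nuclei study
print(shared)                 # ['ACTB', 'GAPDH', 'MALAT1']
```

Zero-filled genes like `CD3E` above are exactly the ones that can end up flagged as batch-specific HVGs, which is why subsetting to the intersection before concatenating is worth trying.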
Good day,
I am excited to use this tool to build a reference for my study. However, I am having problems integrating single-cell and single-nucleus datasets. When I use the standard integration method in Seurat,
I get the expected output, but scArches
is not able to integrate the data coming from snRNA-seq. For both pipelines, I used the different datasets as the batch key. See the attached picture. What do you think the issue could be? Thanks in advance!