single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Strategies for validating output #47

Open hongdavid94 opened 2 years ago

hongdavid94 commented 2 years ago

Hello,

Thank you for creating this useful package (and also the genotyping package, cellsnp-lite) that has been critical to my work in bone marrow transplant research.

I have run the vireo package on a bone marrow dataset that should contain cells from two different individuals (donor and recipient). Out of curiosity, I tried running the package with the -N parameter forced to 3 and 4. Cells were still confidently assigned to donor3 and donor4.

I also ran demultiplexing on a bone marrow dataset that should contain cells from only one individual (no transplant). Again, out of curiosity, I tried running the package with the -N parameter forced to 2, and the cells were assigned to two individuals at almost a 50:50 ratio.

Do you have any suggestions for validating the genotyped + demultiplexed results?

(For SNP genotyping, I am using the f>= 5e4 SNP set genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz.)

(For genotyping with cellsnp-lite, I am using SNPs with MAF>0.1. The mean reads per cell for the two samples mentioned above are 89,000 and 99,000.)

vireoSNP version is 0.5.7, cellsnp-lite version is 1.2.2

huangyh09 commented 2 years ago

Hi,

Thanks for sharing your trials! It's actually not too surprising that the model can find unwanted (over) sub-groups. Generally, we have two suggestions to check if it's over clustered, i.e., the "-N" is too large (in case the user doesn't know the real N):

1) Check the genotype difference between estimated donors (i.e., clusters) in "fig_GT_distance_estimated.pdf". If two donors have a genotype difference of less than, say, 0.15, the data might be over-clustered.
2) Check the ELBO for each N. The ELBO will increase as N increases, but the increase will be less dramatic after the optimal N.
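Both checks can be roughed out in a few lines of plain Python. This is only a sketch under stated assumptions: the distance here is defined as the fraction of SNPs at which two donors' called genotypes differ, which may not be the exact metric vireo plots in its figure; the helper names (`genotype_distance`, `close_donor_pairs`, `elbo_gains`) are hypothetical and not part of vireoSNP; and the genotypes and ELBO values are made-up toy numbers, not results from any real run.

```python
from itertools import combinations


def genotype_distance(gt_a, gt_b):
    """Fraction of SNPs at which two donors' called genotypes differ.

    Assumption: genotypes are ALT allele counts (0, 1, 2) over the same SNPs.
    This is an illustrative metric, not necessarily the one vireo plots.
    """
    if len(gt_a) != len(gt_b) or not gt_a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    return sum(a != b for a, b in zip(gt_a, gt_b)) / len(gt_a)


def close_donor_pairs(genotypes, threshold=0.15):
    """Return (donor_a, donor_b, distance) for pairs closer than threshold."""
    pairs = []
    for (name_a, gt_a), (name_b, gt_b) in combinations(sorted(genotypes.items()), 2):
        dist = genotype_distance(gt_a, gt_b)
        if dist < threshold:
            pairs.append((name_a, name_b, dist))
    return pairs


def elbo_gains(elbo_by_n):
    """ELBO improvement gained by going from the previous N to each N."""
    ns = sorted(elbo_by_n)
    return {n: elbo_by_n[n] - elbo_by_n[prev] for prev, n in zip(ns, ns[1:])}


# Toy data (made up): donor2 is a near copy of donor0, as in an over-split run.
donor0 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0] * 2
donor1 = [2, 0, 1, 1, 2, 0, 0, 1, 2, 1] * 2
donor2 = [1] + donor0[1:]
genotypes = {"donor0": donor0, "donor1": donor1, "donor2": donor2}

# Toy ELBO values (made up) for runs with N = 1..4.
elbo_by_n = {1: -5000.0, 2: -4000.0, 3: -3950.0, 4: -3930.0}

if __name__ == "__main__":
    for name_a, name_b, dist in close_donor_pairs(genotypes):
        print(f"{name_a} vs {name_b}: distance {dist:.2f} (possible over-clustering)")
    for n, gain in elbo_gains(elbo_by_n).items():
        print(f"N={n}: ELBO gain over N={n - 1} is {gain:.1f}")
```

In the toy numbers, the gain drops sharply after N=2 and donor2 sits very close to donor0, which is the pattern both suggestions look for.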

In your case, if you want to validate the demultiplexed results, you can compare the estimated genotypes to genotypes obtained by other means, e.g., a SNP array or even PCR on a few SNPs. Generally, we think the result is reliable, especially if vireo returns consistent results across multiple runs.
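One way to quantify consistency across runs is to compare the per-cell assignments of two runs while allowing for the fact that donor labels are arbitrary and may be swapped between runs. The sketch below assumes each run has already been loaded into a dict mapping cell barcode to assigned label; the function name `best_match_agreement` and the toy barcodes are hypothetical. It brute-forces the label matching, which is only practical for a small number of labels.

```python
from collections import Counter
from itertools import permutations


def best_match_agreement(run_a, run_b):
    """Fraction of shared cells that agree under the best label matching.

    run_a, run_b: dict mapping cell barcode -> assigned label.
    Labels are treated as arbitrary categories, so "donor0" in one run may be
    matched to "donor1" in the other.
    """
    shared = sorted(set(run_a) & set(run_b))
    if not shared:
        raise ValueError("the two runs share no cell barcodes")

    # Contingency counts of (label in run A, label in run B).
    counts = Counter((run_a[cell], run_b[cell]) for cell in shared)
    labels_a = sorted({run_a[cell] for cell in shared})
    labels_b = sorted({run_b[cell] for cell in shared})

    # Pad the shorter label list with None so unequal label counts still match.
    size = max(len(labels_a), len(labels_b))
    labels_a += [None] * (size - len(labels_a))
    labels_b += [None] * (size - len(labels_b))

    # Try every one-to-one matching and keep the one with most agreeing cells.
    best = max(
        sum(counts[(a, b)] for a, b in zip(labels_a, perm))
        for perm in permutations(labels_b)
    )
    return best / len(shared)


# Toy assignments (made up): labels are swapped between runs, one cell differs.
run_a = {"c1": "donor0", "c2": "donor0", "c3": "donor1", "c4": "donor1", "c5": "doublet"}
run_b = {"c1": "donor1", "c2": "donor1", "c3": "donor0", "c4": "donor0", "c5": "donor0"}

if __name__ == "__main__":
    print(f"agreement: {best_match_agreement(run_a, run_b):.2f}")
```

An agreement close to 1 across several runs with different random initializations would support the assignments; a low value suggests the clustering is unstable for that N.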

Yuanhua

hongdavid94 commented 2 years ago

Thank you so much Yuanhua, I will follow your suggestions and follow up!

racng commented 2 years ago

If you know the biological sex of the donors, you can look for expression of sex-specific genes like XIST (for female) and RPS4Y1 (for male). Doublets should have a higher chance of being double positive for these genes. The female donors would have more cells that are XIST+, and the male donors more cells that are RPS4Y1+. I find that XIST is quite lowly expressed, so you may have a lot of double-negative cells.
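A minimal sketch of this check, assuming the per-cell UMI counts for XIST and RPS4Y1 and the per-cell donor assignments have already been extracted into plain dicts keyed by barcode. The function name `sex_marker_fractions`, the category names, and the toy counts are all hypothetical.

```python
from collections import Counter, defaultdict


def sex_marker_fractions(assignments, xist, rps4y1):
    """Per-label fraction of cells in each XIST/RPS4Y1 expression category.

    assignments: dict barcode -> assigned label (donor, doublet, ...).
    xist, rps4y1: dict barcode -> UMI count; missing barcodes count as 0.
    """
    per_label = defaultdict(Counter)
    for cell, label in assignments.items():
        has_xist = xist.get(cell, 0) > 0
        has_rps4y1 = rps4y1.get(cell, 0) > 0
        if has_xist and has_rps4y1:
            category = "double_positive"
        elif has_xist:
            category = "XIST_only"
        elif has_rps4y1:
            category = "RPS4Y1_only"
        else:
            category = "double_negative"
        per_label[label][category] += 1

    # Convert counts to fractions within each label.
    return {
        label: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for label, cats in per_label.items()
    }


# Toy data (made up): donor0 looks female, donor1 looks male.
assignments = {
    "c1": "donor0", "c2": "donor0",
    "c3": "donor1", "c4": "donor1",
    "c5": "doublet", "c6": "doublet",
}
xist_counts = {"c1": 2, "c2": 0, "c5": 1}
rps4y1_counts = {"c3": 3, "c4": 1, "c5": 4}

if __name__ == "__main__":
    result = sex_marker_fractions(assignments, xist_counts, rps4y1_counts)
    for label in sorted(result):
        print(label, result[label])
```

The check is only informative when the pooled individuals differ in sex; if both are the same sex, the fractions will look alike for every donor.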

hongdavid94 commented 2 years ago

> If you know the biological sex of donors, you can look for expression of sex-specific genes like XIST (for female) and RPS4Y1 (for male). Doublets should have a higher chance of being double positive for these genes. The female donors would have more cells that are XIST+ and vice versa. I find that XIST is quite lowly expressed so you may have a lot of double negative cells.

Thank you. Yes, your strategy was the first thing that came to my mind, but we don't know the biological sex of the donor/recipient, and XIST and RPS4Y1 are not differentially expressed between the two groups, so we cannot take advantage of biological differences in sex.