single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

identify contaminant population #29

Closed: bobermayer closed this issue 2 years ago

bobermayer commented 2 years ago

Hi, I'm wondering whether it's possible to identify small contaminant populations using vireo and common variants. Say we have identified a major population of cells and want to find a (likely) small contaminating cell population with a different genotype among the remaining cells. I tried initializing vireoSNP.Vireo with n_donors=2, n_GT=2 and ID_prob_init = [.95,.05] for the "major" population and [.5,.5] for all other cells. I'm assuming that after running vireo_object.fit(AD,DP) I should find the genotype assignment in np.argmax(vireo_object.ID_prob,axis=1). But the fit converges to an approximately even split between the inferred genotypes, and the initial assignment is not respected (vireo_object.ID_prior is [.5,.5] for all cells after the fit). Do you think this is possible and I'm just not using vireo correctly? (Otherwise a really great tool!) Thanks!

huangyh09 commented 2 years ago

Thanks. Is this a human sample? If so, n_GT=3 might be more appropriate, for GT = 0, 1, and 2. Or are you using n_GT=2 for somatic mutations? That might be fine.

ID_prob_init only sets the initialization point, so it may not affect the results. If you want to change the prior, you can use vireo_object.set_prior() to do so before running .fit().

Is it possible to know the genotype for the ordinary cells and the contaminating cells? This may help, even if it is partially available.

Yuanhua

bobermayer commented 2 years ago

Hi, thanks a lot for the reply. Yes, these are human PBMCs after allogeneic stem cell transplant, and I'm wondering whether any host cells are left. From TCR clones I know that some cells are definitely from the donor, but for the remainder it's not clear (it's also difficult because host and donor are matched and related). I tried n_GT=3 and vireo_object.set_prior(ID_prob=my_prior) as you suggested, but this doesn't put all the donor cells into the same cluster unless I set their prior to (strictly) 1, which throws a RuntimeWarning: divide by zero encountered in log. Unfortunately I don't have genotypes and can only use common variants.
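A note on the warning above: it is consistent with the log of a prior entry that is exactly 0 being taken somewhere in the fit. A minimal NumPy sketch of one way to keep a "nearly certain" prior finite is shown below. The helper `make_id_prior` and its parameters (`known_donor_cells`, `donor_col`, `confidence`, `eps`) are hypothetical names for illustration, not part of vireoSNP, and whether such a prior is enough to keep the known donor cells together is not established in this thread.

```python
import numpy as np

def make_id_prior(n_cell, known_donor_cells, donor_col=0, confidence=0.999, eps=1e-6):
    """Build an (n_cell, 2) identity prior (hypothetical helper, not a vireoSNP function).

    Cells listed in `known_donor_cells` get `confidence` for column `donor_col`;
    all other cells keep a uniform [0.5, 0.5] prior.
    """
    prior = np.full((n_cell, 2), 0.5)
    prior[known_donor_cells, donor_col] = confidence
    prior[known_donor_cells, 1 - donor_col] = 1.0 - confidence
    # Clip away exact 0 and 1 so that log(prior) stays finite, then renormalize rows.
    prior = np.clip(prior, eps, 1.0 - eps)
    prior /= prior.sum(axis=1, keepdims=True)
    return prior

# Toy example: 6 cells, cells 0, 2 and 3 are known (e.g. from TCR clones) to be donor.
# Even with confidence=1.0 the clipped prior contains no exact zeros.
my_prior = make_id_prior(n_cell=6, known_donor_cells=[0, 2, 3], confidence=1.0)
log_prior = np.log(my_prior)  # finite everywhere, no divide-by-zero warning
```

The resulting array could then be supplied through set_prior() as discussed above; the value of `eps` controls how strongly the known cells are pinned.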

huangyh09 commented 2 years ago

Thanks for clarifying this. It indeed fits well with what Vireo aims to do. We can treat the host and donor as the pooled samples and separate them with Vireo. One unusual challenge is that the host cells are the minority.

Have you tried the Vireo command line to separate the cells into two "donors" via the mode 1?

One tricky part is the selection of informative SNPs. Given that the donors are highly imbalanced, maybe you should consider removing SNPs with an overall allele frequency around 0.5, which are likely heterozygous SNPs of the major donor. One way is to run cellsnp-lite mode 1 and then filter out SNPs with AF between, e.g., 0.2 and 0.8 using BCFtools. Let me know how it works.
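For anyone working directly with the count matrices in Python rather than with BCFtools, the same filter can be sketched with NumPy. This assumes AD and DP are dense variant-by-cell arrays (for scipy sparse matrices, convert or compute the row sums first); the function name `informative_snp_mask` and the toy counts are made up for illustration.

```python
import numpy as np

def informative_snp_mask(AD, DP, low=0.2, high=0.8):
    """Return (keep, af): a boolean mask of SNPs whose pooled AF lies outside [low, high].

    AD and DP are (n_var, n_cell) arrays of alternative-allele and total counts.
    Hypothetical helper for illustration, not part of vireoSNP or cellsnp-lite.
    """
    AD = np.asarray(AD, dtype=float)
    DP = np.asarray(DP, dtype=float)
    alt = AD.sum(axis=1)   # pooled ALT counts per SNP
    tot = DP.sum(axis=1)   # pooled depth per SNP
    # Pooled allele frequency; SNPs with zero depth get NaN and are dropped.
    af = np.divide(alt, tot, out=np.full(alt.shape, np.nan), where=tot > 0)
    keep = (tot > 0) & ((af < low) | (af > high))
    return keep, af

# Toy data: 4 SNPs x 3 cells.
AD = np.array([[0, 1, 0], [2, 3, 2], [5, 4, 6], [0, 0, 0]])
DP = np.array([[5, 5, 5], [5, 5, 5], [5, 5, 6], [0, 0, 0]])
keep, af = informative_snp_mask(AD, DP)
AD_filt, DP_filt = AD[keep], DP[keep]  # SNPs 0 and 2 remain
```

The 0.2 and 0.8 bounds are the ones suggested above; with a very small minority population, the pooled AF is dominated by the major donor, which is the reason for dropping the middle range.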

bobermayer commented 2 years ago

Hi, thanks a lot for your suggestions, and sorry for the late reply.

I tried removing SNPs with overall AF between 0.2 and 0.8 (using just the ratio of the row sums of AD and DP). For most of my samples, vireo splits the cells into two clusters of roughly equal size, with the known donor cells distributed randomly between them. But the assignments are pretty uncertain (np.max(ID_prob,1) is < 0.9 for 80-95% of cells). I guess this means either that there is not enough information to make a call or that there are simply no detectable host cells. Would you agree?

By the way, if I keep the variants with AF around 0.5, I get very similar results, except that more cells are (apparently) confidently assigned (ID_prob > 0.9). Maybe, given a good null model for the max(ID_prob) distribution, one could improve on this cutoff (currently fixed, if I understand correctly) and obtain a defined FDR?
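The confidence summary described in this comment can be sketched as follows, assuming ID_prob is the cell-by-donor posterior matrix from the fit. The function `summarize_assignments`, the `-1` label for unassigned cells, and the toy probabilities are illustrative choices only; the 0.9 cutoff is the one quoted in this thread.

```python
import numpy as np

def summarize_assignments(ID_prob, cutoff=0.9):
    """Summarize donor assignments from an (n_cell, n_donor) probability matrix.

    Returns per-cell labels (-1 where max probability <= cutoff), the fraction of
    confidently assigned cells, and per-donor counts among confident cells.
    Hypothetical helper for illustration, not a vireoSNP function.
    """
    ID_prob = np.asarray(ID_prob, dtype=float)
    best = np.argmax(ID_prob, axis=1)   # most likely donor per cell
    conf = np.max(ID_prob, axis=1)      # its posterior probability
    confident = conf > cutoff
    labels = np.where(confident, best, -1)
    counts = np.bincount(best[confident], minlength=ID_prob.shape[1])
    return labels, confident.mean(), counts

# Toy posterior for 4 cells and 2 donors.
ID_prob = np.array([[0.95, 0.05], [0.60, 0.40], [0.08, 0.92], [0.50, 0.50]])
labels, frac_confident, counts = summarize_assignments(ID_prob)
```

Comparing `frac_confident` and `counts` between the filtered and unfiltered SNP sets, and checking which labels the known donor cells receive, gives a quick view of whether the split carries any signal.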

huangyh09 commented 2 years ago

Thanks for the updates. Do you have a rough idea of the proportion of cells from the host? It seems the proportion might be very low, so the host cells cannot be effectively clustered by vireo. If that is the case, obtaining the genotypes of these two donors, even for a small number of SNPs (e.g., from bulk RNA-seq), might be critical for improving the demultiplexing.

I guess you have tried increasing the number of initializations with -M N_INIT, e.g., to 100 or even 200. It usually helps, especially in extremely imbalanced scenarios.

Yuanhua

bobermayer commented 2 years ago

Hi Yuanhua, I'm not sure about the proportion of host cells, but it's probably very low (below 1-5%) or even zero, at least for some samples. Increasing the number of initializations didn't make a difference, so I guess at this point I can't make progress without getting genotypes. Thanks again!