Closed bobermayer closed 3 years ago
Thanks. Is this a human sample? If so, n_GT=3 might be more appropriate, for GT = 0, 1, and 2. Or are you using n_GT=2 for somatic mutations? That might be fine.
ID_prob_init is only for the initialization point, so may not affect the results. If you want to change the prior, you can use vireo_object.set_prior() to do so, before running .fit().
Is it possible to know the genotype for the ordinary cells and the contaminating cells? This may help, even if it is partially available.
Yuanhua
hi, thanks a lot for the reply. yes, these are human PBMCs after allogeneic stem cell transplant, and I'm wondering if any host cells are left. from TCR clones I know that some cells are definitely from the donor, but for the remainder it's not clear (also difficult, since host and donor are matched and related).
I tried n_GT=3 and vireo_object.set_prior(ID_prob=my_prior) as you suggested. but this doesn't put all the donor cells into the same cluster, unless I set their prior to (strictly) 1 (which throws a RuntimeWarning: divide by zero encountered in log).
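The divide-by-zero warning is what one would expect when a prior row such as [1, 0] has its logarithm taken. A minimal NumPy sketch of a prior that sits very close to 1 for the known donor cells without ever reaching 0 for the other donor is shown below. The helper name soft_prior and the eps value are hypothetical, and whether a prior this sharp is enough to keep the donor cells together in Vireo is untested here.

```python
import numpy as np

def soft_prior(n_cells, known_donor_idx, n_donors=2, donor_col=0, eps=1e-6):
    """Per-cell prior over donors: rows sum to 1 and no entry is exactly 0.

    Hypothetical helper for illustration; it is not part of vireoSNP.
    """
    # Uninformative prior for cells of unknown origin
    prior = np.full((n_cells, n_donors), 1.0 / n_donors)
    # Known donor cells: spread a tiny mass eps over the other donors ...
    prior[known_donor_idx, :] = eps / (n_donors - 1)
    # ... and put the remaining mass on the donor column
    prior[known_donor_idx, donor_col] = 1.0 - eps
    return prior

# Toy example: 6 cells, of which cells 0, 2 and 5 are known donor cells
my_prior = soft_prior(n_cells=6, known_donor_idx=[0, 2, 5])
log_prior = np.log(my_prior)  # finite everywhere, so no divide-by-zero warning
```

The resulting my_prior has one row per cell and one column per donor, which is the shape used for the per-cell priors discussed in this thread.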
I don't have genotypes unfortunately, can only use common variants.
Thanks for clarifying this. It indeed fits well with what Vireo aims to do. We can treat the host and donor as the pooled samples and separate them with Vireo. One unusual challenge is that the host cells are the minority.
Have you tried the Vireo command line to separate the cells into two "donors" via mode 1?
One tricky part is the selection of informative SNPs. Given that the donors are highly imbalanced, maybe you should consider removing SNPs with overall allele frequency around 0.5, which are likely heterozygous SNPs of the major donor. One way is, after running cellsnp-lite mode 1, to filter out SNPs with AF, e.g., between 0.2 and 0.8, with BCFtools. Let me know how it works.
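As an alternative to BCFtools, the same filter can be sketched directly on the AD and DP count matrices in Python. This assumes AD and DP are SNP-by-cell matrices (dense NumPy arrays or SciPy sparse matrices) and that the pooled AF is the ratio of their row sums. The function name keep_informative_snps and the toy counts are made up for illustration.

```python
import numpy as np

def keep_informative_snps(AD, DP, low=0.2, high=0.8):
    """Boolean mask of SNPs whose pooled alt-allele frequency lies outside [low, high].

    AD and DP are SNP-by-cell count matrices (dense or scipy.sparse).
    """
    # Pool counts over all cells; np.asarray(...).ravel() also handles sparse input
    ad_sum = np.asarray(AD.sum(axis=1)).ravel().astype(float)
    dp_sum = np.asarray(DP.sum(axis=1)).ravel().astype(float)
    # Pooled AF per SNP; SNPs without any coverage get NaN and are dropped below
    af = np.divide(ad_sum, dp_sum, out=np.full_like(ad_sum, np.nan), where=dp_sum > 0)
    return (af < low) | (af > high)

# Toy data: 4 SNPs x 3 cells
AD = np.array([[1, 0, 1], [5, 4, 6], [9, 10, 8], [0, 0, 0]])
DP = np.array([[10, 10, 10], [10, 10, 10], [10, 10, 10], [0, 0, 0]])

mask = keep_informative_snps(AD, DP)
AD_filt, DP_filt = AD[mask], DP[mask]  # keeps SNPs 0 and 2 only
```

In the toy data the pooled AFs are roughly 0.07, 0.5 and 0.9, so the SNP near 0.5 is removed together with the uncovered one.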
Hi, thanks a lot for your suggestions, and sorry for late reply.
I tried removing SNPs with overall AF between 0.2 and 0.8 (using just the ratio of row sums of AD and DP). for most of my samples, vireo splits the cells into two clusters of roughly equal size, with the known donor cells distributed randomly. but the assignments are pretty uncertain (np.max(ID_prob,1) is < 0.9 for 80-95% of cells). I guess this means either that there is not enough information to make a call or that there are simply no detectable host cells. would you agree?
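The uncertainty check mentioned here can be written as a small helper. This is a sketch that assumes ID_prob is a cell-by-donor matrix of posterior assignment probabilities. The function name assignment_summary and the toy matrix are illustrative, and the 0.9 cutoff is the one quoted in this thread.

```python
import numpy as np

def assignment_summary(ID_prob, cutoff=0.9):
    """Most likely donor per cell, a confidence flag, and the confident fraction."""
    max_prob = ID_prob.max(axis=1)   # best posterior probability per cell
    donor = ID_prob.argmax(axis=1)   # index of the most likely donor
    confident = max_prob > cutoff    # cells passing the cutoff
    return donor, confident, confident.mean()

# Toy posterior for 4 cells and 2 donors
ID_prob = np.array([[0.95, 0.05],
                    [0.60, 0.40],
                    [0.45, 0.55],
                    [0.08, 0.92]])

donor, confident, frac_confident = assignment_summary(ID_prob)
```

Comparing frac_confident between runs with and without the AF filter gives a quick way to see how the filter changes the share of confidently assigned cells.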
btw if I keep the variants with AF around 0.5, I get very similar results, except that more cells are (apparently) confidently assigned (ID_prob > 0.9). maybe given a good null model for the max(ID_prob) distribution one could improve on this cutoff (currently fixed if I understand correctly) and obtain a defined FDR?
Thanks for the updates. Do you have a rough idea of the proportion of cells from the host? It seems the proportion might be very low, so the host cells cannot be effectively clustered by vireo. If that is the case, obtaining the genotypes of these two donors might be a critical way to improve the demultiplexing, even for a small number of SNPs, e.g., from bulk RNA-seq.
I guess you have tried increasing the number of initializations with -M N_INIT, e.g., 100 or even 200. It usually helps, especially in extremely imbalanced scenarios.
Yuanhua
Hi Yuanhua, I'm not sure about the proportion of host cells, but it's probably very low (below 1-5%) or even zero at least for some samples. increasing the number of initializations didn't make a difference, so I guess at this point I can't make progress without getting genotypes. thanks again!
Hi, I'm wondering if it's possible to identify small contaminant populations using vireo and common variants? say, we have identified a major population of cells and want to find a (likely) small contaminating cell population with a different genotype among the remaining cells. I tried initializing vireoSNP.Vireo with n_donors=2, n_GT=2 and ID_prob_init=[.95,.05] for the "major" population and [.5,.5] for all other cells. I'm assuming that after running vireo_object.fit(AD,DP) I should find the genotype assignment in np.argmax(vireo_object.ID_prob,axis=1). but the fit converges to an approximately even split between inferred genotypes, and the initial assignment is not respected (vireo_object.ID_prior is [.5,.5] for all cells after the fit). do you think that's possible and I'm just not using vireo correctly? (otherwise a really great tool!) thanks!