Closed mihem closed 1 year ago
Hi @mihem, thanks for sharing this issue and sorry for the long delay.
You are already very familiar with the method suite. Your second batch is from PBMC (right?) which indeed works very reasonably. For the first batch of CSF samples, it is certainly concerning. The n_variants look OK, but with and without --forceLearnGT
are too different. I guess if you change it to no donor genotype mode, the results will be concordant with --forceLearnGT
.
If that is the case, I would suggest performing some genotype diagnosis: you can compare the difference between your donor_VCF and the estimated genotype from CSF samples, so you will have a 4-by-4 genotype distance matrix. Something like this tutorial: https://nbviewer.org/github/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb
From your PBMC sample, your donor VCF should be OK. I guess the CSF sample is not likely to be label-swapped but if the above 4-by-4 matrix doesn't suggest a strong donor matching, you may consider assigning the CSF sample to your all donor VCF and see if there are better matched samples.
It may be also because of technical issues in setting Vireo, but I don't have much clue from the results above.
Yuanhua
Hi @huangyh09,
thanks a lot for your response.
As you suggested I ran the same sample (CSF) without the genotype information, using the genome1k.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz
.
It's not possible to say if the assignment is correct (so if the second cell is PNP56 and not PNP52), but at least the same cells that are assigned to PNP57
are all assigned to donor1
and the same is true for the other donors :+1: . So this increases my confidence that the --forceLearnGT
result is correct.
Sorry, I am not sure I udnerstood that. Do you mean fig_GT_Distance_estimated
or do I have to create that manually with Python?
So the fig_GT_distance_estimated
of the genotype without --forceLearnGT
looks like this:
with --forceLearnGT
:
and without genotyping
Sorry, also didn't understand the last part of your response.
I think running without genotyping was a very good idea, which increased my confidence that the --forceLearnGT
result is okay. But another verification would also be nice.
Thank you! Mischko
Thanks Mischko for the quick updates. Indeed, this supports the --forceLearnGT
results.
For the confusion heatmaps, yes I did mean doing it manually between the estimated donor genotypes from Vireo (e.g., from without genotype) vs the known genotype (your genotype/sub.vcf.gz
). Then, you can know how the CSF samples match with the genotyped donors.
Yuanhua
Hi Yuanhua,
thanks again for you help.
Ah now I got it.
res = vireoSNP.vcf.match_VCF_samples(
"/home/xxx/xxx/demultiplex/csf_pool_rna_8_deep_genotype/genotype_bulk/sub.vcf.gz",
"/home/xxx/xxx/demultiplex/csf_pool_rna_8_deep_commonSNPs/vireo_mode1/GT_donors.vireo.vcf.gz",
GT_tag1="PL",
GT_tag2="PL"
)
the resulting heatmap:
So not that clear is an in the tutorial. donor1 is probably PNP57 as I speculated based on the first few cells, but not very convincing and donor2 and donor3 even worse.
Any idea what the issue could be and how it could improve that? Is the forceLearnGT result trustworthy?
You wrote you may consider assigning the CSF sample to your all donor VCF
:
I am not sure I understood that correctly,
I tried tunning the above python script with cellSNP.cells.vcf.gz
instead of sub.vcf.gz
but the result was the same (only took much longer).
I also reran vireo with the genotype informa tion using cellSNP.cells.vcf.gz
instead of sub.vcf.gz
. Same result as in the first post (lots of unassigned and different from the forcelearnGT result).
Thanks, Mischko
Thanks Mischko. This is indeed concerning. The similarity is only moderately higher (diff=0.36) compared to 0.43, and that's why I suggest checking if these CSF donors from another batch (sample label swap), if you have multiple batch in your cohort. One way is simply changing the donor VCF from the four donors to all donors of your cohort, when you run Vireo.
Yuanhua
Thanks for your continous help Yuanhua.
Sorry, I cannot follow. My design is: 32 samples of 16 patients (each one has 1xCSF and 1xblood sample). I multiplexed 4 samples (either CSF or blood) and performed bulk RNA-seq of the blood samples for the genotyping. This worked nicely (if you consider number of unassigned as the metric) in all batches (blood and CSF). Only in this one CSF batch, which we are talking about, there were many unassigned. I found this strange because the demultiplexing of the blood batch using the same genotype information worked nicely, but the scRNA-seq quality of the corresponding CSF batch is also okay i think. So cellSNP.cells.vcf.gz is identical, sub.vcz.gz is different.
Could you explain in more detail how you would rerun vireo to check if the forceLearnGT results are valid? Not sure what you mean by changing the donor VCF to all donors and how that could help.
Thank you so much! Mischko
Thanks Mischko for clarifying this. I think your reasoning is all logical.
I don't have a strong clue yet but I would simply try demultiplexing the 4-donor CSF to all 16 patients (so the donor VCF should contain 16 donors) and see if any other patients may match better, just in case of some mysterious mismatch.
Yuanhua
Thanks Yuanhua.
Ah, now I understand. But this would be computationally quite expensive, and since all other batches worked well this is not that likely I think.
But thanks for your input, I think the idea with using commonSNP was really good, so I will continue with the forceLearnGT result and try to see if the scRNA-seq results are plausible biologically!
Agreed your plan is very reasonable.
Just a quick thought came into my mind that some cells may have a high proportion of ambient RNAs. In case you suspect too, you can try adding --callAmbientRNAs
argument in Vireo. You will see a file prop_ambient.tsv
containing each cell's proportional composition from the 4 donors. It is similar to the probability from each donor but different meaning. For example, for doublets, you will likely see a high proportion (around 0.5) from two donors. If you see a cell has an unneglectable proportion (e.g., >0.15) from more than two donors, you may consider the cell contains too much ambient RNA.
We implemented this two years ago but didn't really publish it. It was found comparable performance to the original strategy in detecting doublets but took longer running time. Nevertheless, it gives complementary information to the existing detection of doublets. https://vireosnp.readthedocs.io/en/latest/release.html#release-v0-5-3-28-03-2021
Yuanhua
Hi Yuanhua,
ah interesting option, never saw that before. And it's easy and fast to run
prop_ambient
genotype (without forceLearnGT)
with forceLearnGT
So if I understand correctly, in these first few cells ambient RNA is not the main problem, with the exception of AAAGATGCATTTCAGG-1
the result is pretty clear and mostly is in accordance with the without genotype version (which i think is wrong) and different from the forceLearnGT option. However, cells that are annotated doublets
in the genotype without forceLearnGT version don't really have a high proportion from multiple donors.
Best, Mischko
Hi,
we use bulk-RNA-seq, followed by cellsnp-lite (mode2b) to generate genotyping, and vireo to demultiplex. We multiplex 4 patients and this works quite well in most cases.
However, in one case there wer a lot of unassigned cells despite sufficient amount of SNPs.
(n optional because number of samples we have genotypie in all 4 patients)
When using forceLearnGT
there are much fewer unassigned, but the cell assignments are also very different, which conerns me.
I tried using a higher
--nInit=200
which lead to similar results, so the model seems to be stable?There are actually cerebrospinal fluid from patients, and we have PBMC from the same 4 patients. And here demultiplexing with the same genotype information works fine without forceLearnGT(as you would expect when all genotypes are available):
and
--forceLearnGT
leads to very simliar results:Maybe @huangyh09 can give me some guidance?
Thank you!!