single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0
74 stars 27 forks source link

ForceLearnGT leads to very different result #88

Closed mihem closed 1 year ago

mihem commented 1 year ago

Hi,

we use bulk-RNA-seq, followed by cellsnp-lite (mode2b) to generate genotyping, and vireo to demultiplex. We multiplex 4 patients and this works quite well in most cases.

However, in one case there wer a lot of unassigned cells despite sufficient amount of SNPs.

vireo -c demux_genotype -d genotype/sub.vcf.gz -o vireo -N 4

(n optional because number of samples we have genotypie in all 4 patients)

image

When using forceLearnGT

vireo -c demux_genotype -d genotype/sub.vcf.gz -o vireo -N 4 --forceLearnGT

there are much fewer unassigned, but the cell assignments are also very different, which conerns me.

image

I tried using a higher --nInit=200 which lead to similar results, so the model seems to be stable?

There are actually cerebrospinal fluid from patients, and we have PBMC from the same 4 patients. And here demultiplexing with the same genotype information works fine without forceLearnGT(as you would expect when all genotypes are available):

image

and --forceLearnGT leads to very simliar results:

image

Maybe @huangyh09 can give me some guidance?

Thank you!!

huangyh09 commented 1 year ago

Hi @mihem, thanks for sharing this issue and sorry for the long delay.

You are already very familiar with the method suite. Your second batch is from PBMC (right?) which indeed works very reasonably. For the first batch of CSF samples, it is certainly concerning. The n_variants look OK, but with and without --forceLearnGT are too different. I guess if you change it to no donor genotype mode, the results will be concordant with --forceLearnGT.

If that is the case, I would suggest performing some genotype diagnosis: you can compare the difference between your donor_VCF and the estimated genotype from CSF samples, so you will have a 4-by-4 genotype distance matrix. Something like this tutorial: https://nbviewer.org/github/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb

From your PBMC sample, your donor VCF should be OK. I guess the CSF sample is not likely to be label-swapped but if the above 4-by-4 matrix doesn't suggest a strong donor matching, you may consider assigning the CSF sample to your all donor VCF and see if there are better matched samples.

It may be also because of technical issues in setting Vireo, but I don't have much clue from the results above.

Yuanhua

mihem commented 1 year ago

Hi @huangyh09,

thanks a lot for your response.

As you suggested I ran the same sample (CSF) without the genotype information, using the genome1k.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz.

image

It's not possible to say if the assignment is correct (so if the second cell is PNP56 and not PNP52), but at least the same cells that are assigned to PNP57 are all assigned to donor1 and the same is true for the other donors :+1: . So this increases my confidence that the --forceLearnGT result is correct.

Sorry, I am not sure I udnerstood that. Do you mean fig_GT_Distance_estimated or do I have to create that manually with Python?

So the fig_GT_distance_estimated of the genotype without --forceLearnGT looks like this:

image

with --forceLearnGT:

image

and without genotyping

image

Sorry, also didn't understand the last part of your response.

I think running without genotyping was a very good idea, which increased my confidence that the --forceLearnGT result is okay. But another verification would also be nice.

Thank you! Mischko

huangyh09 commented 1 year ago

Thanks Mischko for the quick updates. Indeed, this supports the --forceLearnGT results.

For the confusion heatmaps, yes I did mean doing it manually between the estimated donor genotypes from Vireo (e.g., from without genotype) vs the known genotype (your genotype/sub.vcf.gz). Then, you can know how the CSF samples match with the genotyped donors.

Yuanhua

mihem commented 1 year ago

Hi Yuanhua,

thanks again for you help.

Ah now I got it.

res = vireoSNP.vcf.match_VCF_samples(
    "/home/xxx/xxx/demultiplex/csf_pool_rna_8_deep_genotype/genotype_bulk/sub.vcf.gz",
    "/home/xxx/xxx/demultiplex/csf_pool_rna_8_deep_commonSNPs/vireo_mode1/GT_donors.vireo.vcf.gz",
    GT_tag1="PL",
    GT_tag2="PL"
)

the resulting heatmap:

image

So not that clear is an in the tutorial. donor1 is probably PNP57 as I speculated based on the first few cells, but not very convincing and donor2 and donor3 even worse.

Any idea what the issue could be and how it could improve that? Is the forceLearnGT result trustworthy?

You wrote you may consider assigning the CSF sample to your all donor VCF: I am not sure I understood that correctly, I tried tunning the above python script with cellSNP.cells.vcf.gz instead of sub.vcf.gz but the result was the same (only took much longer). I also reran vireo with the genotype informa tion using cellSNP.cells.vcf.gz instead of sub.vcf.gz. Same result as in the first post (lots of unassigned and different from the forcelearnGT result).

Thanks, Mischko

huangyh09 commented 1 year ago

Thanks Mischko. This is indeed concerning. The similarity is only moderately higher (diff=0.36) compared to 0.43, and that's why I suggest checking if these CSF donors from another batch (sample label swap), if you have multiple batch in your cohort. One way is simply changing the donor VCF from the four donors to all donors of your cohort, when you run Vireo.

Yuanhua

mihem commented 1 year ago

Thanks for your continous help Yuanhua.

Sorry, I cannot follow. My design is: 32 samples of 16 patients (each one has 1xCSF and 1xblood sample). I multiplexed 4 samples (either CSF or blood) and performed bulk RNA-seq of the blood samples for the genotyping. This worked nicely (if you consider number of unassigned as the metric) in all batches (blood and CSF). Only in this one CSF batch, which we are talking about, there were many unassigned. I found this strange because the demultiplexing of the blood batch using the same genotype information worked nicely, but the scRNA-seq quality of the corresponding CSF batch is also okay i think. So cellSNP.cells.vcf.gz is identical, sub.vcz.gz is different.

Could you explain in more detail how you would rerun vireo to check if the forceLearnGT results are valid? Not sure what you mean by changing the donor VCF to all donors and how that could help.

Thank you so much! Mischko

huangyh09 commented 1 year ago

Thanks Mischko for clarifying this. I think your reasoning is all logical.

I don't have a strong clue yet but I would simply try demultiplexing the 4-donor CSF to all 16 patients (so the donor VCF should contain 16 donors) and see if any other patients may match better, just in case of some mysterious mismatch.

Yuanhua

mihem commented 1 year ago

Thanks Yuanhua.

Ah, now I understand. But this would be computationally quite expensive, and since all other batches worked well this is not that likely I think.

But thanks for your input, I think the idea with using commonSNP was really good, so I will continue with the forceLearnGT result and try to see if the scRNA-seq results are plausible biologically!

huangyh09 commented 1 year ago

Agreed your plan is very reasonable.

Just a quick thought came into my mind that some cells may have a high proportion of ambient RNAs. In case you suspect too, you can try adding --callAmbientRNAs argument in Vireo. You will see a file prop_ambient.tsv containing each cell's proportional composition from the 4 donors. It is similar to the probability from each donor but different meaning. For example, for doublets, you will likely see a high proportion (around 0.5) from two donors. If you see a cell has an unneglectable proportion (e.g., >0.15) from more than two donors, you may consider the cell contains too much ambient RNA.

We implemented this two years ago but didn't really publish it. It was found comparable performance to the original strategy in detecting doublets but took longer running time. Nevertheless, it gives complementary information to the existing detection of doublets. https://vireosnp.readthedocs.io/en/latest/release.html#release-v0-5-3-28-03-2021

Yuanhua

mihem commented 1 year ago

Hi Yuanhua,

ah interesting option, never saw that before. And it's easy and fast to run

prop_ambient image

genotype (without forceLearnGT)

image

with forceLearnGT

image

So if I understand correctly, in these first few cells ambient RNA is not the main problem, with the exception of AAAGATGCATTTCAGG-1 the result is pretty clear and mostly is in accordance with the without genotype version (which i think is wrong) and different from the forceLearnGT option. However, cells that are annotated doublets in the genotype without forceLearnGT version don't really have a high proportion from multiple donors.

Best, Mischko