single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0
73 stars 27 forks source link

Large number of `doublet` and `unassigned`, most lost from specific donor(s), after removing doublets #25

Open PeteHaitch opened 3 years ago

PeteHaitch commented 3 years ago

Hi,

I'm running cellsnp-lite (v1.2.0) and vireo (~v0.5.5~ v0.5.0 from pip) to demultiplex a 3' 10X scRNA-seq dataset with cells from 8 human donors. These cells were pooled and run over 3 separate 10x captures, so I have appended a unique capture-specific suffix to the CB cell barcode in the BAM file using appendCB (https://github.com/ruqianl/appendCB) to distinguish potential duplicate cell barcodes across captures. The donors are not evenly pooled; I expect 4 donors to have a similar, larger number of cells and the other 4 donors to have a similar, smaller number of cells.

I run cellsnp-lite + vireo roughly as follows:

REGIONSVCF="genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz"

cellsnp-lite --samFile ${BAM} \
             --outDir ${CELLSNPOUTDIR} \
             --regionsVCF ${REGIONSVCF} \
             --barcodeFile barcodes.txt \
             --nproc 20 \
             --minMAF 0.1 \
             --minCOUNT 20 \
             --gzip

vireo --cellData=${CELLSNPOUTDIR} \
      --nDonor=8 \
      --outDir=${VIREOOUTDIR}

[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13206657.4, -11809108.8, -11654733.3]
[vireo] allelic rate mean and concentrations:
[[0.028 0.451 0.941]]
[[13738413.9 12786890.7  5384179.4]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
10645   5468    2069    11714   13229   740     15329   3553
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet unassigned
9380    3082    31      10633   11835   120     14287   3186    8627    1565

You can see in the final few lines that lots of cells are 'lost' following the removal of doubblets, which particularly noticeable for donor1 donor2 and donor5 as a percentage of the initial assignments.

I'm looking for some suggestions on how I can debug this and for possible causes of this behaviour? This is not the only dataset for which I've seen this issue and I can share some other examples if that is helpful.

For the most part I've had a lot of success with vireo so thank you for developing it!

Cheers, Pete

huangyh09 commented 3 years ago

Hi Pete,

Thanks for sharing you experience. the results indeed look not convincing. I noticed that the cell numbers are quite high, this might make the model easier to stuck at local optima. One quick thing you can try is increase the number of initializations by --nInit, e.g., to 200. You can use -p to increase the number of processes for parallel computing, while it will use more memory.

Let me know how it goes. Yuanhua

PeteHaitch commented 3 years ago

Thanks, Yuanhua. I'm re-running with --nInit 200 and I'll report back the results.

PeteHaitch commented 3 years ago

Increasing to --nInit 200 gave me:

[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13267522.5, -11771899.8, -11602801.9]
[vireo] allelic rate mean and concentrations:
[[0.031 0.452 0.934]]
[[14038312.6 12274692.6  5596478.8]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
2777    3722    11969   1737    11350   1861    16069   13261
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet unassigned
2284    3205    10634   1478    9410    36      14445   11851   8455    948
[vireo] All done: 498 min 4.3 sec

Increasing to --nInit 1000 gave me:

[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13272912.3, -11791179.9, -11605879.4]
[vireo] allelic rate mean and concentrations:
[[0.027 0.45  0.937]]
[[13773588.1 12622406.3  5513489.7]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
15976   2273    4152    1516    10538   11619   13133   3538
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet unassigned
14430   25      3352    126     9379    10638   11839   3185    8578    1194

I'm a little concerned that varying --nInit changes results for 'rare' donors so much, e.g., although I might like to believe the --nInit 200 results because only 1 donor has < 1000 cells, I would naively expect that the --nInit 1000 results should be more accurate/precise/reliable.

Aside: the version of vireo I have (v0.5.5 installed from https://pypi.org/project/vireoSNP/) does not have the -p/--nproc option (it's not documented when vireo --help is run and setting it leads to an error vireo: error: no such option: --nproc)

huangyh09 commented 3 years ago

This discordance between --nInit 200 and --nInit 1000 is indeed concerned. You meant these 62746 are from 3 captures, so each has ~21K cells? Eventhough the number of doublets could still be high and some donors may become minor, both of which can increase the challenge in donor identification.

One possibly way is to perform the demultiplexing on each of the three sets separately, and link them. So in each set, the doublets may not be so high.

For the version of vireo, I guess you are still using an older version (you can type vireo in terminal). The PyPI sometimes fails to replace the older version, especially installed from github initially. In the v0.5.5, it should support --nproc.

Yuanhua

PeteHaitch commented 3 years ago

You meant these 62746 are from 3 captures, so each has ~21K cells?

Yes, that's correct.

One possibly way is to perform the demultiplexing on each of the three sets separately, and link them. So in each set, the doublets may not be so high.

This is how I initially performed the demultiplexing. However, I was unsure how to best link the donors between each capture. Could you please suggest how to do this using the vireo output (e.g., which files, fields,etc. to parse and link).

For the version of vireo, I guess you are still using an older version (you can type vireo in terminal). The PyPI sometimes fails to replace the older version, especially installed from github initially. In the v0.5.5, it should support --nproc.

Oh, that's surprising to me but you're correct, I am running vireo 0.5.0

(base) vc7-shared 501 % vireo
Welcome to vireoSNP v0.5.0!
huangyh09 commented 3 years ago

Good to know. You can use the vireoSNP.vcf.match_VCF_samples to align the output donor genotype files GT_donors.vireo.vcf.gz. Example usage here.

The genotype similarity between samples estimated from two batches can be used to diagnose the deconvolution.