single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0
71 stars 27 forks source link

Why those cells are unassigned? #106

Open tewengtong opened 2 months ago

tewengtong commented 2 months ago

Dear team,

Thank you for providing the tool to demultiplex. There are 16 donors in one pool in our data. We have paired genotype to demultiplex. My input vcf contains 10w exonic SNPs. I ran Vireo with Mode 1a, it seems there are still some unassigned cells. Some of them are low-coverage cells, but others look fine to me according to their metrics:

[vireo] Loading cell folder ... [vireo] Loading donor VCF file ... [vireo] 106759 out 106759 variants matched to donor VCF [vireo] Demultiplex 27546 cells to 16 donors with 106759 variants. [vireo] lower bound ranges [-11356250.4, -11356250.4, -11356250.4] [vireo] allelic rate mean and concentrations: [[0.086 0.456 0.857]] [[13504538.2 8695336.1 4566475.7]] [vireo] donor size before removing doublets: donor0 donor1 donor2 donor3 donor4 donor5 donor6 donor7 donor8 donor9 donor10 donor11 donor12 donor13 donor14 donor15 2362 2047 3153 805 694 2378 672 2050 1355 2735 1024 2085 1059 1910 2332 885 [vireo] 49612 out 106759 SNPs selected for ambient RNA detection: ELBO_gain > 55.3 [vireo] Ambient RNA time: 29202.8 sec [vireo] final donor size: H073 H078 H111 H134 H145 H152 H204 H288 H345 H346 H351 H356 H358 H360 H388 H389 doublet unassigned 1165 1636 1382 1728 1223 363 1156 1543 1135 364 1491 546 752 679 496 316 7276 4295 [vireo] All done: 488 min 57.0 sec

unassigned.filter[1:20,c("donor_id","prob_max","prob_doublet","n_vars")] donor_id prob_max prob_doublet n_vars AAACCGACAAATTGCC-1 unassigned 3.20e-01 0.680 1438 AAACGTAAGCTCGTGG-1 unassigned 6.33e-01 0.253 349 AAACTAGAGGCCCTGA-1 unassigned 7.44e-04 0.499 209 AAAGCCGAGCGAATTT-1 unassigned 3.90e-17 0.544 928 AAAGCCTGTATTGAGC-1 unassigned 1.70e-01 0.264 126 AAAGCGAGTACCTTGC-1 unassigned 1.28e-01 0.640 205 AAAGCGTAGCGCACAT-1 unassigned 3.99e-03 0.519 282 AAAGCTAAGCCCTAGG-1 unassigned 6.60e-01 0.340 1077 AAAGGAGAGAAACGCG-1 unassigned 2.30e-01 0.250 105 AAAGGCCCACCGCTAA-1 unassigned 3.27e-01 0.245 80 AAAGGGGCAGCGGTAG-1 unassigned 6.71e-01 0.315 217 AAAGGTATCCAGGATC-1 unassigned 1.61e-125 0.662 4138

My question is: Why those cells are still estimated as unassigned cells even if the n_vars are high? Are those doublets? Should I include those cells for downstream analysis or just delete those cells (should around ~2,000 cells, 10% of the whole library)?

Thank you so much!

huangyh09 commented 1 month ago

Hi, thanks for sharing your experience. Indeed, this looks a bit intriguing. Overall, it looks like working reasonably with having the mean allelic rate of [0.086 0.456 0.857] (while it can be closer to the theoretical value, especially 0.857).

In your experiment setting (if assuming a conventional doublet rate, 27K cells for 27% double rate), you may expect a bit more doublets but recent techniques may be improved. Nonetheless, the unassigned cells are indeed too many. Possibly, you may compare the estimated genotype (with mode 4 adding --forceLearnGT) to the known genotype and see if there is any obvious mismatch between donors.

You can turn off the --callAmbientRNAs to save the major running time.

Yuanhua

tewengtong commented 1 month ago

Dear Yuanhua,

Thank you so much for your patience and suggestions. Indeed, we applied 10X GEM-X 5' library kit for our sample. According to 10X, the doublet rate could be half. However, our samples are human adipose tissue, which may expect to have more doublets.

According to your suggestions. I tried both mode 2 (with GT for all donors) and mode 4(with --forceLearnGT).

For mode 2 my script is: vireo -c "./B1_L1/" -d "filtered_16_samples_GT.vcf.gz" -o "v7" -t GT

For mode 4: vireo -c "./B1_L1/" -d "filtered_16_samples_GT.vcf.gz" -o ./B1_L1/" -t GT --forceLearnGT

Mode 2 identified 4162 unassigned cells and mode 4 identified 3484.

As you suggested to compare their genotype, indeed, I do find several variants are not consistent in the estimated genotype and my known genotype. Could you please give me some suggestions on: 1) Does it mean my genotype file is problematic? 2) Which one should I refer to or what can I do?

Also, I tried to check the potential reasons myself in several ways:

  1. I compared the matrices distribution generated from two modes: image

It seems there are neglectable differences between the results.

  1. Then, I looked at the unassigned identified by mode2 but not in mode 4: image

It seems that mode 4 returned more "symmetrical" and "extreme" results compared to mode 2. Do you know why is that and what can i do?

  1. I also checked what kind of unassigned cells identified by mode2 are, and I found that:

1) 42.8% of them are cells with low count or low genes expressed or high mitochondrial genes.

2) 14.3% of them has a contamination rate > 0.5. Since our samples are from human adipose tissue, i think contamination may be a reason.

3) Other 43% cells: I have no idea what they are:

image

Do you have any idea about this plot? Do you think I can set a threshold for singlets and doublets at 0.7 instead of 0.9?

Thank you so much for your patience! You opinion values a lot to us!

Best, Yihan

huangyh09 commented 1 month ago

Hi, thanks for the detailed follow-up and sorry for the delay. Unfortunately, I don't have concrete answers. Here are some thoughts:

Yuanhua