Open txemaheredia opened 5 months ago
Hi, thanks for sending the detailed info (& trust).
The first thing I noticed is the unusual allele rate: [0.117 0.281 0.468], instead of around [0, 0.5, 1]. After reading your commands, one thought came into my mind:
To fix it, you can try randomly selecting half of the variants and flipping the ALT and REF alleles, for example by
AD_new[idx_to_flip, :] = DP[idx_to_flip, :] - AD[idx_to_flip, :]
Hope this may help you.
Yuanhua
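The flip suggested above can be sketched as follows. This is only a minimal illustration, assuming AD and DP are dense NumPy arrays with variants in rows and cells in columns; the function name and the seed argument are made up for the example, and sparse matrices would need a slightly different assignment.

```python
import numpy as np

def flip_half_variants(AD, DP, seed=0):
    """Swap REF and ALT counts for a random half of the variants (rows)."""
    AD = np.asarray(AD)
    DP = np.asarray(DP)
    rng = np.random.default_rng(seed)
    n_variants = AD.shape[0]
    # Pick half of the variant indices without replacement.
    idx_to_flip = rng.choice(n_variants, size=n_variants // 2, replace=False)
    AD_new = AD.copy()
    # After the flip, the "ALT" count is what used to be the REF count.
    AD_new[idx_to_flip, :] = DP[idx_to_flip, :] - AD[idx_to_flip, :]
    return AD_new, idx_to_flip

# Toy counts: 4 variants x 2 cells (made-up numbers).
AD = np.array([[1, 0], [2, 3], [0, 0], [4, 1]])
DP = np.array([[1, 2], [5, 3], [1, 0], [4, 6]])
AD_new, idx_to_flip = flip_half_variants(AD, DP, seed=0)
```

DP is left unchanged, since the total depth per variant and cell does not depend on which allele is called ALT.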
Thanks for the suggestions.
I tried a crude version of what you suggested, flipping ALT/REF for half of the variants, and got similarly bad results, with a flipped allele rate:
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-1900092.9, -1899299.3, -1898714.2]
[vireo] allelic rate mean and concentrations:
[[0.551 0.858 1. ]]
[[2305018.1 1158633.8 3615499.2]]
[vireo] donor size before removing doublets:
donor0 donor1 donor2 donor3
6444 6317 6217 6669
[vireo] final donor size:
donor0 donor1 donor2 donor3 doublet unassigned
1187 1165 1126 1220 35 20914
[vireo] All done: 6 min 10.8 sec
I have delved a bit into this, and it seems that these samples are first-generation hybrids between the C57BL6 and FVB mouse strains. That means that vireo's underlying allele rate assumption will never hold for these samples.
I tried downloading the strain-specific VCFs from the Mouse Genomes Project and limiting the analysis to either the merged SNV set or the intersection SNV set between both strains.
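A related consequence can be illustrated with a toy example. This is a sketch under a strong assumption, namely that both parental strains are fully inbred (homozygous at every site); genotypes are coded as ALT allele counts (0, 1, 2), and the function names and genotype values are invented for the illustration. Under that assumption every first-generation animal gets the same genotype at every strain-derived site, so such sites cannot tell littermates apart.

```python
def f1_genotype(parent_a, parent_b):
    """ALT allele count of an F1 animal from two homozygous parents.

    Each inbred parent can only pass on one kind of allele, so the
    offspring carries parent_a // 2 + parent_b // 2 ALT alleles.
    """
    assert parent_a in (0, 2) and parent_b in (0, 2), "parents must be homozygous"
    return parent_a // 2 + parent_b // 2

def count_informative(genotypes_by_donor):
    """Count variants at which at least two donors differ in genotype."""
    return sum(len(set(site)) > 1 for site in zip(*genotypes_by_donor))

# Hypothetical genotypes at five SNVs for the two parental strains.
strain_a = [0, 0, 2, 0, 2]
strain_b = [2, 0, 2, 2, 0]

one_f1 = [f1_genotype(a, b) for a, b in zip(strain_a, strain_b)]
littermates = [one_f1] * 4  # four F1 animals share the same genotypes
n_informative = count_informative(littermates)
```

In this toy setting n_informative is zero: only sites that still segregate within a strain, or new mutations, could separate the donors.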
Merged:
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 28266 variants.
[vireo] lower bound ranges [-2385086.2, -2384185.6, -2382473.9]
[vireo] allelic rate mean and concentrations:
[[0.243 0.509 0.693]]
[[ 686241.5 3465441.8 501250.7]]
[vireo] donor size before removing doublets:
donor0 donor1 donor2 donor3
6267 6582 6476 6321
[vireo] final donor size:
donor0 donor1 donor2 donor3 doublet unassigned
1549 1659 1635 1542 182 19080
[vireo] All done: 3 min 41.6 sec
A decent number of SNVs, bad allelic rates, poor classification.
Intersect:
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 280 variants.
[vireo] lower bound ranges [-28465.8, -26967.1, -26680.3]
[vireo] allelic rate mean and concentrations:
[[0.001 0.307 0.897]]
[[ 3680.8 46416.6 4968.6]]
[vireo] donor size before removing doublets:
donor0 donor1 donor2 donor3
6431 6391 6392 6432
[vireo] final donor size:
donor0 donor1 donor2 donor3 unassigned
12 19 20 20 25576
[vireo] All done: 0 min 8.8 sec
Extremely low number of SNVs, better allelic ratios, no donor classification power.
Can you think of any way to make vireo work with this kind of sample? Otherwise, do you know of any similar tool that could handle them?
Thank you very much.
For your last strategy, maybe you can double-check with a similar one used in this paper:
Demultiplexing of 10x Data
Genotyping information for the C3H_HeJ, CAST_EiJ and C57BL_6NJ mouse strains were extracted from the Mouse Genome Project (Keane et al., 2011) dataset. The SNPs were filtered to identify those which were heterozygous in at least one of the three strains (25.7 million in total). These were used as candidates to genotype all of the cells in each pool using cellSNP v0.1.7 (Huang et al., 2019), parameters “--minMAF 0.1 --minCOUNT 20”. 76,000 to 111,000 informative SNPs were obtained from the pooled scRNA-seq data, these were utilized further in Vireo v0.2.2 (Huang et al., 2019) in the genotype reference free mode with parameters “-N 4 -M 100” to demultiplex the pools. The estimated genotypes for these strains were mapped back to the three known genotypes from the Mouse Genome Project to link the cell lines to their parental mouse strain.
I've just tried that:
bcftools view -i 'GT="het"' ${invcf} | bgzip > $outvcf
$ zcat merged_heterozygous_FVB_C57BL6NJ.vcf.gz | grep -v "^#" | wc -l
573115
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 886 variants.
[vireo] lower bound ranges [-72189.6, -70781.1, -69979.7]
[vireo] allelic rate mean and concentrations:
[[0.018 0.33 0.918]]
[[ 18630.4 107414. 10785.6]]
[vireo] donor size before removing doublets:
donor0 donor1 donor2 donor3
6463 6483 6323 6378
[vireo] final donor size:
donor0 donor1 donor2 donor3 doublet unassigned
87 77 82 54 4 25343
[vireo] All done: 0 min 13.1 sec
As with the "intersected" SNVs, these variants give good allelic ratios, but there are simply not enough of them to classify donors.
> m %>%
    group_by(donor_id) %>%
    summarize(min = min(n_vars),
              mean = mean(n_vars),
              median = median(n_vars),
              max = max(n_vars))
# A tibble: 6 × 5
  donor_id     min  mean median   max
  <chr>      <int> <dbl>  <dbl> <int>
1 donor0        10 20.1    17      46
2 donor1        10 21.9    20      58
3 donor2        10 19.5    17      54
4 donor3        10 22.1    20.5    48
5 doublet       29 36.8    39      40
6 unassigned     0  3.85    3      61
Hello, I have also encountered this problem. Have you solved it?
No. We ended up considering these samples a lost cause and we threw them away.
We focused our analysis on a different set of samples that had a "better genetic background" and were only a mixture of 2 animals. We were able to use vireo + souporcell + sex gene information to demultiplex those samples.
Sorry to hear this. If you want to share this data (email me the link at yuanhua@hku.hk), I may give it a try when I have time and see if there is anything we can help with.
Yuanhua
Hi,
first of all, thank you for this tool. It has really saved our ass in a different experiment where HTO-based demultiplexing failed. Thank you a lot.
I am now running vireo on two 10x sequencing runs, each containing 4 samples (mouse, littermates with WT/mutant genotypes, 3F/1M or 1F/3M in each run).
I created a VCF file from the single cell sequencing on each of the samples with:
This resulted in 60,797 SNV for the first sequencing run (25,647 cells) and 106,960 for the second one (19,207 cells).
Then I ran vireo using the same command I had success with in a previous analysis:
However, the results I got have an overwhelming number of "unassigned" cells:
Looking at other posts here, I also tried running it with --callAmbientRNAs, with --callAmbientRNAs -M 200, and with --callAmbientRNAs -M 1000, all with identical results.
I also explored the results of the first run in R and I see the following distribution:
With most unassigned cells having moderate values of prob_max and prob_doublet < 0.25
Also, the values for the unassigned cells overlap those of donor-assigned cells:
26 unassigned cells have prob_max = 0.9. After looking into the code (io_utils.py), I don't understand how they were deemed "unassigned", because they all have n_vars > 10.
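The labelling rule in question can be sketched as below. This is a hypothetical re-implementation for discussion, not the code in io_utils.py: the cut-offs (0.9 for prob_max and prob_doublet, 10 for n_vars) and the order of the checks are assumptions based on the description above. One thing it illustrates is that a value displayed as 0.9 after rounding can still sit just below a strict 0.9 cut-off.

```python
def label_cell(prob_max, prob_doublet, n_vars, best_singlet,
               min_prob=0.9, min_doublet_prob=0.9, min_vars=10):
    """Label one cell (hypothetical rule; all thresholds are assumptions)."""
    if n_vars < min_vars:
        return "unassigned"  # too few covered variants
    if prob_doublet >= min_doublet_prob:
        return "doublet"
    if prob_max < min_prob:
        return "unassigned"  # no single donor is confident enough
    return best_singlet

# 0.8996 is shown as 0.9 when rounded to three decimals,
# but it is still below the 0.9 cut-off.
label_default = label_cell(0.8996, 0.05, 25, "donor1")
# Relaxing the cut-off (an illustrative value) assigns the same cell.
label_relaxed = label_cell(0.8996, 0.05, 25, "donor1", min_prob=0.8)
```

Comparing the unrounded prob_max values against the cut-off would show whether rounding explains those 26 cells.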
It is possible that these cells "suffered" a lot during the processing and there is a lot of ambient RNA floating around. How should I deal with all this? Should I just use a lower prob_max threshold and maybe use a different doublet finder software down the line?
PS: running vireo with --noDoublet leaves 4,726 unassigned cells. Oh, and in the previous runs, 6,371 unassigned cells have their best_singlet not in their own best_doublet list.
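For anyone wanting to reproduce the last count, a check along these lines should work. It is a sketch that assumes a tab-separated table with donor_id, best_singlet and best_doublet columns, where best_doublet holds a comma-separated donor pair; the example rows are made up and the function name is hypothetical.

```python
import csv
import io

def singlet_outside_doublet(rows):
    """Return rows whose best_singlet is not one of the donors in best_doublet."""
    return [r for r in rows
            if r["best_singlet"] not in r["best_doublet"].split(",")]

# Made-up example table, not real vireo output.
example = io.StringIO(
    "cell\tdonor_id\tbest_singlet\tbest_doublet\n"
    "c1\tunassigned\tdonor0\tdonor1,donor2\n"
    "c2\tunassigned\tdonor3\tdonor0,donor3\n"
    "c3\tdonor1\tdonor1\tdonor1,donor2\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
unassigned = [r for r in rows if r["donor_id"] == "unassigned"]
mismatched = singlet_outside_doublet(unassigned)
```

Splitting on the comma before testing membership avoids false matches from substring comparison, for example donor1 inside donor10.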