single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0
71 stars 25 forks source link

Most cells are unassigned and a few of them do not match the filters #99

Open txemaheredia opened 5 months ago

txemaheredia commented 5 months ago

Hi,

first of all, thank you for this tool. It has really saved our ass in a different experiment where HTO-based demultiplexing failed. Thank you a lot.

I am running now vireo on two 10x sequencing runs, each containing 4 samples (mouse, littermates with WT/mutant genotypes, 3F/1M or 1F/3M in each run).

I created a VCF file from the single cell sequencing on each of the samples with:

samtools view \
                -@ 8 \
                --write-index \
                -D "${barcode_tag}":<(zcat "${barcodes_tsv_filename}") \
                -o "${output_bam_filename}" \
                "${input_bam_filename}"

cellsnp-lite -s $BAM -O $CELLSNP_OUTDIR -p 14 --minMAF 0.1 --minCOUNT 20 --cellTAG None --UMItag UB --maxDEPTH 500000 --gzip --chrom 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,MT,X,Y

#1a - 2nd step - genotyping
cellsnp-lite -s $BAM -b $BARCODES -O $CELLSNP_GENO_OUTDIR -R $CELLSNP_OUTDIR/cellSNP.base.vcf.gz -p 14 --minMAF 0.1 --minCOUNT 20 --gzip --chrom 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,MT,X,Y

This resulted in 60,797 SNV for the first sequencing run (25,647 cells) and 106,960 for the second one (19,207 cells).

Then I run vireo using the same command I had success with in a previous analysis:

N=2
vireo -c $CELLSNP_GENO_OUTDIR -N $N -o $VIREO_OUTDIR

However, the results I got have an overwhelming amount of "unassigned" samples:

vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509690.7, -3508383.5, -3506645.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.279 0.468]]
[[1676287.6 1369376.8 4033486.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6456    6757    6349    6085
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
2204    2358    2216    2013    179 16677
[vireo] All done: 6 min 23.1 sec

Looking a other posts here I also tried to run it with --callAmbientRNAs :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509676.9, -3508213.7, -3507234.5]
[vireo] allelic rate mean and concentrations:
[[0.117 0.281 0.467]]
[[1684690.7 1350985.6 4043474.7]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6631    6408    6236    6371
[vireo] 8151 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 2478.3 sec
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
2253    2160    2018    2106    169 16941
[vireo] All done: 52 min 13.5 sec

with --callAmbientRNAs -M 200 :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient_nInit200.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509645.1, -3508161.5, -3507201.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.275 0.468]]
[[1661061.8 1355583.4 4062505.7]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6225    6329    6656    6437
[vireo] 8145 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 402.3 sec
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
2084    2156    2264    2205    168 16770
[vireo] All done: 15 min 6.0 sec

And with --callAmbientRNAs -M 1000 :

$ cat nohup.cellsnp_vireo_v2_SSt_L1_4_ambient_nInit1000.out
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-3509715.5, -3508234.1, -3506775.0]
[vireo] allelic rate mean and concentrations:
[[0.117 0.281 0.468]]
[[1686010.2 1369964.7 4023176.1]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6414    6314    6523    6396
[vireo] 8155 out 60797 SNPs selected for ambient RNA detection: ELBO_gain > 53.4
[vireo] Ambient RNA time: 399.7 sec
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
2120    2140    2208    2174    163 16842
[vireo] All done: 44 min 58.3 sec

with identical results.

I also explored the results of the first run in R and I see the following distribution:

vireo_1

vireo_2

With most unassigned cells having moderate values of prob_max and prob_doublet < 0.25

Also, the values for the unassigned cells overlap those of donor-assigned cells:


> donor_min_prob_max = min(m[!m$donor_id %in% c("unassigned","doublet"),]$prob_max)
> donor_min_prob_max
[1] 0.9
> dim( m[m$donor_id=="unassigned" & m$prob_max >= donor_min_prob_max ,])
[1] 26  8
> 
> donor_max_LLR = max(m[!m$donor_id %in% c("unassigned","doublet"),]$doublet_logLikRatio)
> donor_max_LLR
[1] -0.734
> dim( m[m$donor_id=="unassigned" & m$doublet_logLikRatio <=donor_max_LLR ,])
[1] 5229    8

26 unassigned cells have prob_max = 0.9, which, after looking into the code (io_utils.py), I don't understand how were they deemed "unassigned" because they all have n_vars > 10.

It is possible that these cells "suffered" a lot during the processing and there is a lot of ambient RNA floating around. How should I deal with all this? Should I just use a lower prob_max threshold and maybe use a different doublet finder software down the line?

PS: running vireo with --noDoublet leaves 4,726 unassigned cells. Oh, and in the previous runs, 6,371 unassigned cells have their best_singlet not in their own best_doublet list.

huangyh09 commented 5 months ago

Hi, thanks for sending the detailed info (& trust).

The first thing I noticed is the unusual allele rate: [0.117 0.281 0.468], instead of around [0, 0.5, 1]. After reading your commands, one thought came into my mind:

To fix it, you can try randomly selecting half of the variants and flipping the ALT and REF alleles, for example by

Hope this may help you.

Yuanhua

txemaheredia commented 4 months ago

Thanks for the suggestions.

I tried a (crude version of) what you suggested flipping the ALT/REF for half the variants and got similarly bad results with a flipped allele rate:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 60797 variants.
[vireo] lower bound ranges [-1900092.9, -1899299.3, -1898714.2]
[vireo] allelic rate mean and concentrations:
[[0.551 0.858 1.   ]]
[[2305018.1 1158633.8 3615499.2]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6444    6317    6217    6669
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
1187    1165    1126    1220    35      20914
[vireo] All done: 6 min 10.8 sec

I have delved a bit into this, and it seems that these samples are first generation hybrids between C57BL6 and FVB mice strains. That means that vireo's underlying allele rate assumption will never be true for these samples.

I tried downloading the strain-specific VCFs from the Mouse Genomes Project and limit the analysis to either the merged SNV set, or the intersection SNV set between both strains.

Merged:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 28266 variants.
[vireo] lower bound ranges [-2385086.2, -2384185.6, -2382473.9]
[vireo] allelic rate mean and concentrations:
[[0.243 0.509 0.693]]
[[ 686241.5 3465441.8  501250.7]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6267    6582    6476    6321
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
1549    1659    1635    1542    182 19080
[vireo] All done: 3 min 41.6 sec

Decent amount of SNV, bad allelic rates, poor classification.

Intersect:

[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 280 variants.
[vireo] lower bound ranges [-28465.8, -26967.1, -26680.3]
[vireo] allelic rate mean and concentrations:
[[0.001 0.307 0.897]]
[[ 3680.8 46416.6  4968.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6431    6391    6392    6432
[vireo] final donor size:
donor0  donor1  donor2  donor3  unassigned
12  19  20  20  25576
[vireo] All done: 0 min 8.8 sec

Extremely low number of SNV, better allelic ratios, no donor classification power.

Can you think of any way to make vireo work with these kind of samples? Otherwise, do you know of any other similar tool that could make them work?

Thank you very much.

huangyh09 commented 4 months ago

For your last strategy, maybe you can double-check with a similar one used in this paper:

Demultiplexing of 10x Data Genotyping information for the C3H_HeJ, CAST_EiJ and C57BL_6NJ mouse strains were extracted from the Mouse Genome Project (Keane et al., 2011) dataset. The SNPs were filtered to identify those which w ere heterozygous in at least one of the three strains (25.7 million in total). These were used as candidates to genotype all of the cells in each pool using cellSNP v0.1.7 (Huang et al., 2019), parameters “–minMAF 0.1–minCOUNT 20.” 76,000 to 111,000 informative SNPs were obtained from the pooled scRNA-seq data, these were utilized further in Vireo v0.2.2 (Huang et al., 2019) in the genotype reference free mode with parameters “-N 4 -M 100” to de multiplex the pools. The estimated genotypes for these strains were mapped back to the three known genotypes from the Mouse Genome Project to link the cell lines to their parental mouse strain.

txemaheredia commented 4 months ago

I've just tried that:

bcftools view  -i 'GT="het"' ${invcf} | bgzip > $outvcf
$ zcat merged_heterozygous_FVB_C57BL6NJ.vcf.gz | grep -v "^#" | wc -l
573115
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 886 variants.
[vireo] lower bound ranges [-72189.6, -70781.1, -69979.7]
[vireo] allelic rate mean and concentrations:
[[0.018 0.33  0.918]]
[[ 18630.4 107414.   10785.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6463    6483    6323    6378
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
87      77      82      54      4       25343
[vireo] All done: 0 min 13.1 sec

Similarly to using only the "intersected" SNV, using these variants give good allelic ratios, but they are simply not enough of them to classify donors.

vireo_1_heterozygous

vireo_2_heterozygous

> m %>% 
     group_by(donor_id) %>% 
     summarize(min = min(n_vars), 
                         mean = mean(n_vars),
                         median = median(n_vars), 
                         max = max(n_vars))
# A tibble: 6 × 5
  donor_id     min  mean median   max
  <chr>      <int> <dbl>  <dbl> <int>
1 donor0        10 20.1    17      46
2 donor1        10 21.9    20      58
3 donor2        10 19.5    17      54
4 donor3        10 22.1    20.5    48
5 doublet       29 36.8    39      40
6 unassigned     0  3.85    3      61
yuantiaotiao commented 1 month ago

I've just tried that:

bcftools view  -i 'GT="het"' ${invcf} | bgzip > $outvcf
$ zcat merged_heterozygous_FVB_C57BL6NJ.vcf.gz | grep -v "^#" | wc -l
573115
[vireo] Loading cell folder ...
[vireo] Demultiplex 25647 cells to 4 donors with 886 variants.
[vireo] lower bound ranges [-72189.6, -70781.1, -69979.7]
[vireo] allelic rate mean and concentrations:
[[0.018 0.33  0.918]]
[[ 18630.4 107414.   10785.6]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3
6463    6483    6323    6378
[vireo] final donor size:
donor0  donor1  donor2  donor3  doublet unassigned
87      77      82      54      4       25343
[vireo] All done: 0 min 13.1 sec

Similarly to using only the "intersected" SNV, using these variants give good allelic ratios, but they are simply not enough of them to classify donors.

vireo_1_heterozygous

vireo_2_heterozygous

> m %>% 
     group_by(donor_id) %>% 
     summarize(min = min(n_vars), 
                         mean = mean(n_vars),
                         median = median(n_vars), 
                         max = max(n_vars))
# A tibble: 6 × 5
  donor_id     min  mean median   max
  <chr>      <int> <dbl>  <dbl> <int>
1 donor0        10 20.1    17      46
2 donor1        10 21.9    20      58
3 donor2        10 19.5    17      54
4 donor3        10 22.1    20.5    48
5 doublet       29 36.8    39      40
6 unassigned     0  3.85    3      61

Hello, I have also encountered this problem. Have you solved it

txemaheredia commented 1 month ago

No. We ended up considering these samples a lost cause and we threw them away.

We focused our analysis on a different set of samples that had a "better genetic background" and were only a mixture of 2 animals. We were able to use vireo + souporcell + sex gene information to demultiplex those samples.

huangyh09 commented 1 month ago

No. We ended up considering these samples a lost cause and we threw them away.

Sorry to hear this. If you want to share this data (email me the link yuanhua@hku.hk), I may give it a try when I have time and see if there is anything we can help.

Yuanhua