single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

All cells unassigned in 10x snRNA-seq data #64

Open ninasachdev opened 2 years ago

ninasachdev commented 2 years ago

Hello, thank you so much for creating Vireo, it’s been very useful for us!

I am currently using cellsnp-lite and vireo to demultiplex my snRNA-seq data (a pool of 96 donors, all of them genotyped). I am running the following workflow:

cellsnp-lite mode 1a

cellsnp-lite -s $BAM -b $BARCODE -O $OUT_DIR -R $DONOR_VCF -p 20 --minMAF 0.1 --minCOUNT 20

subset donor vcf

bcftools view $DONOR_VCF -S $SAMPLES -Oz -o subset.vcf.gz

vireo mode 2

vireo -c $CELL_DATA -d subset.vcf.gz -o $OUT_DIR -t GT -N $n_donor

After vireo finishes running, all cells remain unassigned. I have also tried the following, but get the same results:

Do you have any ideas on how to troubleshoot this issue? Perhaps there is something wrong with the donor VCF file I am using?

yilevine commented 2 years ago

Hi,

I also have this problem.

I used the SNP array to genotype my 4 donors and got the vcf file. For now, I am trying to demultiplex the snRNA-seq data by vireo with this vcf file.

The code I used is the same as yours, and my unassigned rate is also pretty high. I read the similar issue #24 posted earlier. I think you can have a look at the donor_ids.tsv file to see whether only a few SNPs were used for demultiplexing.
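One way to do that check is sketched below. It assumes donor_ids.tsv is tab-separated with a header row naming the columns donor_id, prob_max and n_vars; the function name summarise_donor_ids is made up for illustration and is not part of vireo.

```shell
# Summarise a donor_ids.tsv file: how many cells are unassigned or doublets,
# and the mean number of SNPs (n_vars) and mean prob_max per cell.
# Assumption: tab-separated, with header columns donor_id, prob_max, n_vars.
summarise_donor_ids() {
    awk -F'\t' '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            n++
            count[$col["donor_id"]]++
            sum_vars += $col["n_vars"]
            sum_prob += $col["prob_max"]
        }
        END {
            if (n == 0) { print "no cells found"; exit 1 }
            printf "cells=%d unassigned=%d doublet=%d mean_n_vars=%.1f mean_prob_max=%.2f\n",
                n, count["unassigned"], count["doublet"], sum_vars / n, sum_prob / n
        }
    ' "$1"
}

# Usage (path is whatever was passed to vireo -o):
# summarise_donor_ids "$OUT_DIR/donor_ids.tsv"
```

A low mean n_vars alongside a high unassigned count would point towards too few informative SNPs per cell rather than a problem with the model itself.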

There is another option mentioned in #24. You can change the prob_max to a lower value. By default, it is 0.9. But unfortunately, I have not found a way to set this parameter. Hope @huangyh09 could give more details.

Yile

huangyh09 commented 2 years ago

Hi both,

Thanks for sharing the issue and your experience on this. The diagnostic strategies that Yile mentioned are a good way to check whether low per-cell coverage is the cause. It would be good to know how many SNPs are obtained per cell, on average. A large pool (e.g., 96 donors) may need more SNPs to distinguish donors. If that's the case, check whether the sequencing saturation is high enough (e.g., >70%); if not, you may need to sequence the library further.

Also, I just want to make sure: are the genotype VCF and the snRNA BAM on the same genome build (e.g., hg38)?
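A rough way to check the build is to compare contig names and lengths in the two file headers. The sketch below assumes the headers have already been dumped to text files with samtools view -H and bcftools view -h; compare_contigs is a hypothetical helper, not part of either tool. It only helps when the VCF header has ##contig lines, which some array-derived VCFs lack.

```shell
# Compare contig names and lengths between a BAM header and a VCF header.
# $1: text from `samtools view -H "$BAM"`
# $2: text from `bcftools view -h "$DONOR_VCF"`
# Assumption: the VCF header carries ##contig=<ID=...,length=...> lines.
compare_contigs() {
    awk '
        FNR == NR {                       # first file: BAM header
            if ($1 == "@SQ") {
                name = ""; len = ""
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) name = substr($i, 4)
                    if ($i ~ /^LN:/) len = substr($i, 4)
                }
                bam[name] = len
            }
            next
        }
        /^##contig=/ {                    # second file: VCF header
            name = ""; len = ""
            if (match($0, /ID=[^,>]+/)) name = substr($0, RSTART + 3, RLENGTH - 3)
            if (match($0, /length=[0-9]+/)) len = substr($0, RSTART + 7, RLENGTH - 7)
            if (!(name in bam)) print "missing_in_bam\t" name
            else if (len != "" && len != bam[name]) print "length_mismatch\t" name "\t" bam[name] "\t" len
            else shared++
        }
        END { print "matching_contigs\t" shared + 0 }
    ' "$1" "$2"
}

# Usage:
# samtools view -H "$BAM" > bam_header.txt
# bcftools view -h "$DONOR_VCF" > vcf_header.txt
# compare_contigs bam_header.txt vcf_header.txt
```

Any length_mismatch line suggests different builds, and many missing_in_bam lines often mean a naming difference such as chr1 versus 1. Matching names and lengths are consistent with the same build but do not prove it.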

In your command line above, I see you have both -d and -N. If subset.vcf.gz contains fewer donors than $n_donor, vireo will go into de-novo mode, which is more difficult for a large pool. You could try removing -N.
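To see which case applies, the number of samples in the subset VCF can be compared with the pool size. This is a minimal sketch: it assumes a sample list with one name per line (for example from bcftools query -l subset.vcf.gz), and check_donor_count is a hypothetical helper name.

```shell
# Compare the number of genotyped donors with the expected pool size.
# $1: file with one sample name per line,
#     e.g. from `bcftools query -l subset.vcf.gz > samples_in_vcf.txt`
# $2: expected number of donors in the pool (the value passed to -N)
check_donor_count() {
    n_vcf=$(grep -c . "$1")    # count non-empty lines = samples in the VCF
    n_pool=$2
    if [ "$n_vcf" -lt "$n_pool" ]; then
        echo "VCF has $n_vcf of $n_pool donors: the remaining $((n_pool - n_vcf)) would have to be inferred"
    elif [ "$n_vcf" -eq "$n_pool" ]; then
        echo "VCF has all $n_pool donors: -N can be dropped"
    else
        echo "VCF has $n_vcf donors but the pool has $n_pool: check the sample list used for subsetting"
    fi
}

# Usage:
# bcftools query -l subset.vcf.gz > samples_in_vcf.txt
# check_donor_count samples_in_vcf.txt "$n_donor"
```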

P.S., thanks for trying both cellSNP and cellsnp-lite, though they should give (almost) identical results.

Yuanhua

ninasachdev commented 2 years ago

Hi Yuanhua and Yile,

Thank you so much for your helpful responses!

The number of SNPs per cell in donor_ids.tsv, as well as the distribution of prob_max values, are quite low. The sequencing saturation is ~65% for this library, so it's a good point that it might not be high enough to demultiplex a large pool of donors.

We checked the genome build versions of the BAM file and the genotype VCF, and they appear to be different -- thanks for pointing this out! We'll realign our sample to the matching genome version, and hopefully that will resolve the issue.

Thank you again for your help!