single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Please help me pinpoint the cause of high unassigned/doublets #23

Closed mperalc closed 3 years ago

mperalc commented 3 years ago

Hello, thank you very much for creating Vireo and cellSNP, they are very useful. I nevertheless have an issue with the deconvolution of my current dataset, and I'd like your opinion on how (or whether) I can improve the donor matching, or whether the problem is intrinsic to the sequencing of the data.

I have the following design: iPSC-differentiated cells from 19 donors, differentiated in two batches under two different differentiation conditions (condition 1 and condition 2). Both batches started with the same number of donors, but the donor proportions might obviously differ at the end. Library A was created from condition 1, while condition 2 led to two libraries: library B and library C. Library C has 1.5x the cells of library B but is otherwise identical. The cDNA libraries were barcoded with Chromium barcodes during 10x sample processing, using the 5' v1 kit.

I have run the deconvolution pipeline while changing a few parameters in cellSNP/cellSNP-lite to see which would improve our results. In all cases I have used the merged genotype file for all donors included in the experiment (merging only the genotype files for those donors and not any other hipsci donor) as the genotype file, and for the reference SNP file ($snplist) I have varied the conditions: a 0.01 minimum MAF, a subsample of that file, or all SNPs. I have also altered the --minCOUNT or --minLEN parameters in some cases. I have used the same Vireo version and parameters throughout.
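For reference, a sketch of how such a merged donor VCF can be produced from the per-donor hipsci genotype files is below; the per-donor file names are placeholders, and this is not necessarily the exact command that was used:

# Per-donor VCFs must be bgzipped and indexed before merging (file names are placeholders).
for f in donor_*.vcf.gz; do bcftools index -t "$f"; done
bcftools merge donor_*.vcf.gz -Oz -o $genotypes
bcftools index -t $genotypes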

The cellSNP run was:

cellSNP -s $bamFile -b $barcodeFile -o $cellsnpOutput -R $snplist -p 20 \
        --minMAF 0.01 --minCOUNT 10 --maxFLAG 4096

The cellSNP-lite run was:

cellsnp-lite -s $bamFile -b $barcodeFile -O $cellsnpOutput -R $snplist -p 20 --minMAF [number] --minCOUNT [number] \
        --genotype --minLEN [number] --gzip

The Vireo run was always:

vireo -c $cellsnpVCF -d $subsetFile -t GT -o $outputDirectory

$subsetFile was created as follows:

bcftools view \
        $genotypes \
        -R $cellsnpVCF \
        -S $detectedDonors \
        -Oz \
        -o $subsetFile

To (hopefully) clarify:

$cellsnpVCF = cellSNP.cells.vcf.gz
$genotypes = the merged genotype VCF (which, in the case of no MAF filtering, was the same as $snplist). This comes from the actual donor genotype files from the hipsci portal.

The parameters and results, along with some sequencing details from cellRanger, are detailed in the following file: 20201Feb_Hipsci_lib1_4_5_deconvolution - Sheet3.pdf

My questions are:

I hope I have provided enough information for you to help me with this issue. Thank you very much again.

Best wishes, Marta

huangyh09 commented 3 years ago

Hi Marta,

Thanks for sharing your experiment designs and results.

1) The coverage (~5,000 reads/cell) is indeed a bit low. This may also result in a high proportion of cells being filtered out in the transcriptome analysis. Nonetheless, the demultiplexing can probably be improved by updating the $snplist and the donor $subsetFile.

2) For the donor $subsetFile, why do you get it from cellSNP.cells.vcf.gz? Could you provide more details on what you are actually using? Did you genotype the non-multiplexed (sc-)RNA-seq for these donors? I understand the hipsci portal has good resources for the donor genotypes, e.g., imputed from SNP arrays, WGS, or bulk RNA-seq. Are you using the donor VCFs from there?

3) For the $snplist, I think it makes more sense to filter on the donor VCF file; that is, you can filter out SNPs that are identical across all donors, as they are not informative for demultiplexing. Based on these informative SNPs, it is fine not to include --minMAF or --minCOUNT, although --minMAF 0.01 and --minCOUNT 5 are usually sensible settings. But from your results, turning off the thresholds actually helps the assignable rate.
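For example, one possible way to keep only SNPs that actually differ between the pooled donors is to compute the cohort MAF on the merged donor VCF and filter on it. This is just a sketch; the 0.05 cutoff and the output file names are example values, not a prescription from this thread:

# Annotate the merged donor VCF with the cohort MAF, then keep informative SNPs only.
bcftools +fill-tags $genotypes -Oz -o genotypes_maf.vcf.gz -- -t MAF
bcftools view -i 'MAF>0.05' -Oz -o donors_informative.vcf.gz genotypes_maf.vcf.gz
bcftools index -t donors_informative.vcf.gz
# donors_informative.vcf.gz could then be passed to cellSNP/cellsnp-lite as $snplist.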

As said, the donor VCF is critical here, so let's check this before going back to your last two questions.

Yuanhua

mperalc commented 3 years ago

Hello,

Thank you for your quick response. Sorry, I was mistaken before; I have updated the main text and added the following, showing the actual code used to create the $subsetFile:

bcftools view \
        $genotypes \
        -R $cellsnpVCF \
        -S $detectedDonors \
        -Oz \
        -o $subsetFile

To (hopefully) clarify:

$cellsnpVCF = cellSNP.cells.vcf.gz
$genotypes = the merged genotype VCF (which, in the case of no MAF filtering, was the same as $snplist). This comes from the actual donor genotype files from the hipsci portal.

Am I not subsetting the genotype file correctly? Should I even be subsetting it based on the cellSNP.cells regions? Thank you again and best wishes, Marta

huangyh09 commented 3 years ago

I see, thanks for clarifying this; it makes more sense now. The $subsetFile is fine, as only the SNPs in $cellsnpVCF will be used in vireo anyway.

In your results table, you varied the $snplist by the donor VCF's MAF, right? This is a good trial. I think donor VCF AF > 0.01 could be a good minimum filter (usually even higher, e.g., AF > 0.05), and the unfiltered $snplist may be risky, though it makes a clear difference. Also, it's interesting that the random 6M SNPs give such a different assignment. I wonder whether you could include --minMAF 0.05 in your setting to keep only informative SNPs. You didn't use --minMAF in cellsnp at all, right?
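To make that concrete, the rerun could look something like the sketch below, reusing the variable names from your commands above; the --minCOUNT value here is just an example, not a recommendation from this thread:

# Rerun cellsnp-lite keeping only SNPs with minor allele frequency above 0.05 in the cells.
cellsnp-lite -s $bamFile -b $barcodeFile -O $cellsnpOutput -R $snplist -p 20 \
        --minMAF 0.05 --minCOUNT 20 --genotype --gzip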

On the other hand, if your sequencing library is still available, maybe you could consider deeper sequencing. Is the sequencing saturation high?

mperalc commented 3 years ago

In your results table, you varied the $snplist by the donor VCF's MAF, right? Yes, that's correct.

You didn't use --minMAF in cellsnp at all, right? The random 6M SNPs from the $snplist had been pre-filtered for MAF > 0.01, so only SNPs above 0.01 MAF are in those 6M.

On the other hand, if your sequencing library is still available, maybe you could consider deeper sequencing. Is the sequencing saturation high? It's not, it's around 20%. We will need to do that if there is nothing else we can try in cellSNP + Vireo to improve these numbers. That's why I wanted to check with you that there was no obvious error in the files or the parameters used for the analysis. What sequencing depth would you recommend for deconvolution?

I re-ran the data with a minCOUNT of 5 using all SNPs, and it barely improved the results for library A. Curiously, now all cells in both library B and library C are considered doublets: 20201Feb_Hipsci_libA_B_C_deconvolution - updated.pdf

huangyh09 commented 3 years ago

Thanks for the clarification. In practice, your second column (9M SNPs + minCOUNT 20) is what I usually recommend. Given the low coverage and the large number of donors, I would suggest keeping this setting and using a more lenient cutoff on assignable cells. By default it uses prob_max > 0.9; I think it should be OK to use prob_max > 0.65 (or even 0.5) given the large number of donors. Alternatively, could you send the donor_ids.tsv to my email: yuanhua@hku.hk? I'm happy to have a look.
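If it helps, a quick sketch for counting how many cells each donor would get at a relaxed cutoff of 0.65 is below; it assumes the donor_ids.tsv header includes prob_max and best_singlet columns (please check the header of your own file first):

# Count cells per best-singlet donor with prob_max above a relaxed 0.65 cutoff;
# column positions are looked up from the header rather than hard-coded.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "prob_max") p = i; if ($i == "best_singlet") s = i }; next }
    $p > 0.65 { count[$s]++ }
    END { for (d in count) print d, count[d] }
' $outputDirectory/donor_ids.tsv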

In the vireo paper (Fig. 2F), we tried the equivalent of 600 UMIs/cell in an 8-donor pool, and it still worked fine, but we have never had data with 19 donors at such low coverage. I'm also curious whether this reaches vireo's limit.

mperalc commented 3 years ago

I have sent you the donor_ids.tsv along with the rest of the result files for each library. Thank you again.

huangyh09 commented 3 years ago

As we agreed that the low coverage / saturation may be the bottleneck, let's close the issue here. Feel free to follow up with updates.