single-cell-genetics / cellSNP

Pileup biallelic SNPs from single-cell and bulk RNA-seq data
Apache License 2.0
74 stars 11 forks source link

Empty AD&DP mtx #20

Closed AnjaliC4 closed 3 years ago

AnjaliC4 commented 3 years ago

Hi, Thanks for this nice tool!

I am trying to use this tool to pile up SNPs on chrX here is my code: cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX

Both AD.matrix and DP.mtx are empty. I suspect that it maybe because my bam file does not have -CB as sam tag but rather in the read name. Please see attached my bam file. Could you please suggest a workaround or if there was some other underlying problem? Thanks!

bam (1).txt

huangyh09 commented 3 years ago

Thanks for sharing the issue and the sample bam. I think you are right that the main issue is that cellsnp requires cell tag; it could be CB or RG, or anything else, but unfortunately, it doesn't support extracting from read name. So, you need to find a way to add cell tag from the the read name.

Yuanhua

AnjaliC4 commented 3 years ago

Hi, I figured out a way to add barcode tag to my bam file, but I am still getting empty files. However, the -standard_output file says that pile up was successful. Could you please find the error, should I try lowering the thresholds?

Screen Shot 2021-01-27 at 3 26 20 PM

This is my code to pile up chrX: (I am trying to test heterozygosity on chrX) cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX this is my bam file:

Screen Shot 2021-01-27 at 3 29 59 PM
huangyh09 commented 3 years ago

Hi, good to know you figured out a way to add cell barcode. By default, it also assumes UMI is available, but you data seems not. You can turn off the UMI, but count the reads directly by adding --UMItag None.

In case it is too slow, you could try out new implementation with C/C++; both should give identical results. https://github.com/single-cell-genetics/cellsnp-lite

Yuanhua

AnjaliC4 commented 3 years ago

Thanks, that worked! I am fairly new to this analysis. I have single-cell DNA data multiplexed males and females. My hypothesis was to call SNPs on chrX and separate females based on heterozygosity. My command: cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 20 --chrom chrX --UMItag None Could I simply count the number of reference+ alternate alleles on each barcode/cell for this analysis using cellSNP.tag.DP.mtx? With higher count being females. Moreover, with above parameters I am getting no SNPs for many cells, could I reduce minMAP and/or minCount? Please let me know should you have any suggestions. Thank you very much!

huangyh09 commented 3 years ago

Thanks. For cells with no detected SNPs, yes you can lower the minMAP (e.g., to 0.03) and minCount (e.g., to 5), but I guess it may not help substantially. Another strategy is to include some (or all) autosomes. They can be used to cluster cells with same genotype, then you can use chrX to identify the gender.

Counting the reference+ alternate alleles may not be an ideal way, but you can use the genotype probability of heterozygosity on chrX. An alternative (and probably better) way is using Vireo to do it automatically on either chrX only or combination with autosomes. You can simply specify -C to your cellsnp output folder. Afterwards, you can check the chrX heterozygosity of the estimated donor genotype, and see if the heterozygosity is purely in one donor. Let me know how it works.

AnjaliC4 commented 3 years ago

Thanks, Yuanhua. I am trying these strategies and hopefully will get back to you with something interesting soon. Although, I am trying Vireo as well, but just so that I know my understanding of cellsnp output is correct: If I use AD matrix count for each cell divided by DP matrix count, it would give me allelic ratio (example as below), so would a higher allelic ratio correspond to heterozygosity?

   snp   depth     allelic_ratio    barcode

1 0 4 0.00000000 1 2 0 3 0.00000000 2 3 2 8 0.25000000 3 4 6 33 0.18181818 4

I am wondering what you mean by "use the genotype probability of heterozygosity on chrX", not sure how to calculate genotype probability of heterozygosity on ChrX using cellSNP output? Please advise.

I tried Vireo with chrX, chrY, chr20 and out of 4000 cells, it assigned ~300 to one donor and ~400 to another. I will repeat with all the chromosomes. Although again, how do I after this check the chrX heterozygosity of the estimated donor genotype?

huangyh09 commented 3 years ago

Hi, your understanding is correct. You can use the allelic ratio, though it may have high variance due to the low coverage for some SNP. Alternatively, you could look at the cells.cellSNP.vcf.gz, which has the PL tag for phred-scaled genotype likelihoods.

Similarly, if you run vireo, it will return GT_donors.vireo.vcf.gz, where you can find the genotypes of identified donors.

AnjaliC4 commented 3 years ago

Thanks, so I am using command: bcftools query -f '%CHROM %POS[\t%GT\t%PL]\n' cellSNP.cells.vcf.gz However, I am not getting info per barcode. How could I get PL and GT tags listed per barcode in a user-friendly format. Fore reference, from the command above, I am getting something like: chrX 319211 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,42 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0

I think sample names are in columns, but when I use GATK to conver Varianttotable, it throws many exceptions and doesn't work. I have tabix-indexed it. Not sure how to get a clean tsv or table with all samples and PL/GT tag. Thansks for your help resolving this.

huangyh09 commented 3 years ago

Hi, you can try vireoSNP in Python: https://github.com/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb

import numpy as np
import vireoSNP

GT_tag0 = 'PL' # common ones: GT, GP, PL
vcf_dat0 = vireoSNP.vcf.load_VCF("../data/donors.cellSNP.vcf.gz",
                                 biallelic_only=True, sparse=False, 
                                 format_list=[GT_tag0])

GPb0_var_ids = np.array(vcf_dat0['variants'])
GPb0_donor_ids = np.array(vcf_dat0['samples'])
GPb0_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat0['GenoINFO'][GT_tag0], GT_tag0)
GPb0_tensor.shape

For R, I used vcfR before and found it it works nicely for not-too-large vcf file. https://knausb.github.io/vcfR_documentation/matrices.html

vcfR_test <- vcfR::read.vcfR(vcf_file)
gt <- vcfR::extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE)

Y

AnjaliC4 commented 3 years ago

Hi Yuanhua,

This is strategy Im following using cellSNP and vireo, but I am not sure if this is correct: Out of all the barcodes in the bam file, I could recognize some male and female barcodes (~100 each), I wanted to call SNPs for these known donors and use their SNPs to demultiplex the remaining pooled donors. For this,

  1. Call SNPs for the known male and female barcodes (~100 male +100 female) separately so there are two donor files cellSNP -s $BAM_file -b $BARCODE_known -O $OUT_DIR -p 22 --minMAF 0.05 --minCOUNT 2 -R common_human_snps
  2. Call SNPs in the bam file with all the barcodes (~4000 cells)- these include the known identify donors cellSNP -s $BAM_file -b $BARCODE -O $OUT_DIR -p 22 --minMAF 0.05 --minCOUNT 2 -R common_human_snps
  3. Then I used vireo to demultiplex donors - I repeat it for both male and female donors separetly vireo -c $CELL_DATA (from step2) -d cellSNP.cells.vcf.gz (from step1) -o $OUT_DIR --forceLearnGT

I think there is something wrong with this as it is not demultiplexing any cells. Please let me know what should I correct. Thank you very much!

huangyh09 commented 3 years ago

Hi,

The issue is the step 1 doesn't give donor level genotype, but the genotype for the ~100 cells. You need to aggregate them as one donor. Similarly for the other donor. So eventually, you -d will be a VCF with two samples.

Unfortunately, cellSNP doesn't support output a donor level for subset of cells. You could either aggregate the ~100 cell VCF manually or you can subset you bam file with the ~100 cells and run cellSNP in mode2 for a pseudo-bulk setting.

Yuanhua

AnjaliC4 commented 3 years ago

Ok thanks, that clears it up, but also from what I understand, vireo -d can accept only one file, so it should be one cellSNP.cells.vcf.gz with two donor gentoype files combined. I can subset the bam file for male donor and female donor separately and call SNPs separately in pseudobulk mode 2, and then I will have two donor vcf files, but how do I combine them for inputting -d for vireo? Thanks a lot! Anjali

huangyh09 commented 3 years ago

if you subset the bam into two files, you can put them together to cellsnp, then it will return a VCF for two donors.

AnjaliC4 commented 3 years ago

Hi again,

I created a vcf file of two donors (~100 each using pseudobulk bam files of two donors) using 1000genome reference SNPs. Then I piledup SNPs in pooled donors with --minMAF 0.05 --minCOUNT 2 parameters. Further, I used Vireo and 2 donors cellSNP.cells.vcf.gz file to demultiplex the pooled samples with -N 2 --forceLearnGT, however, out of 3718 cells, only 68 were categorized into female and 33 into male. (PS: I did not filter donor file using bcftools) 665 out 509051 variants matched to donor VCF. lower bound ranges [-5184.8, -5068.7, -5007.4] [vireo] allelic rate mean and concentrations: [[0.056 0.505 0.968]] [[2254.5 8321.1 1367.4]] [vireo] donor size before removing doublets: donor0 donor1 1812 2007 [vireo] final donor size: female male unassigned 68 33 3718

Why isnt Vireo recognizing more samples? Are all the unassigned predicted as doublets?

Additionally I get this error, which I guess is only related to the plotting function:

Thanks a lot for your suggestions and patience with this issue.

/home/.local/lib/python3.7/site-packages/vireoSNP/utils/io_utils.py:17: RuntimeWarning: invalid value encountered in greater_equal if np.sum(mm_idx == mm_idx) == 0 or np.sum(mm_idx >= 0) == 0: /home/anjali5/.local/lib/python3.7/site-packages/vireoSNP/utils/io_utils.py:20: RuntimeWarning: invalid value encountered in greater_equal if np.sum(mm_idx == mm_idx) == 0 or np.sum(mm_idx >= 0) == 0: Traceback (most recent call last): File "/home/.local/bin/vireo", line 10, in sys.exit(main()) File "/home/.local/lib/python3.7/site-packages/vireoSNP/vireo.py", line 201, in main donor_GPb[idx, :, :], donor_vcf['samples']) File "/home/.local/lib/python3.7/site-packages/vireoSNP/plot/base_plot.py", line 54, in plot_GT fig = plt.figure() File "/home/.local/lib/python3.7/site-packages/matplotlib/pyplot.py", line 546, in figure **kwargs) File "/home/.local/lib/python3.7/site-packages/matplotlib/backend_bases.py", line 3337, in new_figure_manager return cls.new_figure_manager_given_figure(num, fig) File "/home/.local/lib/python3.7/site-packages/matplotlib/backends/_backend_tk.py", line 876, in new_figure_manager_given_figure window = tk.Tk(className="matplotlib") File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/tkinter/init.py", line 2023, in init self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use) _tkinter.TclError: couldn't connect to display "localhost:34.0"

huangyh09 commented 3 years ago

Hi, your procedures look reasonable. One wired thing is the too few match variants (665 out 509051 variants) between ~100 cells donors & all cells. As both are detected in the scRNA-seq, I would expect very high proportion of variants are matched. How many variants you detected from the 100 cells pool?

Also, how many donors in your whole library? Only two? If so, I think you could try vireo without donor genotype first and see how the ~100 cells distribute in the cell cluster identified by Vireo. This strategy will use more variants.

For the plotting issues, I'm not sure how it happens, as it generally work well in our environments. Maybe you could try upgrading matplotlib.

AnjaliC4 commented 3 years ago

Hi again, I tried simulating a multiplexed bam files of known non-multiplexed donors using

  1. samtools merge bam1 bam2.
  2. concatenated cell barcodes that were called on both the libraries individually (I .know it can merge some common barcodes, but wasn't sure if i could just add -1 and -2 to differentiate, since that would not be the case in an actual multiplexed library)
  3. Then used cellSNP to pileup SNPs on this merged bam using concatenated barcode file
  4. and finally vireo to demultiplex them to two donors, which shows decent accuracy.

However, I wasn't sure if samtools merge or something else in this process would keep any individual library specific flags that may have helped cellSNP/vireo in demultiplexing them, or is it truly demultiplexing based only on the called SNPs? Is samtools merge a good way to simulated multiplxed bam files? I know there is your tool synth_pool.py, but I realized it needs region/barcode file. Thanks!

Also, for your above comment: I am actually testing this on scATAC-seq library and not scRNA-seq- so don't know if thats whats causing less SNPs to be matched.

huangyh09 commented 3 years ago

Thansk for double check. It should work as it presents by onlying using SNPs' information. As mentioned in this issue https://github.com/single-cell-genetics/cellSNP/issues/10, samtoools merge can't resolve the issue that cell barcodes are overlapped between multiple samples, so the merged cell barcode may refer to multiple cells in different donors. If you want to add -1 and -2 to differentiate, you need to hack it in the bam file on the CB tag.

AnjaliC4 commented 3 years ago

Thanks for the reponse, so instead of samtools merge, if I use: python synth_pool.py -s bam1,bam2 -b barcode1,barcode2 -r 1000genome.reference.SNPs -o directory From this, I should get a pooled bam file, then to generate vcf file for vireo, I should run cellSNP on mode 2- cellSNP -s $bulkBAM -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 and feed vcf to vireo for demultiplexing of two donors? Does this make sense? Additionally, should both the barcode files have -1 suffixed? I have generated it through a different pipeline so I do not have -1 in the barcodes, although I do have CB:Z tag in the bam files.

huangyh09 commented 3 years ago

should be fine. with -1 and -2 suffix will help cellsnp-lite to distinguish them.

AnjaliC4 commented 3 years ago

Hi The vcfR you had suggested, I tried it on chrX and chrY piledup chromosomes using cellsnp-lite. however, vcfR_test <- vcfR::read.vcfR(vcf_file) gt <- vcfR::extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE) or "PL" , they both give "NA" for all the cells, even though these many SNPs were called. 125731 chrX 947 chrY Please let me know if you can think of whats wrong. Thank you!!

AnjaliC4 commented 3 years ago

Hi Yuanhua,

I have a few questions:

  1. When piling up SNPs on individual chromosomes, without reference data such as 1000genome, what is the meaning of min allele frequency? Is it calculated per cell basis or per subject basis? Since every cell within an individual should have the same genotype, how is the MAF inferred when not using reference data?

  2. Since, CellSNP does not take reference genome, how does mode2 to pileup whole chromosome without reference SNPs work?

  3. This is again about Chromosome X heterozygosity, you suggested in the past that - PL <- vcfR::extract.gt(vcfR_test, element = 'PL', as.numeric = TRUE) could be used to calculate the likelihood of heterozygosity, but first thing is PL is usually 0:21:100 - something like this, so I am not sure what as.numeric is doing to these numbers- becuase that produces a single value from 0:21:100? Second how would one infer likelihood from this? I have found that using AD and DP alone are not very accurate, so could I simply count number of time GT = 0/1 in vcf file for each cells to estimate heterozygosity or is there a better way of doing this by using PL values?

I tried using Vireo to demultiplex/estimate chrX heterozygosity based on chrX alone, but it doesn't seem to work very well.

Thanks a lot for all your advise!

huangyh09 commented 3 years ago

Hi,

Thanks for following up.

  1. AF refers to the averaged AF across all cells. It aims to filter out positions for homozygous GT for all cells.
  2. The program goes through each position, and count reads for each allele, and uses the one with highest count as reference, and the second as ALT allele.
  3. PL is a standard term for "Phred-scaled likelihood". For Phred-scaled score, see more here: https://en.wikipedia.org/wiki/Phred_quality_score

I wonder maybe it's easier to have a quick zoom chat to fix the potential issues you are facing. You can email me at yuanhua@hku.hk

Yuanhua

AnjaliC4 commented 3 years ago

Hi, Dr. Yuanhua, Thanks. I have one question with cellSNP-lite posted on git, if you could please respond to that, it will be great. It was unrelated to this- so I started a new issue. For any further questions related to this issye, I will email you. Thank you for your awesome suggestions!! I will close this issue now.