Closed AnjaliC4 closed 3 years ago
Thanks for sharing the issue and the sample bam. I think you are right that the main issue is that cellsnp requires cell tag; it could be CB or RG, or anything else, but unfortunately, it doesn't support extracting from read name. So, you need to find a way to add cell tag from the the read name.
Yuanhua
Hi, I figured out a way to add barcode tag to my bam file, but I am still getting empty files. However, the -standard_output file says that pile up was successful. Could you please find the error, should I try lowering the thresholds?
This is my code to pile up chrX: (I am trying to test heterozygosity on chrX) cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX this is my bam file:
Hi, good to know you figured out a way to add cell barcode. By default, it also assumes UMI is available, but you data seems not. You can turn off the UMI, but count the reads directly by adding --UMItag None
.
In case it is too slow, you could try out new implementation with C/C++; both should give identical results. https://github.com/single-cell-genetics/cellsnp-lite
Yuanhua
Thanks, that worked! I am fairly new to this analysis. I have single-cell DNA data multiplexed males and females. My hypothesis was to call SNPs on chrX and separate females based on heterozygosity. My command: cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 20 --chrom chrX --UMItag None Could I simply count the number of reference+ alternate alleles on each barcode/cell for this analysis using cellSNP.tag.DP.mtx? With higher count being females. Moreover, with above parameters I am getting no SNPs for many cells, could I reduce minMAP and/or minCount? Please let me know should you have any suggestions. Thank you very much!
Thanks. For cells with no detected SNPs, yes you can lower the minMAP (e.g., to 0.03) and minCount (e.g., to 5), but I guess it may not help substantially. Another strategy is to include some (or all) autosomes. They can be used to cluster cells with same genotype, then you can use chrX to identify the gender.
Counting the reference+ alternate alleles may not be an ideal way, but you can use the genotype probability of heterozygosity on chrX. An alternative (and probably better) way is using Vireo to do it automatically on either chrX only or combination with autosomes. You can simply specify -C
to your cellsnp output folder. Afterwards, you can check the chrX heterozygosity of the estimated donor genotype, and see if the heterozygosity is purely in one donor. Let me know how it works.
Thanks, Yuanhua. I am trying these strategies and hopefully will get back to you with something interesting soon. Although, I am trying Vireo as well, but just so that I know my understanding of cellsnp output is correct: If I use AD matrix count for each cell divided by DP matrix count, it would give me allelic ratio (example as below), so would a higher allelic ratio correspond to heterozygosity?
snp depth allelic_ratio barcode
1 0 4 0.00000000 1 2 0 3 0.00000000 2 3 2 8 0.25000000 3 4 6 33 0.18181818 4
I am wondering what you mean by "use the genotype probability of heterozygosity on chrX", not sure how to calculate genotype probability of heterozygosity on ChrX using cellSNP output? Please advise.
I tried Vireo with chrX, chrY, chr20 and out of 4000 cells, it assigned ~300 to one donor and ~400 to another. I will repeat with all the chromosomes. Although again, how do I after this check the chrX heterozygosity of the estimated donor genotype?
Hi, your understanding is correct. You can use the allelic ratio, though it may have high variance due to the low coverage for some SNP. Alternatively, you could look at the cells.cellSNP.vcf.gz
, which has the PL tag for phred-scaled genotype likelihoods.
Similarly, if you run vireo, it will return GT_donors.vireo.vcf.gz
, where you can find the genotypes of identified donors.
Thanks, so I am using command: bcftools query -f '%CHROM %POS[\t%GT\t%PL]\n' cellSNP.cells.vcf.gz However, I am not getting info per barcode. How could I get PL and GT tags listed per barcode in a user-friendly format. Fore reference, from the command above, I am getting something like: chrX 319211 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,42 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0
I think sample names are in columns, but when I use GATK to conver Varianttotable, it throws many exceptions and doesn't work. I have tabix-indexed it. Not sure how to get a clean tsv or table with all samples and PL/GT tag. Thansks for your help resolving this.
Hi, you can try vireoSNP in Python: https://github.com/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb
import numpy as np
import vireoSNP
GT_tag0 = 'PL' # common ones: GT, GP, PL
vcf_dat0 = vireoSNP.vcf.load_VCF("../data/donors.cellSNP.vcf.gz",
biallelic_only=True, sparse=False,
format_list=[GT_tag0])
GPb0_var_ids = np.array(vcf_dat0['variants'])
GPb0_donor_ids = np.array(vcf_dat0['samples'])
GPb0_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat0['GenoINFO'][GT_tag0], GT_tag0)
GPb0_tensor.shape
For R, I used vcfR before and found it it works nicely for not-too-large vcf file. https://knausb.github.io/vcfR_documentation/matrices.html
vcfR_test <- vcfR::read.vcfR(vcf_file)
gt <- vcfR::extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE)
Y
Hi Yuanhua,
This is strategy Im following using cellSNP and vireo, but I am not sure if this is correct: Out of all the barcodes in the bam file, I could recognize some male and female barcodes (~100 each), I wanted to call SNPs for these known donors and use their SNPs to demultiplex the remaining pooled donors. For this,
I think there is something wrong with this as it is not demultiplexing any cells. Please let me know what should I correct. Thank you very much!
Hi,
The issue is the step 1 doesn't give donor level genotype, but the genotype for the ~100 cells. You need to aggregate them as one donor. Similarly for the other donor. So eventually, you -d
will be a VCF with two samples.
Unfortunately, cellSNP doesn't support output a donor level for subset of cells. You could either aggregate the ~100 cell VCF manually or you can subset you bam file with the ~100 cells and run cellSNP in mode2 for a pseudo-bulk setting.
Yuanhua
Ok thanks, that clears it up, but also from what I understand, vireo -d can accept only one file, so it should be one cellSNP.cells.vcf.gz with two donor gentoype files combined. I can subset the bam file for male donor and female donor separately and call SNPs separately in pseudobulk mode 2, and then I will have two donor vcf files, but how do I combine them for inputting -d for vireo? Thanks a lot! Anjali
if you subset the bam into two files, you can put them together to cellsnp, then it will return a VCF for two donors.
Hi again,
I created a vcf file of two donors (~100 each using pseudobulk bam files of two donors) using 1000genome reference SNPs. Then I piledup SNPs in pooled donors with --minMAF 0.05 --minCOUNT 2 parameters. Further, I used Vireo and 2 donors cellSNP.cells.vcf.gz file to demultiplex the pooled samples with -N 2 --forceLearnGT, however, out of 3718 cells, only 68 were categorized into female and 33 into male. (PS: I did not filter donor file using bcftools) 665 out 509051 variants matched to donor VCF. lower bound ranges [-5184.8, -5068.7, -5007.4] [vireo] allelic rate mean and concentrations: [[0.056 0.505 0.968]] [[2254.5 8321.1 1367.4]] [vireo] donor size before removing doublets: donor0 donor1 1812 2007 [vireo] final donor size: female male unassigned 68 33 3718
Why isnt Vireo recognizing more samples? Are all the unassigned predicted as doublets?
Additionally I get this error, which I guess is only related to the plotting function:
Thanks a lot for your suggestions and patience with this issue.
/home/.local/lib/python3.7/site-packages/vireoSNP/utils/io_utils.py:17: RuntimeWarning: invalid value encountered in greater_equal
if np.sum(mm_idx == mm_idx) == 0 or np.sum(mm_idx >= 0) == 0:
/home/anjali5/.local/lib/python3.7/site-packages/vireoSNP/utils/io_utils.py:20: RuntimeWarning: invalid value encountered in greater_equal
if np.sum(mm_idx == mm_idx) == 0 or np.sum(mm_idx >= 0) == 0:
Traceback (most recent call last):
File "/home/.local/bin/vireo", line 10, in
Hi, your procedures look reasonable. One wired thing is the too few match variants (665 out 509051 variants) between ~100 cells donors & all cells. As both are detected in the scRNA-seq, I would expect very high proportion of variants are matched. How many variants you detected from the 100 cells pool?
Also, how many donors in your whole library? Only two? If so, I think you could try vireo without donor genotype first and see how the ~100 cells distribute in the cell cluster identified by Vireo. This strategy will use more variants.
For the plotting issues, I'm not sure how it happens, as it generally work well in our environments. Maybe you could try upgrading matplotlib
.
Hi again, I tried simulating a multiplexed bam files of known non-multiplexed donors using
However, I wasn't sure if samtools merge or something else in this process would keep any individual library specific flags that may have helped cellSNP/vireo in demultiplexing them, or is it truly demultiplexing based only on the called SNPs? Is samtools merge a good way to simulated multiplxed bam files? I know there is your tool synth_pool.py, but I realized it needs region/barcode file. Thanks!
Also, for your above comment: I am actually testing this on scATAC-seq library and not scRNA-seq- so don't know if thats whats causing less SNPs to be matched.
Thansk for double check. It should work as it presents by onlying using SNPs' information. As mentioned in this issue https://github.com/single-cell-genetics/cellSNP/issues/10,
samtoools merge
can't resolve the issue that cell barcodes are overlapped between multiple samples, so the merged cell barcode may refer to multiple cells in different donors. If you want to add -1 and -2 to differentiate, you need to hack it in the bam file on the CB tag.
Thanks for the reponse, so instead of samtools merge, if I use: python synth_pool.py -s bam1,bam2 -b barcode1,barcode2 -r 1000genome.reference.SNPs -o directory From this, I should get a pooled bam file, then to generate vcf file for vireo, I should run cellSNP on mode 2- cellSNP -s $bulkBAM -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 and feed vcf to vireo for demultiplexing of two donors? Does this make sense? Additionally, should both the barcode files have -1 suffixed? I have generated it through a different pipeline so I do not have -1 in the barcodes, although I do have CB:Z tag in the bam files.
should be fine. with -1 and -2 suffix will help cellsnp-lite to distinguish them.
Hi The vcfR you had suggested, I tried it on chrX and chrY piledup chromosomes using cellsnp-lite. however, vcfR_test <- vcfR::read.vcfR(vcf_file) gt <- vcfR::extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE) or "PL" , they both give "NA" for all the cells, even though these many SNPs were called. 125731 chrX 947 chrY Please let me know if you can think of whats wrong. Thank you!!
Hi Yuanhua,
I have a few questions:
When piling up SNPs on individual chromosomes, without reference data such as 1000genome, what is the meaning of min allele frequency? Is it calculated per cell basis or per subject basis? Since every cell within an individual should have the same genotype, how is the MAF inferred when not using reference data?
Since, CellSNP does not take reference genome, how does mode2 to pileup whole chromosome without reference SNPs work?
This is again about Chromosome X heterozygosity, you suggested in the past that - PL <- vcfR::extract.gt(vcfR_test, element = 'PL', as.numeric = TRUE) could be used to calculate the likelihood of heterozygosity, but first thing is PL is usually 0:21:100 - something like this, so I am not sure what as.numeric is doing to these numbers- becuase that produces a single value from 0:21:100? Second how would one infer likelihood from this? I have found that using AD and DP alone are not very accurate, so could I simply count number of time GT = 0/1 in vcf file for each cells to estimate heterozygosity or is there a better way of doing this by using PL values?
I tried using Vireo to demultiplex/estimate chrX heterozygosity based on chrX alone, but it doesn't seem to work very well.
Thanks a lot for all your advise!
Hi,
Thanks for following up.
I wonder maybe it's easier to have a quick zoom chat to fix the potential issues you are facing. You can email me at yuanhua@hku.hk
Yuanhua
Hi, Dr. Yuanhua, Thanks. I have one question with cellSNP-lite posted on git, if you could please respond to that, it will be great. It was unrelated to this- so I started a new issue. For any further questions related to this issye, I will email you. Thank you for your awesome suggestions!! I will close this issue now.
Hi, Thanks for this nice tool!
I am trying to use this tool to pile up SNPs on chrX here is my code: cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX
Both AD.matrix and DP.mtx are empty. I suspect that it maybe because my bam file does not have -CB as sam tag but rather in the read name. Please see attached my bam file. Could you please suggest a workaround or if there was some other underlying problem? Thanks!
bam (1).txt