single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

Incorrect output when using BD Rhapsody / STAR solo input #79

Closed sbamopoulos closed 1 year ago

sbamopoulos commented 1 year ago

Hello cellsnp-lite team,

I am trying to run a preprocessing script of the numbat R package (pileup_and_phase.R), which internally calls cellsnp-lite with the following command:

cellsnp-lite -s file.BAM -b barcodes.txt -O pileup/01 -R test.vcf -p 32 --minMAF 0 --minCOUNT 2 --UMItag MR --cellTAG CB

The test VCF file contains the first 1000 SNPs from genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf (just for testing).

The output cellSNP.base.vcf looks like this:

##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 629218 . A G . PASS AD=3;DP=70;OTH=0
chr1 629482 . T C . PASS AD=0;DP=2;OTH=0
chr1 629626 . T C . PASS AD=0;DP=2;OTH=0
chr1 629906 . C T . PASS AD=7;DP=6689;OTH=21
chr1 630026 . C T . PASS AD=1;DP=6;OTH=0
chr1 630084 . T C . PASS AD=0;DP=4;OTH=0
chr1 630110 . T C . PASS AD=2;DP=2;OTH=1
chr1 630128 . G A . PASS AD=0;DP=3;OTH=0

I assumed there would be one column per cell barcode, but that is not the case.
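To check what such a file actually contains, the records can be parsed with a few lines of Python. This is only a minimal sketch: the function name `parse_base_vcf` is hypothetical, and the INFO keys (AD, DP, OTH) are simply the ones visible in the output above. Each record is reported together with its column count, which makes it easy to see that the file stops at the INFO column and carries no per-cell columns.

```python
def parse_base_vcf(lines):
    """Return one dict per VCF record line (header and blank lines are skipped)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '##'/'#' header lines
        fields = line.split()
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        counts = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")
            if value.isdigit():
                counts[key] = int(value)  # e.g. AD, DP, OTH site-level counts
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "info": counts,
            "n_columns": len(fields),  # 8 means no FORMAT/per-cell columns
        })
    return records


if __name__ == "__main__":
    # Two records copied from the output above.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM POS ID REF ALT QUAL FILTER INFO",
        "chr1 629218 . A G . PASS AD=3;DP=70;OTH=0",
        "chr1 629906 . C T . PASS AD=7;DP=6689;OTH=21",
    ]
    for record in parse_base_vcf(example):
        print(record)
```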

Some details on the BAM file used: sequencing was done on the BD Rhapsody platform and alignment with STARsolo. It is a multiplexed BAM file, where the CB:Z tag denotes cell barcodes, ST:Z denotes the samples (01-12), and MR the UMI. Alignment was done against Ensembl GRCh38 (GENCODE 29).

The barcode file is a plain-text file with one barcode per line (they are numbers in BD Rhapsody), like so:
850570
243761
39999
360647
589619
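A quick sanity check of a barcode list like this one can be sketched as follows. The helper name `load_barcodes` is hypothetical, and the checks (one token per line, no empty lines, no duplicates) are assumptions about what a clean barcode list looks like, not documented cellsnp-lite requirements. Barcodes are kept as strings so that numeric BD Rhapsody barcodes are compared as text, the way they appear in the CB tag.

```python
def load_barcodes(lines):
    """Return (barcodes, problems) for an iterable of barcode-file lines."""
    barcodes, problems, seen = [], [], set()
    for lineno, raw in enumerate(lines, start=1):
        barcode = raw.strip()
        if not barcode:
            problems.append(f"line {lineno}: empty line")
            continue
        if len(barcode.split()) > 1:
            problems.append(f"line {lineno}: more than one token")
            continue
        if barcode in seen:
            problems.append(f"line {lineno}: duplicate barcode {barcode}")
            continue
        seen.add(barcode)
        barcodes.append(barcode)  # kept as a string, never converted to int
    return barcodes, problems


if __name__ == "__main__":
    # The five barcodes quoted above.
    barcodes, problems = load_barcodes(
        ["850570", "243761", "39999", "360647", "589619"]
    )
    print(len(barcodes), "barcodes;", len(problems), "problems")
```

With a real file, the same function can be called as `load_barcodes(open("barcodes.txt"))`.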

Is there a specification for the BAM or barcode file that is not apparent to me from reading the documentation? I went over the cellSNP code, which is in Python and a little more readable for me, and it seems that if cell barcodes are provided, the sample IDs are skipped, so I would expect one column per barcode in the VCF file. However, this is not the case. Does the cell barcode file need a specific format? Does cellsnp-lite expect a tag that is not defined in my BAM file?

Any and all help is greatly appreciated!

Best,
Stefan

hxj5 commented 1 year ago

Hi Stefan,

Thanks for the feedback and the detailed information. The command line you provided indeed does not output one column per barcode in the VCF, but it does output the sparse matrices (AD, DP), which I suppose are sufficient for Numbat modelling. If you still want a VCF with one column per barcode, you can run cellsnp-lite with the additional parameter --genotype, which outputs an additional VCF file, cellSNP.cells.vcf, containing the per-barcode information.
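For illustration, the per-barcode counts can be recovered from the sparse matrices with a small amount of Python. This is a sketch under assumptions: it presumes the matrices are plain Matrix Market coordinate files with SNPs as rows and cells as columns, and that the column order matches the barcode list written alongside them. The function names `read_mtx` and `per_cell_counts` are hypothetical, and the demo data is a toy example, not real output. If SciPy is available, `scipy.io.mmread` reads the same format.

```python
def read_mtx(lines):
    """Parse Matrix Market coordinate text into (shape, {(row, col): value})."""
    shape = None
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the '%%MatrixMarket' banner and '%' comments
        parts = line.split()
        if shape is None:
            # First data line is the size line: n_rows n_cols n_nonzero
            shape = (int(parts[0]), int(parts[1]))
            continue
        # Remaining lines: 1-based row index, 1-based column index, count
        entries[(int(parts[0]), int(parts[1]))] = int(parts[2])
    return shape, entries


def per_cell_counts(ad_lines, dp_lines, barcodes):
    """Return (shape, rows) where rows are (snp_index, barcode, AD, DP)."""
    _, ad = read_mtx(ad_lines)
    shape, dp = read_mtx(dp_lines)
    rows = []
    for (snp, cell), depth in sorted(dp.items()):
        # AD is sparse too: a missing entry means zero ALT-supporting counts
        rows.append((snp, barcodes[cell - 1], ad.get((snp, cell), 0), depth))
    return shape, rows


if __name__ == "__main__":
    # Toy matrices for 2 SNPs x 3 cells (made up for the demo).
    banner = "%%MatrixMarket matrix coordinate integer general"
    ad = [banner, "2 3 1", "1 2 3"]
    dp = [banner, "2 3 2", "1 2 5", "2 3 4"]
    shape, rows = per_cell_counts(ad, dp, ["850570", "243761", "39999"])
    print(shape)
    for row in rows:
        print(row)
```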

Best,
Xianjie

sbamopoulos commented 1 year ago

Hi Xianjie,

Thank you for your speedy reply, and apologies for my late response. I do not require a column per barcode; this was an error on my part. The script I used fails further downstream due to another issue. You can close this issue.

Best,
Stefan