single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

hg38 chromsome names in 1000G VCF and 10X BAM #84

Closed jnktsj closed 1 year ago

jnktsj commented 1 year ago

Hi there, thanks for developing a great tool! I have two quick (and dumb 😓) questions about how cellsnp-lite handles chromosome names in a region VCF and BAM for mode 1a:


My 10X bam has "chr" prefix for the chromosomes, and the names are lexicographically sorted.

# header
@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
…
# reads
A01249:315:HLW5MDSX2:3:2358:31322:22576 99  chr1    10002   1   74M =   10007   105 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTAA  FFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFF:,FF:F,:FF::F:FF:F,,FFF:F,:F::F,FFF,,,FF  NH:i:3  HI:i:1  AS:i:162    nM:i:5  RG:Z:pM5600_1_BM_CD138neg_GEX_5:0:1:HLW5MDSX2:3 RE:A:I  xf:i:0  CR:Z:GATGAAATCAGCCTAA   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:GATGAAATCAGCCTAA-1 UR:Z:CCCTAACCCTUY:Z:FFFFFFFFFF  UB:Z:CCCTAACCCT
A01249:315:HLW5MDSX2:3:1175:17309:17942 99  chr1    10002   1   74M =   10007   105 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCAAA  FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FF:FF,FF,FF::,,FF:FFFFF,:,,,:  NH:i:3  HI:i:1  AS:i:162    nM:i:5  RG:Z:pM5600_1_BM_CD138neg_GEX_5:0:1:HLW5MDSX2:3 RE:A:I  xf:i:0  CR:Z:CGGACACTAACCCTAA   CY:Z:FFFFFFFFFF:FFFFF   UR:Z:CCCTAACCCT UY:Z:FFFFFFFFFF UB:Z:CCCTAACCCT
A01249:315:HLW5MDSX2:3:1265:9986:6417   99  chr1    10003   255 74M =   10006   103 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFF,FFFFFF,FF,F,FF:F  NH:i:1  HI:i:1  AS:i:158    nM:i:7  RG:Z:pM5600_1_BM_CD138neg_GEX_5:0:1:HLW5MDSX2:3 RE:A:I  xf:i:4  CR:Z:ACGCCGAAGAAACCTA   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:ACGCCGAAGAAACCTA-1 UR:Z:CCTAACCCTAUY:Z:FFFFFFFFFF  UB:Z:CCTAACCCTA
…

genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz doesn't have the "chr" prefix, and these chromosomes are sorted numerically.

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
…
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       11008   rs575272151     C       G       100     PASS    AF=0.0880591
1       11012   rs544419019     C       G       100     PASS    AF=0.0880591
1       13116   rs62635286      T       G       100     PASS    AF=0.0970447
…

Thanks in advance for your input!

hxj5 commented 1 year ago

Hi, thanks for the good questions. (1) For chromosome names: cellsnp-lite would internally remove the "chr" prefix (if available) for both BAM and VCF records after loading them. Therefore, you do not need to tweak the chromosome names in the two files if they only differ in the "chr" prefix. (2) you do not need to sort the chromosomes as their order in both files do not matter, as long as the BAM records have been sorted by coordinates, e.g,. with samtools sort, and there is an BAM index (.bai) file.

jnktsj commented 1 year ago

Hi Xianjie, thank you so much for getting back to me so quickly! I appreciate all your clear answers. I'm closing this issue.