Chromosome naming in reference VCF files

apredeus commented 2 years ago

Hi all,

starting with Cell Ranger reference 2020-A, chromosome naming now follows Gencode convention (chr1-chr23, chrM) instead of ensembl style (1-23, MT). It's probably worth it to update the reference VCF files provided since it'll definitely cause confusion among users.

Cheers

-- Alex

hxj5 commented 2 years ago

Hi Alex,

Thanks for the feedback. Updating the file in place may cause confusion for users who use older version of reference conventions. Adding new reference VCF file for new conventions is an option, but it could be redundant if there are many versions of conventions. Alternatively, "bcftools annotate --rename-chrs" tool could be used to update the chromosome names.

Cheers, Xianjie

apredeus commented 2 years ago

Thank you! No worries, just wanted to give a heads-up. Thanks for pointing out the bcftools method too, didn't know about it!

apredeus commented 2 years ago

Hi @hxj5 - just wanted to ask a quick question.

I just ran the conversion of the reference VCF (which quite clearly worked - all main chromosomes are now renamed to chr1-chr22), and then ran cellSNP in "mode 1" (using this reference VCF) with a BAM file with the same naming convention (chr1 etc)

The results, however, are empty - and the header of cellSNP.cells.vcf.gz looks like this:

##fileformat=VCFv4.2
##source=cellSNP_v0.3.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=.,Description="Filter info not available">
##INFO=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##INFO=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##INFO=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##FORMAT=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=ALL,Number=5,Type=Integer,Description="total counts for all bases in order of A,C,G,T,N">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>

Do you have any idea how it might have happened? Can't seem to find the way to debug this... Thank you in advance!

P.S. My cellSNP is version 0.3.2 installed with pip.

hxj5 commented 2 years ago

Hi, for the "empty" VCF, could you share the log information, especially the error message?

I was also wondering what the type of the data is. Is it scRNA, or scDNA/scATAC? sometimes the empty output is due to improper setting of --UMItag for scDNA/scATAC data.

apredeus commented 2 years ago

The data is single nucleus RNA-seq aligned with STARsolo. There are no obvious errors in the log file, here's the concatenated version:

Processing sample Pla_HDBR11923125..
[cellSNP] mode 1: fetch given SNPs in 67849 single cells.
[cellSNP] loading the VCF file for given SNPs ...
[cellSNP] fetching 7352497 candidate variants ...
2.00% positions processed.
...............................
100.00% positions processed.

[cellSNP] fetched 7352497 variants, now merging temp files ... 
[cellSNP] 38 lines in final vcf file
[cellSNP] All done: 376 min 45.6 sec

hxj5 commented 2 years ago

Thanks for the information. seems no obvious error. Could you share your command line and a few lines of 1) the BAM records 2) the VCF file and 3) the barcode file?

apredeus commented 2 years ago

I've extracted the bam file of chromosome 22 of my sample. Here are the links to the bam file, barcode list, and the reference VCF:

https://www.dropbox.com/s/eynica8ftbzsdc4/barcodes.tsv https://www.dropbox.com/s/69p16aucuno39p0/chr22.bam https://www.dropbox.com/s/nd4oiiw9adzlwne/sorted.vcf.gz

Thank you for your help!

apredeus commented 2 years ago

Command line was as follows:

cellSNP -s $TAG.bam -b $TAG.cut2k_barcodes.tsv -O $TAG.cellsnp.out -R $REF -p 16 --minMAF 0.1 --minCOUNT 20

where REF is the reference VCF file.

hxj5 commented 2 years ago

Hi, seems the bam file does not contain the UR tag, which is the default UMI tag when setting --UMItag Auto and barcode file is given. You may try setting --UMItag UB as the bam file contains the UB tag.

apredeus commented 2 years ago

Thank you very much! UB is the tag for error-corrected UMI, while UR is one for raw. I guess that's the difference between Cell Ranger and STARsolo-generated BAM files.

I'll let you know if it worked.

apredeus commented 2 years ago

It did! 👍 (Totally forgot about this discussion.)

Thanks again for your help.

single-cell-genetics / cellSNP

Chromosome naming in reference VCF files #23