statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
116 stars 25 forks source link

Different Chromosomal Notation in VCF and BAM Files #65

Open katimbach opened 4 years ago

katimbach commented 4 years ago

Hello,

I am currently running into a bit of issues running demuxlet*, particularly in regards to the non-uniformity of the BAM and VCF files. I manually created my VCF file from genotyping data from 6 different patients using plink. My BAM files were generated from 10X, and when first trying to feed them into demuxlet I ran into errors due to different chromosome ordering (10X generates lexicographical ordering of chromosomes). I was able to reorder the chromosomes in my vcf file to match using: cat merged_vcf.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' > merged_sorted_vcf.vcf

However, looking at the chromosome labels of each file I find that while 10X labels chromosomes as MT, X, Y plink uses 22-26 (http://seqanswers.com/forums/showthread.php?t=70046). Will this pose further issue when trying to run demuxlet? And if so how can I relabel these in my VCF file??

katimbach commented 4 years ago

Update: even after reordering my vcf file I am still experiencing the following error: FATAL ERROR - [E:int32_t main(int32_t, char**)] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has 19 before 2, but VCF/BCF file has 19 after 2

However, when I examine the order of the vcf I get:

fileformat=VCFv4.2

fileDate=20200513

source=PLINKv1.90

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional

FORMAT=

CHROM

0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 3 4 5 6 7 8 9

and when I examine the order of the BAM with samtools I get this: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y KI270728.1 KI270727.1 KI270442.1 KI270729.1 GL000225.1 KI270743.1 GL000008.2 GL000009.2 KI270747.1 KI270722.1 GL000194.1 KI270742.1 GL000205.2 GL000195.1 KI270736.1 KI270733.1 GL000224.1 GL000219.1 KI270719.1 GL000216.2 KI270712.1 KI270706.1 KI270725.1 KI270744.1 KI270734.1 GL000213.1 GL000220.1 KI270715.1 GL000218.1 KI270749.1 KI270741.1 GL000221.1 KI270716.1 KI270731.1 KI270751.1 KI270750.1 KI270519.1 GL000214.1 KI270708.1 KI270730.1 KI270438.1 KI270737.1 KI270721.1 KI270738.1 KI270748.1 KI270435.1 GL000208.1 KI270538.1 KI270756.1 KI270739.1 KI270757.1 KI270709.1 KI270746.1 KI270753.1 KI270589.1 KI270726.1 KI270735.1 KI270711.1 KI270745.1 KI270714.1 KI270732.1 KI270713.1 KI270754.1 KI270710.1 KI270717.1 KI270724.1 KI270720.1 KI270723.1 KI270718.1 KI270317.1 KI270740.1 KI270755.1 KI270707.1 KI270579.1 KI270752.1 KI270512.1 KI270322.1 GL000226.1 KI270311.1 KI270366.1 KI270511.1 KI270448.1 KI270521.1 KI270581.1 KI270582.1 KI270515.1 KI270588.1 KI270591.1 KI270522.1 KI270507.1 KI270590.1 KI270584.1 KI270320.1 KI270382.1 KI270468.1 KI270467.1 KI270362.1 KI270517.1 KI270593.1 KI270528.1 KI270587.1 KI270364.1 KI270371.1 KI270333.1 KI270374.1 KI270411.1 KI270414.1 KI270510.1 KI270390.1 KI270375.1 KI270420.1 KI270509.1 KI270315.1 KI270302.1 KI270518.1 KI270530.1 KI270304.1 KI270418.1 KI270424.1 KI270417.1 KI270508.1 KI270303.1 KI270381.1 KI270529.1 KI270425.1 KI270396.1 KI270363.1 KI270386.1 KI270465.1 KI270383.1 KI270384.1 KI270330.1 KI270372.1 KI270548.1 KI270580.1 KI270387.1 KI270391.1 KI270305.1 KI270373.1 KI270422.1 KI270316.1 KI270340.1 KI270338.1 KI270583.1 KI270334.1 KI270429.1 KI270393.1 KI270516.1 KI270389.1 KI270466.1 KI270388.1 KI270544.1 KI270310.1 KI270412.1 KI270395.1 KI270376.1 KI270337.1 KI270335.1 KI270378.1 KI270379.1 KI270329.1 KI270419.1 KI270336.1 KI270312.1 KI270539.1 KI270385.1 KI270423.1 KI270392.1 KI270394.1

It seems that in both cases 19 comes before 2, so I don't understand why this error is still emerging.

hyunminkang commented 4 years ago

Your contig has to be named according to the chromosome names. Currently VCF and BAM are not compatible.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Wed, May 20, 2020 at 2:23 PM katimbach notifications@github.com wrote:

Update: even after reordering my vcf file I am still experiencing the following error: FATAL ERROR - [E:int32_t main(int32_t, char**)] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has 19 before 2, but VCF/BCF file has 19 after 2

However, when I examine the order of the vcf I get:

fileformat=VCFv4.2

fileDate=20200513

source=PLINKv1.90

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional

FORMAT=

CHROM

0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 3 4 5 6 7 8 9

and when I examine the order of the BAM with samtools I get this: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y KI270728.1 KI270727.1 KI270442.1 KI270729.1 GL000225.1 KI270743.1 GL000008.2 GL000009.2 KI270747.1 KI270722.1 GL000194.1 KI270742.1 GL000205.2 GL000195.1 KI270736.1 KI270733.1 GL000224.1 GL000219.1 KI270719.1 GL000216.2 KI270712.1 KI270706.1 KI270725.1 KI270744.1 KI270734.1 GL000213.1 GL000220.1 KI270715.1 GL000218.1 KI270749.1 KI270741.1 GL000221.1 KI270716.1 KI270731.1 KI270751.1 KI270750.1 KI270519.1 GL000214.1 KI270708.1 KI270730.1 KI270438.1 KI270737.1 KI270721.1 KI270738.1 KI270748.1 KI270435.1 GL000208.1 KI270538.1 KI270756.1 KI270739.1 KI270757.1 KI270709.1 KI270746.1 KI270753.1 KI270589.1 KI270726.1 KI270735.1 KI270711.1 KI270745.1 KI270714.1 KI270732.1 KI270713.1 KI270754.1 KI270710.1 KI270717.1 KI270724.1 KI270720.1 KI270723.1 KI270718.1 KI270317.1 KI270740.1 KI270755.1 KI270707.1 KI270579.1 KI270752.1 KI270512.1 KI270322.1 GL000226.1 KI270311.1 KI270366.1 KI270511.1 KI270448.1 KI270521.1 KI270581.1 KI270582.1 KI270515.1 KI270588.1 KI270591.1 KI270522.1 KI270507.1 KI270590.1 KI270584.1 KI270320.1 KI270382.1 KI270468.1 KI270467.1 KI270362.1 KI270517.1 KI270593.1 KI270528.1 KI270587.1 KI270364.1 KI270371.1 KI270333.1 KI270374.1 KI270411.1 KI270414.1 KI270510.1 KI270390.1 KI270375.1 KI270420.1 KI270509.1 KI270315.1 KI270302.1 KI270518.1 KI270530.1 KI270304.1 KI270418.1 KI270424.1 KI270417.1 KI270508.1 KI270303.1 KI270381.1 KI270529.1 KI270425.1 KI270396.1 KI270363.1 KI270386.1 KI270465.1 KI270383.1 KI270384.1 KI270330.1 KI270372.1 KI270548.1 KI270580.1 KI270387.1 KI270391.1 KI270305.1 KI270373.1 KI270422.1 KI270316.1 KI270340.1 KI270338.1 KI270583.1 KI270334.1 KI270429.1 KI270393.1 KI270516.1 KI270389.1 KI270466.1 KI270388.1 KI270544.1 KI270310.1 KI270412.1 KI270395.1 KI270376.1 KI270337.1 KI270335.1 KI270378.1 KI270379.1 KI270329.1 KI270419.1 KI270336.1 KI270312.1 KI270539.1 KI270385.1 KI270423.1 KI270392.1 KI270394.1

It seems that in both cases 19 comes before 2, so I don't understand why this error is still emerging.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/65#issuecomment-631644376, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPY5OJWRNK4RI54RI7FRXTRSQNZTANCNFSM4NGFNONA .

katimbach commented 4 years ago

Can I mediate this simply by rearranging the contigs to be in the same order as the bam file with http://samtools.github.io/bcftools/bcftools.html#reheader ? And will the naming of autosomal chromosomes between the files cause issues, as mentioned in the original question? If so how do I rename them back to X, Y, MT etc

guijie2015 commented 3 years ago

Update: even after reordering my vcf file I am still experiencing the following error: FATAL ERROR - [E:int32_t main(int32_t, char**)] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has 19 before 2, but VCF/BCF file has 19 after 2

However, when I examine the order of the vcf I get:

fileformat=VCFv4.2

fileDate=20200513

source=PLINKv1.90

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional

FORMAT=

CHROM

0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 3 4 5 6 7 8 9

and when I examine the order of the BAM with samtools I get this: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y KI270728.1 KI270727.1 KI270442.1 KI270729.1 GL000225.1 KI270743.1 GL000008.2 GL000009.2 KI270747.1 KI270722.1 GL000194.1 KI270742.1 GL000205.2 GL000195.1 KI270736.1 KI270733.1 GL000224.1 GL000219.1 KI270719.1 GL000216.2 KI270712.1 KI270706.1 KI270725.1 KI270744.1 KI270734.1 GL000213.1 GL000220.1 KI270715.1 GL000218.1 KI270749.1 KI270741.1 GL000221.1 KI270716.1 KI270731.1 KI270751.1 KI270750.1 KI270519.1 GL000214.1 KI270708.1 KI270730.1 KI270438.1 KI270737.1 KI270721.1 KI270738.1 KI270748.1 KI270435.1 GL000208.1 KI270538.1 KI270756.1 KI270739.1 KI270757.1 KI270709.1 KI270746.1 KI270753.1 KI270589.1 KI270726.1 KI270735.1 KI270711.1 KI270745.1 KI270714.1 KI270732.1 KI270713.1 KI270754.1 KI270710.1 KI270717.1 KI270724.1 KI270720.1 KI270723.1 KI270718.1 KI270317.1 KI270740.1 KI270755.1 KI270707.1 KI270579.1 KI270752.1 KI270512.1 KI270322.1 GL000226.1 KI270311.1 KI270366.1 KI270511.1 KI270448.1 KI270521.1 KI270581.1 KI270582.1 KI270515.1 KI270588.1 KI270591.1 KI270522.1 KI270507.1 KI270590.1 KI270584.1 KI270320.1 KI270382.1 KI270468.1 KI270467.1 KI270362.1 KI270517.1 KI270593.1 KI270528.1 KI270587.1 KI270364.1 KI270371.1 KI270333.1 KI270374.1 KI270411.1 KI270414.1 KI270510.1 KI270390.1 KI270375.1 KI270420.1 KI270509.1 KI270315.1 KI270302.1 KI270518.1 KI270530.1 KI270304.1 KI270418.1 KI270424.1 KI270417.1 KI270508.1 KI270303.1 KI270381.1 KI270529.1 KI270425.1 KI270396.1 KI270363.1 KI270386.1 KI270465.1 KI270383.1 KI270384.1 KI270330.1 KI270372.1 KI270548.1 KI270580.1 KI270387.1 KI270391.1 KI270305.1 KI270373.1 KI270422.1 KI270316.1 KI270340.1 KI270338.1 KI270583.1 KI270334.1 KI270429.1 KI270393.1 KI270516.1 KI270389.1 KI270466.1 KI270388.1 KI270544.1 KI270310.1 KI270412.1 KI270395.1 KI270376.1 KI270337.1 KI270335.1 KI270378.1 KI270379.1 KI270329.1 KI270419.1 KI270336.1 KI270312.1 KI270539.1 KI270385.1 KI270423.1 KI270392.1 KI270394.1

It seems that in both cases 19 comes before 2, so I don't understand why this error is still emerging.

Hello,I find that the chromosomes in bam files must be like chr1 ,chr2..., you can use samtools view -H $yourBAMfile | sed -e 's/SN:([0-9XY])/SN:chr\1/' -e 's/SN:MT/SN:chrM/' | samtools reheader - $yourBAMfile > output.bam to add "chr" in bam files. After that I solve the problem. However, I also find your vcf filesthe chromosomes include 0, I think its strange.