samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

bcftools doesn't recognise some INFO fields from the vcf header #603

Closed heinin closed 7 years ago

heinin commented 7 years ago

I'm bumping into this error while using bcftools merge. Based on https://github.com/samtools/bcftools/issues/203 , I tried to remove carriage returns from the vcf file, but it didn't help. I'm using bcftools 1.4. What should I try next?

Error:

Could not parse the header line: "##SAMPLE=<ID=NORMAL,NAME=TCGA-42-2591-10A-01D-1526-09,ALIQUOT_ID=188673ae-cf55-4e58-aff7-cc1b2d55ec07,BAM_ID=f46a7f8e-766b-4cb1-97b5-bba94c37d20b>"

(for every sample)
...

[W::vcf_parse] FILTER 'SB1' is not defined in the header
[W::vcf_parse] FILTER 'DETP20' is not defined in the header
[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SOMATIC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SSC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'GPV' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SPV' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'CSQ' is not defined in the header, assuming Type=String

Header of one of the VCF:

##fileformat=VCFv4.1
##fileDate=20160530
##center="NCI Genomic Data Commons (GDC)"
##gdcWorkflow=<ID=somatic-mutation-calling-workflow,Name=varscan2,Description="VarScan2 Somatic Mutation Calling",Version=1.0>
##gdcWorkflow=<ID=somatic-annotation-workflow,Name=varscan2-annotation,Description="fpfilter and VEP Annotation",Version=1.0>
##INDIVIDUAL=<NAME=TCGA-29-1699,ID=fe0e3851-d8cb-4533-9536-b4826cd25f87>
##SAMPLE=<ID=NORMAL,NAME=TCGA-29-1699-10A-01W-0633-09,ALIQUOT-ID=1aafabff-3396-4bf7-85c6-b43cd48003a0,BAM-ID=59a256bb-99fd-492f-94b4-f1a4bcbbaf00>
##SAMPLE=<ID=TUMOR,NAME=TCGA-29-1699-01A-01W-0633-09,ALIQUOT-ID=9db2682f-aa52-4b1f-b4bc-047084a9aac5,BAM-ID=fc2d449c-e94e-4506-9567-6c67b386295e>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10 or more than 90 of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
##FILTER=<ID=IRC,Description="Unable to grab any sort of readcount for either the reference or the variant allele">
##FILTER=<ID=RLD25,Description="Difference in average clipped read length between variant and reference supporting reads is greater than 25">
##FILTER=<ID=SB1,Description="Reads supporting the variant have less than 0.01 fraction of the reads on one strand, but reference supporting reads are not similarly biased">
##FILTER=<ID=PB10,Description="Average position on read less than 0.1 or greater than 0.9 fraction of the read length">
##FILTER=<ID=MMQSD50,Description="Difference in average mismatch quality sum between variant and reference supporting reads is greater than 50">
##FILTER=<ID=DETP20,Description="Average distance of the variant base to the effective 3' end is less than 0.2">
##FILTER=<ID=NRC,Description="Unable to grab readcounts for variant allele">
##FILTER=<ID=MMQS100,Description="The average mismatch quality sum of reads supporting the variant is greater than 100">
##FILTER=<ID=MQD30,Description="Difference in average mapping quality sum between variant and reference supporting reads is greater than 30">
##FILTER=<ID=MVF5,Description="Variant allele frequency is less than 0.05">
##FILTER=<ID=MVC4,Description="Less than 4 high quality reads support the variant">
##reference=GRCh38.d1.vd1.fa
##contig=<ID=chr1,length=248956422,assembly=GRCh38.d1.vd1>
##contig=<ID=chr2,length=242193529,assembly=GRCh38.d1.vd1>
##contig=<ID=chr3,length=198295559,assembly=GRCh38.d1.vd1>
##contig=<ID=chr4,length=190214555,assembly=GRCh38.d1.vd1>
##contig=<ID=chr5,length=181538259,assembly=GRCh38.d1.vd1>
##contig=<ID=chr6,length=170805979,assembly=GRCh38.d1.vd1>
##contig=<ID=chr7,length=159345973,assembly=GRCh38.d1.vd1>
##contig=<ID=chr8,length=145138636,assembly=GRCh38.d1.vd1>
##contig=<ID=chr9,length=138394717,assembly=GRCh38.d1.vd1>
##contig=<ID=chr10,length=133797422,assembly=GRCh38.d1.vd1>
##contig=<ID=chr11,length=135086622,assembly=GRCh38.d1.vd1>
##contig=<ID=chr12,length=133275309,assembly=GRCh38.d1.vd1>
##contig=<ID=chr13,length=114364328,assembly=GRCh38.d1.vd1>
##contig=<ID=chr14,length=107043718,assembly=GRCh38.d1.vd1>
##contig=<ID=chr15,length=101991189,assembly=GRCh38.d1.vd1>
##contig=<ID=chr16,length=90338345,assembly=GRCh38.d1.vd1>
##contig=<ID=chr17,length=83257441,assembly=GRCh38.d1.vd1>
##contig=<ID=chr18,length=80373285,assembly=GRCh38.d1.vd1>
##contig=<ID=chr19,length=58617616,assembly=GRCh38.d1.vd1>
##contig=<ID=chr20,length=64444167,assembly=GRCh38.d1.vd1>
##contig=<ID=chr21,length=46709983,assembly=GRCh38.d1.vd1>
##contig=<ID=chr22,length=50818468,assembly=GRCh38.d1.vd1>
##contig=<ID=chrX,length=156040895,assembly=GRCh38.d1.vd1>
##contig=<ID=chrY,length=57227415,assembly=GRCh38.d1.vd1>
##contig=<ID=chrM,length=16569,assembly=GRCh38.d1.vd1>
##VEP=v84 cache=/var/lib/cwl/job453169568-cache/gdc-vep-cache/homo-sapiens/84-GRCh38 db=. ESP=20141103 dbSNP=146 polyphen=2.2.2 COSMIC=75 regbuild=13.0 gencode=GENCODE 22 assembly=GRCh38.p5 ClinVar=201601 genebuild=2014-07 sift=sift5.2.2 HGMD-PUBLIC=20154
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature-type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA-position|CDS-position|Protein-position|Amino-acids|Codons|Existing-variation|ALLELE-NUM|DISTANCE|STRAND|FLAGS|VARIANT-CLASS|SYMBOL-SOURCE|HGNC-ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|RefSeq|GENE-PHENO|SIFT|PolyPhen|DOMAINS|HGVS-OFFSET|GMAF|AFR-MAF|AMR-MAF|EAS-MAF|EUR-MAF|SAS-MAF|AA-MAF|EA-MAF|ExAC-MAF|ExAC-Adj-MAF|ExAC-AFR-MAF|ExAC-AMR-MAF|ExAC-EAS-MAF|ExAC-FIN-MAF|ExAC-NFE-MAF|ExAC-OTH-MAF|ExAC-SAS-MAF|CLIN-SIG|SOMATIC|PHENO|PUBMED|MOTIF-NAME|MOTIF-POS|HIGH-INF-POS|MOTIF-SCORE-CHANGE|ENTREZ|EVIDENCE">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ffc241da-85a9-47f0-9216-75c8d766733f-tumor      ffc241da-85a9-47f0-9216-75c8d766733f-normal

heinin commented 7 years ago

I was able to merge these using vcftools vcf-merge.

pd3 commented 7 years ago

The program does not like the - characters in SAMPLE header definitions. If you replace them with underscores, it should work. For example: ALIQUOT_ID instead of ALIQUOT-ID.

By the way, the program vcf-merge is obsolete and its use is discouraged.
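
The renaming suggested above can be sketched in a few lines of Python. This is only an illustration, not part of bcftools: the function name `sanitize_header_line` and the two regular expressions are hypothetical, and the sketch assumes that only attribute names (the part before `=` inside `<...>`) need changing, while values such as sample barcodes and UUIDs may keep their dashes.

```python
import re

# Hypothetical helpers, not part of bcftools or htslib.
# A double-quoted value, e.g. Description="...". Captured so re.split keeps it.
_QUOTED = re.compile(r'("(?:[^"\\]|\\.)*")')
# An attribute name: directly after '<' or ',' and directly before '='.
_KEY = re.compile(r'(?<=[<,])([^=,<>]+)(?==)')


def sanitize_header_line(line):
    """Replace '-' with '_' in attribute names of a structured '##' header line.

    Attribute values and quoted text are left untouched. Lines that are not
    structured header lines (no '=<') are returned unchanged.
    """
    if not line.startswith("##") or "=<" not in line:
        return line
    parts = _QUOTED.split(line)
    # Even indices hold unquoted text, odd indices hold the quoted values.
    for i in range(0, len(parts), 2):
        parts[i] = _KEY.sub(lambda m: m.group(1).replace("-", "_"), parts[i])
    return "".join(parts)


def sanitize_header(lines):
    """Apply sanitize_header_line to every line of a header, in order."""
    return [sanitize_header_line(line) for line in lines]
```

Applied to the SAMPLE lines in the header above, this turns ALIQUOT-ID and BAM-ID into ALIQUOT_ID and BAM_ID and leaves the NAME values as they are. The corrected header could then be written to a file and applied with bcftools reheader -h, once for every input file that goes into the merge.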

heinin commented 7 years ago

Thank you for the response! I tried with underscores, still getting the same errors:

Could not parse the header line: "##SAMPLE=<ID=NORMAL,NAME=TCGA-42-2591-10A-01D-1526-09,ALIQUOT_ID=188673ae-cf55-4e58-aff7-cc1b2d55ec07,BAM_ID=f46a7f8e-766b-4cb1-97b5-bba94c37d20b>"

pd3 commented 7 years ago

That's very odd, I cannot reproduce the error. Could you please send me the VCF header? What version of bcftools are you running?

heinin commented 7 years ago

I've tried 1.4 and 1.2. Here's the header:

##fileformat=VCFv4.1
##fileDate=20160530
##center="NCI Genomic Data Commons (GDC)"
##gdcWorkflow=<ID=somatic_mutation_calling_workflow,Name=varscan2,Description="VarScan2 Somatic Mutation Calling",Version=1.0>
##gdcWorkflow=<ID=somatic_annotation_workflow,Name=varscan2_annotation,Description="fpfilter and VEP Annotation",Version=1.0>
##INDIVIDUAL=
##SAMPLE=
##SAMPLE=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER=
##reference=GRCh38.d1.vd1.fa
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##VEP=v84 cache=/var/lib/cwl/job453169568_cache/gdc_vep_cache/homo_sapiens/84_GRCh38 db=. ESP=20141103 dbSNP=146 polyphen=2.2.2 COSMIC=75 regbuild=13.0 gencode=GENCODE 22 assembly=GRCh38.p5 ClinVar=201601 genebuild=2014-07 sift=sift5.2.2 HGMD-PUBLIC=20154
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ffc241da-85a9-47f0-9216-75c8d766733f-tumor ffc241da-85a9-47f0-9216-75c8d766733f-normal

pd3 commented 7 years ago

Still cannot reproduce, sorry. Can you send me the file and the command you are using directly to my email listed on the profile page?

heinin commented 7 years ago

Sure! Thank you.

pd3 commented 7 years ago

The problem has been solved offline: one or more of the input files still had a dash in the header attribute name. Closing the issue now.
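
Since the cause turned out to be a leftover dash in only some of the inputs, a quick check across all files before merging may help others who hit the same error. The following is a rough sketch under stated assumptions: the function names `dashed_keys` and `scan_file` are hypothetical, the inputs are assumed to be plain-text or gzip/bgzip-compressed VCF (not BCF), and "attribute name" is taken to mean the text before `=` inside `<...>`, outside any quoted value.

```python
import gzip
import re

# Hypothetical helpers, not part of bcftools or htslib.
_QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
_KEY = re.compile(r'(?<=[<,])([^=,<>]+)(?==)')


def dashed_keys(header_lines):
    """Return (line_number, attribute_name) pairs for names containing '-'.

    Scanning stops at the first line that does not start with '##',
    which in a VCF file is the '#CHROM' column line.
    """
    hits = []
    for number, line in enumerate(header_lines, start=1):
        if not line.startswith("##"):
            break
        if "=<" not in line:
            continue  # unstructured line such as ##fileformat or ##reference
        # Blank out quoted values so text inside Description="..." is ignored.
        unquoted = _QUOTED.sub('""', line)
        for key in _KEY.findall(unquoted):
            if "-" in key:
                hits.append((number, key))
    return hits


def scan_file(path):
    """Check the header of one VCF file (plain text or gzip/bgzip)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return dashed_keys(handle)
```

Calling `scan_file` on every input and printing any non-empty result shows which files still need their header fixed; an empty list for all of them means no attribute names with dashes were found by this check.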