split-vep CSQ extraction

zyh4482 commented 3 years ago

I would like to present a test report for this plugin because I found that someone else met the same issue before, but with few clear answer.
The purpose of this plugin is to extract columns from VEP-annotated vcf. For me, I need to extract CSQ from vep.vcf. When I tried to split csq, it failed with following log.

[W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse_info] INFO 'PSI' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'GT' at chr1:36178715 is not defined in the header, assuming Type=String Segmentation fault (core dumped)

The code for this issue should be as follow:

bgzip in.vep.vcf (input a bgzp compressed vcf.gz for tabix indexing)
tabix -p vcf in.vep.vcf.gz (tabix indexing, although I don't know why bcftools requires it. )
bcftools +split-vep -p csq -O v -d -A tab -f '%CHROM\t%POS\t%REF\t%ALT\t%CSQ\n' in.vep.vcf.gz -o out.txt OR bcftools +split-vep -u -O v -d -A tab -f '%CHROM\t%POS\t%REF\t%ALT\t%CSQ\n' in.vep.vcf.gz -o out.txt (-p or -u option is important, this issue has been reported before)
gzip -d SRR8281${i}.vep.vcf.gz (If you'd like to keep vep.vcf instead of .gz)

The plugin could successfully split required columns.

-p and -u option result the same output. The only difference is that -u option reports additional warning: Note: ambiguous key %STRAND; using the STRAND subfield of CSQ, not the INFO/STRAND tag

But it does not have a header so I have to manually set it latter. Don't know how to paste header yet.

I would like to share this answer here in case some new comers may have this question again, despite it is very simple.

pd3 commented 3 years ago

The VCF should have a full header, bcftools need it. In some cases the program can work around the missing definitions, but not always. The command bcftools reheader can be used to update it, but the missing contig and tag definitions must be provided by the user.

zyh4482 commented 3 years ago

My vcf have full header. But after split-vep splitting csq columns, all the header disappear. Part of the first line of the output looks like: chr1 36178715 GGCCGAGGCCCGGGCGGAGCGGGAGGCGGAGGCCCGGAGGCGGGAG G - inframe_deletion MODERATE MAP7D1 ENSG00000116871 Transcript ENST00000316156 protein_coding 10/16 . . . 2260-2304 1807-1851 603-617 AEARAEREAEARRRE/- GCCGAGGCCCGGGCGGAGCGGGAGGCGGAGGCCCGGAGGCGGGAG/- . . 1 . HGNC HGNC

I'm sure my vcf is intact. Everything works fine except the header is lost after splitting.

I've written a python script to re-head the output, it is no big deal. But still no idea why this happens.

pd3 commented 3 years ago

I don't think so. The program wouldn't print the warnings about missing contig names and annotations. Please provide a test case if you believe the VCF is okay to reproduce the problem.

Regarding the output, with the -f option one tells the program to create a non-VCF text tab-delimited output, so it is to be expected to find no header on the output.

zyh4482 commented 3 years ago

Sorry for the confusion. I understand now. The "header" I mentioned above should refer to the header of tab-delimited table rather than vcf. For instance, I would like to add a header for each columns in the text table, which looks like:

CHROM Allele Consequence IMPACT SYMBOL Gene Feature_type Feature chr1 - inframe_deletion MODERATE MAP7D1 ENSG00000116871 Transcript

As for the missing names, my vcfs placed PSI and genotype(GT) info at the end of each line. But no header in vcf describe these contig names. That's why it print warnings. It is acceptable because they are not required in my case. For instance:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1

chr1 36178715 …………………………………………………………………………………………………… …………………………………………………………………………………………………………………………… ||||||;PSI=0.9049 GT 0/1

pd3 commented 3 years ago

First fix the VCF header and we can start from there. Also a test case, i.e. a small VCF including its header, is required to investigate this further.

samtools / bcftools

split-vep CSQ extraction #1572

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1