Unable to read VCF data

oalavijeh commented 5 years ago

Hello,

I am trying to parse structural VCF data into vcfR and am running into the following error:

"This file has 861 header elements and 1009 columns in the body. This should never happen"

My data is a merged vcf of 1000 people with structural variants (SV) called by Manta and then subdivided by the type of SV e.g. insertions, deletions etc. Unfortunately I am unable to share this as the work is done via secure airlock.

The merging was done with bcftools if that helpful to know. Looking at the original per patient SV file there are multiple headers as well. The columns represent the patient IDs I guess once merged.

You help would be much appreciated.

Many thanks

knausb commented 5 years ago

Hi @oalavijeh,

First off, this sounds like it is human data. My understanding is that the sharing of human data risks the release of patient information. I do not work in human, but just wanted to validate that you should not share human data. However, this presents a practical challenge because you can not use your data to create a minimal reproducible example.

VCF data typically includes 9 columns that provide information about each variant. After these 9 columns are typically one column per sample. The error you've reported is telling you that your data has 1009 columns which is consistent with your report (9 + 1000). However, your header line (begining with #CHROM) has only 861 elements. This means you have columns without header elements. Stated slightly differently, I think you have samples (patients) that have no name or identifier. This would either be because Manta did not include them (I have no experience with Manta) or your bcftools step omitted them. Or there's something you're not telling me :) If you're working in a Unix environment (I use Ubuntu) you could validate this using as text editor such as less or the below command.

zgrep "^#CHROM" my_VCF_data.vcf.gz | wc -w

Here the result should be 1009, but I'm going to guess that it will instead report 861, which is why you're getting the error.

If my interpretation of the situation is correct than it seems like the prudent step would be to make sure your samples all have some sort of identifier in the header line. Please let me know if this is correct (or not).

Brian

oalavijeh commented 5 years ago

HI Brian,

Thank you for such a helpful and clear to understand answer especially given how little you had to work with! You are correct, running the above code give 861. I will see if I can autopopulate these headers somehow (many bcftools?).

Many thanks

Omid

knausb commented 5 years ago

Glad I could help! And remember, VCF data are just text files. So you could open it in vim and edit it manually. You could also do some grepping.

grep "^##" my_vcf_file.vcf > my_meta.txt
grep "^#CHROM my_vcf_file.vcf > my_header.txt
grep -v "^#" my_vcf_file.vcf > my_body.txt
cat my_meta.txt my_new_header.txt my_body.txt > new_vcf_file.vcf

Good luck!

knausb / vcfR

Unable to read VCF data #146