oushujun closed this 6 years ago
Hi @oushujun,
I think the warnings occur because the INFO and FORMAT header lines are missing a '>' at the end of the line. You could try fixing that and see if the KeyError also goes away.
However, I would have thought VCF parsing would still work despite the dodgy header lines, i.e., the snippet you posted should still parse. I'm not at a computer now but will try it later. A direct way to test whether the parsing is working is to run the following in a Python session:
import allel
callset = allel.read_vcf('path/to/your.vcf')
Then accessing callset['calldata/GT'] should return a numpy array.
Hi @alimanfoo,
Thank you very much for your quick response! I fixed the header lines by adding '>' at the end of each annotation line, and the error (cggh/scikit-allel #198) went away! That means VCF parsing relies on a correct header format. Maybe this is not critically necessary?
Thanks again!
Shujun
Hi Shujun, glad that fixed it. I remember now there is a way to get scikit-allel to work around problems in the headers, but in this case fixing the headers is probably the best solution.
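For anyone hitting the same warnings, here is a minimal sketch of the header fix discussed above: it appends the missing '>' to ##INFO and ##FORMAT lines and leaves everything else alone. The function name `close_header_brackets` and the example header lines are hypothetical, made up for illustration; this is not part of scikit-allel or diploSHIC.

```python
def close_header_brackets(lines):
    """Append '>' to ##INFO/##FORMAT lines that open with '<' but never close.

    Hypothetical helper for illustration; all other lines pass through unchanged.
    """
    fixed = []
    for line in lines:
        text = line.rstrip("\r\n")
        is_structured = text.startswith(("##INFO=<", "##FORMAT=<"))
        if is_structured and not text.endswith(">"):
            text += ">"
        fixed.append(text)
    return fixed


# Illustrative header lines (not taken from the file in this thread).
example = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency"',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1',
]
for line in close_header_brackets(example):
    print(line)
```

To apply it to a real file you would read the lines, pass them through the function, and write them to a new file, so the original VCF is kept as a backup.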
Hi @kern-lab, FWIW when using scikit-allel to read a VCF, if you know you only ever need the CHROM, POS and GT fields, you can say this, e.g.:
vcfFile = allel.read_vcf(vcfForMaskFileName, fields=['variants/CHROM', 'variants/POS', 'calldata/GT'])
This will save a bit of time and memory as only the fields you asked for will be parsed and loaded into numpy. Also this will work even if the VCF file has broken or missing headers.
Cool to see scikit-allel in action!
thanks for the tip!
Hi!
I looked at my VCF file and it doesn't even have the opening "<". Does that mean I need to add "<" and ">" to my headers?
@alimanfoo
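In the VCF specification, structured header lines are written as ##INFO=<ID=...,...> and ##FORMAT=<ID=...,...>, with both angle brackets present. A rough sketch for listing which header lines in a file lack either bracket is below; the names `STRUCTURED_KEYS` and `find_malformed_headers` are hypothetical and the check only covers INFO and FORMAT lines.

```python
# Header keys checked here; other structured keys could be added if needed.
STRUCTURED_KEYS = ("##INFO=", "##FORMAT=")


def find_malformed_headers(lines):
    """Return (line_number, problem) pairs for INFO/FORMAT lines lacking '<' or '>'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        for key in STRUCTURED_KEYS:
            if text.startswith(key):
                value = text[len(key):]
                if not value.startswith("<"):
                    problems.append((number, "missing '<'"))
                if not value.endswith(">"):
                    problems.append((number, "missing '>'"))
    return problems


# Illustrative header lines (hypothetical, not from a real file).
header = [
    '##INFO=ID=DR2,Number=1,Type=Float,Description="Example field"',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
]
for number, problem in find_malformed_headers(header):
    print(f"line {number}: {problem}")
```

An empty result means the INFO and FORMAT lines at least have both brackets; it does not validate the rest of the header.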
ack i still haven't fixed this!! i will
Hello,
I encountered an error when running diploSHIC on my data. I'm not sure whether this is due to the imputed data format. I realize the method can condition on missing data, but I still used imputation because samples were sequenced at different depths and hence have systematic differences in missing-data rate between populations (i.e., <10% missing in some populations but 80% in others).
The command line I used:
I received warnings in reading the VCF header:
After a while of running, I received an error:
The beagle VCF format looks like this:
I am also running the mosquito example and it's producing output now, so I think the program is correctly installed.
Please let me know if there is an easy fix, either in the program or in my Beagle VCF files. Either works for me. Thank you very much!
Best, Shujun