kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License
49 stars 14 forks source link

KeyError: 'calldata/GT' with beagle imputed data #3

Closed oushujun closed 6 years ago

oushujun commented 6 years ago

Hello,

I encounter an error when applying diploSHIC on my data. Not sure if this is due to the imputation data format. I realized the method can condition on missing data, but I still used imputation because samples were sequenced in different depth and hence have systematic difference of missing data rate in different populations (i.e. missing <10% in some population but 80% in other populations).

The command line I used:

$ sudo python /opt/software/SHIC/diploSHIC/diploSHIC.py fvecVcf diploid rice3k.SNP.imputed.Chr1.AR2_95.gz Chr1 43270923 rice3k.SNP.imputed.Chr1.AR2_95.diploid.aro.fvec --targetPop aro --sampleToPopFileName subgroup.list --winSize 55000

I received warnings in reading the VCF header:

/opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1516: UserWarning: invalid INFO header: '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies"\n' warnings.warn('invalid INFO header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1516: UserWarning: invalid INFO header: '##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose"\n' warnings.warn('invalid INFO header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1516: UserWarning: invalid INFO header: '##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose"\n' warnings.warn('invalid INFO header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1516: UserWarning: invalid INFO header: '##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker"\n' warnings.warn('invalid INFO header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1527: UserWarning: invalid FORMAT header: '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"\n' warnings.warn('invalid FORMAT header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1527: UserWarning: invalid FORMAT header: '##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]"\n' warnings.warn('invalid FORMAT header: %r' % header) /opt/software/miniconda/4.4.10--GCC-4.9.4/lib/python3.6/site-packages/allel/io/vcf_read.py:1527: UserWarning: invalid FORMAT header: '##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability"\n' warnings.warn('invalid FORMAT header: %r' % header)

After a while of running, I received an error:

Warning: a mask.fa file for the chr arm with all masked sites N'ed out is strongly recommended (pass in the reference to remove Ns at the very least)! Traceback (most recent call last): File "/opt/software/SHIC/diploSHIC/makeFeatureVecsForChrArmFromVcfDiploid.py", line 84, in rawgenos = np.take(vcfFile["calldata/GT"], [i for i in range(len(chroms)) if chroms[i] == chrArm], axis=0) KeyError: 'calldata/GT'

The beagle VCF format looks like this:

fileformat=VCFv4.2

filedate=20180403

source="beagle.27Jan18.7e1.jar (version 4.1)"

INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies"

INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF d

INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] a

INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker"

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"

FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]"

FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability"

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERS467753 ERS467754 ...

Chr1 1131 . T C . PASS AR2=0.98;DR2=0.98;AF=0.00089 GT:DS:GP 0/0:0:1,0,0 0/0:0:1,0,0 ... Chr1 1151 . C A . PASS AR2=1.00;DR2=1.00;AF=0.012 GT:DS:GP 0/0:0:1,0,0 0/0:0:1,0,0 ... ...

I am also running the monsquito example and it's producing output now, so I think the program is correctly installed.

Please let me know if there is an easy fix in the program or on my beagle vcf files. Both works for me. Thank you very much!

Best, Shujun

alimanfoo commented 6 years ago

Hi @oushujun,

I think the warnings occur because the INFO and FORMAT header lines are missing a '>' at the end of the line. You could try fixing that and see if the KeyError also goes away.

However I would have thought VCF parsing would still work despite the dodgy header lines. I.e., the snippet you posted should still parse. I'm not at a computer now but will try it later. A direct way to test whether the parsing is working is to run the following in a Python session:

import allel
callset = allel.read_vcf('path/to/your.vcf')

Then accessing callset['calldata/GT'] should return a numpy array.

oushujun commented 6 years ago

Hi @alimanfoo,

Thank you very much for your quick response! I fixed the header lines by adding '>' at each annotation lines and the error cggh/scikit-allel #198 goes away! That means VCF parsing relies on a correct header format. Maybe this is not critically necessary?

Thanks again!

Shujun

alimanfoo commented 6 years ago

Hi Shujun, glad that fixed it. I remember now there is a way to get scikit-allel to work around problems in the headers, but in this case fixing the headers is probably the best solution.

alimanfoo commented 6 years ago

Hi @kern-lab, FWIW when using scikit-allel to read a VCF, if you know you only ever need the CHROM, POS and GT fields, you can say this, e.g.:

vcfFile = allel.read_vcf(vcfForMaskFileName, fields=['variants/CHROM', 'variants/POS', 'calldata/GT'])

This will save a bit of time and memory as only the fields you asked for will be parsed and loaded into numpy. Also this will work even if the VCF file has broken or missing headers.

Cool to see scikit-allel in action!

andrewkern commented 6 years ago

thanks for the tip!

s9ahuja commented 4 years ago

Hi!

I looked at my vcf file and it doesn't even have the opening "<" does that mean I need to add "<" and ">" for my headers

@alimanfoo

andrewkern commented 4 years ago

ack i still haven't fixed this!! i will