ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

Support for New Missing Data Formatting from GATK #78

Closed ksamuk closed 6 months ago

ksamuk commented 1 year ago

GATK has implemented a (quite radical) new way of encoding missing data that we will need to support going forward: https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531

ChenJuiYANG commented 6 months ago

Hi Kieran,

I wonder what the results would be if the current version were run on a new GATK-generated VCF file. Are the results reliable? Do you have any suggestions if the new GATK-generated VCF file cannot be used as is?

Cheers, Chen-Jui

ksamuk commented 6 months ago

Hi Chen-Jui,

I'm not quite sure at the moment; this is going to be a complex fix to implement. In the meantime, a quick fix might be to preprocess your data using bcftools to set genotypes with DP < 1 to missing ("."), as below:

bcftools +setGT your.vcf.gz -- -t q -n . -e 'FMT/DP>=1'

Cheers,

Kieran
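
For readers who want to see what the bcftools workaround above does to each record, here is a minimal Python sketch of the same idea. It is not part of pixy or bcftools: the function name `mask_low_depth_genotypes` and the `min_dp` parameter are hypothetical, and the sketch assumes tab-delimited VCF data lines with `GT` and `DP` in the FORMAT column. Treating a missing DP (".") as below the threshold and keeping the original ploidy and phasing separators are also assumptions made here.

```python
import re


def mask_low_depth_genotypes(vcf_line, min_dp=1):
    """Return a VCF line with GT set to missing wherever FORMAT/DP < min_dp.

    Hypothetical helper for illustration only; header lines and records
    without both GT and DP in FORMAT are returned unchanged.
    """
    if vcf_line.startswith("#"):
        return vcf_line  # header lines pass through untouched

    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return vcf_line  # no FORMAT/sample columns to edit

    keys = fields[8].split(":")
    if "GT" not in keys or "DP" not in keys:
        return vcf_line
    gt_i, dp_i = keys.index("GT"), keys.index("DP")

    for col in range(9, len(fields)):
        values = fields[col].split(":")
        # Trailing FORMAT fields may be dropped; treat an absent DP as missing.
        dp = values[dp_i] if dp_i < len(values) else "."
        # Assumption: a missing DP counts as below the threshold.
        if dp == "." or int(dp) < min_dp:
            # Replace every allele with "." but keep the "/" or "|" separators.
            values[gt_i] = re.sub(r"[^/|]+", ".", values[gt_i])
            fields[col] = ":".join(values)

    newline = "\n" if vcf_line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applied line by line to an uncompressed VCF, this leaves genotypes with at least `min_dp` reads untouched and rewrites the rest as missing. For real data, the bcftools command is the better-tested choice.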

ksamuk commented 6 months ago

This is now addressed in the latest version of pixy.