splaisan opened this issue 2 years ago (status: Open)
Dear all,
I am posting this here to inform other potential users about a current limitation. It is not a bug so much as incomplete handling of the conversion, which leads to an invalid VCF header.
The header created by my conversion does not follow the rules of the VCF specification (https://samtools.github.io/hts-specs/VCFv4.1.pdf, or later versions).
This may be due to the GVF data I used, which come from Ensembl (chicken dbSNP file from http://ftp.ensembl.org/pub/release-105/variation/gvf/gallus_gallus/gallus_gallus.gvf.gz).
The GVF header looks like this:
##gff-version 3
##gvf-version 1.07
##file-date 2021-08-24
##genome-build ensembl GRCg6a
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9031
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/so.obo?revision=1.283
##data-source Source=ensembl;version=105;url=http://vertebrates.ensembl.org/Gallus_gallus
##file-version 105
##sequence-region 8 1 30219446
##sequence-region 6 1 36374701
##sequence-region 12 1 20387278
##sequence-region 25 1 3980610
##sequence-region 10 1 21119840
##sequence-region 33 1 7821666
##sequence-region 32 1 725831
##sequence-region 3 1 110838418
##sequence-region 11 1 20200042
##sequence-region 20 1 13897287
##sequence-region 1 1 197608386
##sequence-region W 1 6813114
##sequence-region 15 1 13062184
##sequence-region 16 1 2844601
##sequence-region 28 1 5116882
##sequence-region 7 1 36742308
##sequence-region KZ626819.1 1 149503
##sequence-region KZ626834.1 1 665899
##sequence-region 31 1 6153034
##sequence-region 19 1 10323212
##sequence-region 26 1 6055710
##sequence-region 21 1 6844979
##sequence-region 23 1 6149580
##sequence-region 17 1 10762512
##sequence-region 24 1 6491222
##sequence-region 30 1 1818525
##sequence-region 9 1 24153086
##sequence-region 22 1 5459462
##sequence-region 18 1 11373140
##sequence-region AADN05001473.1 1 4777
##sequence-region 2 1 149682049
##sequence-region 27 1 8080432
##sequence-region 4 1 91315245
##sequence-region 13 1 19166714
##sequence-region Z 1 82529921
##sequence-region 5 1 59809098
##sequence-region MT 1 16775
##sequence-region 14 1 16219308
It gets transformed into the following VCF header, which lacks '=' signs between keys and values and uses the wrong dictionary format:
##fileformat=VCFv4.1
##RSAT; Phased False
##RSAT; Homozygotes False
##gff-version 3
##gvf-version 1.07
##file-date 2021-08-24
##genome-build ensembl GRCg6a
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9031
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/so.obo?revision=1.283
##data-source Source=ensembl;version=105;url=http://vertebrates.ensembl.org/Gallus_gallus
##file-version 105
##sequence-region 8 1 30219446
##sequence-region 6 1 36374701
##sequence-region 12 1 20387278
##sequence-region 25 1 3980610
##sequence-region 10 1 21119840
##sequence-region 33 1 7821666
##sequence-region 32 1 725831
##sequence-region 3 1 110838418
##sequence-region 11 1 20200042
##sequence-region 20 1 13897287
##sequence-region 1 1 197608386
##sequence-region W 1 6813114
##sequence-region 15 1 13062184
##sequence-region 16 1 2844601
##sequence-region 28 1 5116882
##sequence-region 7 1 36742308
##sequence-region KZ626819.1 1 149503
##sequence-region KZ626834.1 1 665899
##sequence-region 31 1 6153034
##sequence-region 19 1 10323212
##sequence-region 26 1 6055710
##sequence-region 21 1 6844979
##sequence-region 23 1 6149580
##sequence-region 17 1 10762512
##sequence-region 24 1 6491222
##sequence-region 30 1 1818525
##sequence-region 9 1 24153086
##sequence-region 22 1 5459462
##sequence-region 18 1 11373140
##sequence-region AADN05001473.1 1 4777
##sequence-region 2 1 149682049
##sequence-region 27 1 8080432
##sequence-region 4 1 91315245
##sequence-region 13 1 19166714
##sequence-region Z 1 82529921
##sequence-region 5 1 59809098
##sequence-region MT 1 16775
##sequence-region 14 1 16219308
#CHROM POS ID REF ALT QUAL FILTER INFO
The first problem can be corrected manually, and the dictionary can be regenerated using Picard.
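For anyone hitting the same problem, the manual header correction mentioned in this report could be scripted roughly as follows. This is only a minimal sketch and not part of the converter: the function name `fix_header_line` is hypothetical, the mapping of `##sequence-region` lines to `##contig=<ID=...,length=...>` lines is an alternative to regenerating the dictionary with Picard, and it assumes the region end coordinate equals the sequence length (true when the start is 1, as in the header above). Folding `##RSAT; Phased False` into `##RSAT=Phased False` is likewise just one possible choice.

```python
import re

# Matches a GVF "##sequence-region <seqid> <start> <end>" pragma body.
SEQ_REGION = re.compile(r"sequence-region\s+(\S+)\s+(\d+)\s+(\d+)")


def fix_header_line(line: str) -> str:
    """Rewrite one GVF-style pragma as a VCF-style meta line (sketch)."""
    line = line.rstrip("\n")
    # Leave the column header, data lines, and lines that already
    # have the ##key=value form untouched.
    if not line.startswith("##") or re.match(r"##[^\s=]+=", line):
        return line
    body = line[2:]
    m = SEQ_REGION.fullmatch(body)
    if m:
        name, _start, end = m.groups()
        # Assumption: the end coordinate is the contig length.
        return f"##contig=<ID={name},length={end}>"
    # Split on the first space: the key comes before it, the value after.
    key, _, value = body.partition(" ")
    if not value:
        return line  # nothing to pair the key with; keep as-is
    # Assumption: a trailing ';' on the key (as in "RSAT;") is dropped.
    return f"##{key.rstrip(';')}={value.strip()}"
```

Applied line by line, this would turn `##sequence-region 8 1 30219446` into `##contig=<ID=8,length=30219446>` and `##gvf-version 1.07` into `##gvf-version=1.07`, while `##fileformat=VCFv4.1` and the `#CHROM` line pass through unchanged. The result should still be checked with a VCF validator before use.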
Thanks for reporting, @splaisan. Will you have time to look into this, @santanaw?