rsa-tools / rsat-code

This repo contains the code required to run a local version of the software suite Regulatory Sequence Analysis Tools (RSAT).
http://rsat.eu
GNU Affero General Public License v3.0

RSAT convert-variations to VCF creates an invalid header #32

Open splaisan opened 2 years ago

splaisan commented 2 years ago

Dear all,

I am posting this here to inform other potential users about a current limitation. It is not so much a bug as incomplete handling of the conversion, which leads to an invalid VCF header.

The header created by my conversion does not follow the rules of the VCF specification at https://samtools.github.io/hts-specs/VCFv4.1.pdf (or later versions).

This may be due to the GVF data I used, which comes from Ensembl (chicken dbSNP file from http://ftp.ensembl.org/pub/release-105/variation/gvf/gallus_gallus/gallus_gallus.gvf.gz).

The GVF header looks like this:

##gff-version 3
##gvf-version 1.07
##file-date 2021-08-24
##genome-build ensembl GRCg6a
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9031
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/so.obo?revision=1.283
##data-source Source=ensembl;version=105;url=http://vertebrates.ensembl.org/Gallus_gallus
##file-version 105
##sequence-region 8 1 30219446
##sequence-region 6 1 36374701
##sequence-region 12 1 20387278
##sequence-region 25 1 3980610
##sequence-region 10 1 21119840
##sequence-region 33 1 7821666
##sequence-region 32 1 725831
##sequence-region 3 1 110838418
##sequence-region 11 1 20200042
##sequence-region 20 1 13897287
##sequence-region 1 1 197608386
##sequence-region W 1 6813114
##sequence-region 15 1 13062184
##sequence-region 16 1 2844601
##sequence-region 28 1 5116882
##sequence-region 7 1 36742308
##sequence-region KZ626819.1 1 149503
##sequence-region KZ626834.1 1 665899
##sequence-region 31 1 6153034
##sequence-region 19 1 10323212
##sequence-region 26 1 6055710
##sequence-region 21 1 6844979
##sequence-region 23 1 6149580
##sequence-region 17 1 10762512
##sequence-region 24 1 6491222
##sequence-region 30 1 1818525
##sequence-region 9 1 24153086
##sequence-region 22 1 5459462
##sequence-region 18 1 11373140
##sequence-region AADN05001473.1 1 4777
##sequence-region 2 1 149682049
##sequence-region 27 1 8080432
##sequence-region 4 1 91315245
##sequence-region 13 1 19166714
##sequence-region Z 1 82529921
##sequence-region 5 1 59809098
##sequence-region MT 1 16775
##sequence-region 14 1 16219308

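This gets transformed into the VCF header shown below, which has two problems: the '=' signs between keys and values are missing, and the sequence dictionary is in the wrong format (GVF `##sequence-region` lines instead of VCF `##contig` lines).

The first problem can be corrected manually, and the dictionary can be regenerated using Picard.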

##fileformat=VCFv4.1
##RSAT; Phased                 False
##RSAT; Homozygotes            False
##gff-version 3
##gvf-version 1.07
##file-date 2021-08-24
##genome-build ensembl GRCg6a
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9031
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/so.obo?revision=1.283
##data-source Source=ensembl;version=105;url=http://vertebrates.ensembl.org/Gallus_gallus
##file-version 105
##sequence-region 8 1 30219446
##sequence-region 6 1 36374701
##sequence-region 12 1 20387278
##sequence-region 25 1 3980610
##sequence-region 10 1 21119840
##sequence-region 33 1 7821666
##sequence-region 32 1 725831
##sequence-region 3 1 110838418
##sequence-region 11 1 20200042
##sequence-region 20 1 13897287
##sequence-region 1 1 197608386
##sequence-region W 1 6813114
##sequence-region 15 1 13062184
##sequence-region 16 1 2844601
##sequence-region 28 1 5116882
##sequence-region 7 1 36742308
##sequence-region KZ626819.1 1 149503
##sequence-region KZ626834.1 1 665899
##sequence-region 31 1 6153034
##sequence-region 19 1 10323212
##sequence-region 26 1 6055710
##sequence-region 21 1 6844979
##sequence-region 23 1 6149580
##sequence-region 17 1 10762512
##sequence-region 24 1 6491222
##sequence-region 30 1 1818525
##sequence-region 9 1 24153086
##sequence-region 22 1 5459462
##sequence-region 18 1 11373140
##sequence-region AADN05001473.1 1 4777
##sequence-region 2 1 149682049
##sequence-region 27 1 8080432
##sequence-region 4 1 91315245
##sequence-region 13 1 19166714
##sequence-region Z 1 82529921
##sequence-region 5 1 59809098
##sequence-region MT 1 16775
##sequence-region 14 1 16219308
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
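
As an illustration of the manual correction described above, here is a minimal Python sketch that rewrites such header lines into `##key=value` form and turns each `##sequence-region` line into a `##contig=<ID=...,length=...>` line. It is not part of RSAT. The function name `convert_header_line` and the `RSAT_Phased` / `RSAT_Homozygotes` key names are my own assumptions, and I have not checked that every downstream tool accepts the resulting values.

```python
import re


def convert_header_line(line: str) -> str:
    """Rewrite one GVF-style '##' header line into VCF '##key=value' form.

    Hypothetical helper, not an RSAT function. Lines that are not
    meta-information lines (no leading '##') are returned unchanged.
    """
    line = line.rstrip("\n")
    if not line.startswith("##"):
        return line  # e.g. the '#CHROM ...' column header or a data line
    body = line[2:]

    # '##sequence-region <id> <start> <end>' becomes a VCF contig line.
    match = re.fullmatch(r"sequence-region\s+(\S+)\s+(\d+)\s+(\d+)", body)
    if match:
        seq_id = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        return f"##contig=<ID={seq_id},length={end - start + 1}>"

    # '##RSAT; Phased   False' becomes '##RSAT_Phased=False'
    # (the 'RSAT_' key prefix is an assumption made for this sketch).
    match = re.fullmatch(r"RSAT;\s+(\S+)\s+(\S+)", body)
    if match:
        return f"##RSAT_{match.group(1)}={match.group(2)}"

    # Generic case: the first whitespace-separated token is the key and
    # the rest is the value. Lines whose first token already contains
    # '=' (such as '##fileformat=VCFv4.1') are left as they are.
    parts = body.split(None, 1)
    if len(parts) == 2 and "=" not in parts[0]:
        return f"##{parts[0]}={parts[1]}"
    return line


def convert_header(lines):
    """Apply convert_header_line to every line of an iterable of lines."""
    return [convert_header_line(line) for line in lines]
```

Applied to the header above, `##sequence-region 8 1 30219446` would become `##contig=<ID=8,length=30219446>` and `##gff-version 3` would become `##gff-version=3`. The contig length is computed as end - start + 1, which equals the end coordinate here because every region starts at 1.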

eead-csic-compbio commented 4 weeks ago

Thanks for reporting, @splaisan. Will you have time to look into this, @santanaw?