rsa-tools / rsat-code

This repo contains the code required to run a local version of the software suite Regulatory Sequence Analysis Tools (RSAT).
http://rsat.eu
GNU Affero General Public License v3.0

rsat convert-variations VCF format error for indels #31

Closed splaisan closed 2 years ago

splaisan commented 2 years ago

Dear developers,

I noticed that converting an Ensembl GVF file to VCF with rsat convert-variations does not generate valid indel data, because Ensembl does not report indels the way VCF 4.0 requires.

Example of the GVF input and the converted output

27      dbSNP   deletion        61538   61538   .       +       .       ID=259;Variant_seq=-;Dbxref=dbSNP_150:rs1058809248;Reference_seq=T
27      dbSNP   insertion       68214   68214   .       +       .       ID=437;Variant_seq=TTA;Dbxref=dbSNP_150:rs1058087975;Reference_seq=-
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
27      61538   rs1058809248    T       -       .       .       TSA=deletion
27      68214   rs1058087975    -       TTA     .       .       TSA=insertion

Example of valid VCF4 data

chr1    10305422        rs60414309      CAA     C       7851.53 PASS    set=Intersect1000GMinusSI
chr1    10316233        .       C       CA      2137.29 PASS    set=Intersect1000GMinusSI

So basically, the error is not in RSAT but in the Ensembl data. Would it be possible to fix this? It requires looking at the reference sequence to find the preceding (anchor) base and adding it to both the REF and ALT fields.
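To illustrate the idea, here is a minimal Python sketch of the anchoring step. It is not RSAT code: the function name `left_anchor` and the `genome` dictionary (chromosome name mapped to its sequence, for instance loaded from a FASTA file) are hypothetical. It also assumes the GFF3 convention that a zero-length feature such as an insertion lies to the right of the indicated base; this should be checked against the actual Ensembl dumps before relying on it.

```python
def left_anchor(chrom, start, ref, alt, genome):
    """Convert a GVF-style indel ('-' for the empty allele) to VCF style.

    genome is a hypothetical dict {chromosome name: sequence string}.
    Positions are 1-based. Returns (pos, ref, alt) with the anchor base
    prepended to both alleles.
    """
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt

    if ref and alt:
        # Both alleles are non-empty: already representable in VCF.
        return start, ref, alt

    if ref:
        # Deletion: the anchor is the base just before the deleted bases.
        pos = start - 1
    else:
        # Insertion: ASSUMPTION (GFF3 zero-length feature convention) that
        # the inserted bases lie to the right of `start`, so the base at
        # `start` itself is the anchor.
        pos = start

    if pos < 1:
        # No preceding base exists; this case is not handled in the sketch.
        raise ValueError("indel at the first base has no preceding anchor")

    anchor = genome[chrom][pos - 1]  # 1-based position to 0-based index
    return pos, anchor + ref, anchor + alt


# Toy reference sequence, made up for the example (not real genome data).
toy = {"27": "ACGTTGCA"}
print(left_anchor("27", 5, "T", "-", toy))    # (4, 'TT', 'T')
print(left_anchor("27", 3, "-", "TTA", toy))  # (3, 'G', 'GTTA')
```

With the real reference sequence, the two records above would get a REF and ALT that both start with the anchor base, like the valid VCF4 lines shown earlier.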

Thanks in advance, Stephane

splaisan commented 2 years ago

As said above, this was not your fault, as the GVF format does not return the anchor base before insertions, deletions, and substitutions. I am closing this issue here.