I noticed that converting an Ensembl GVF file to VCF with RSAT convert-variations does not produce valid indel records, because Ensembl does not report indels the way VCF requires.
In VCF 4.0:
for a deletion, REF should contain the base before the deletion followed by the deleted bases, and ALT should contain that preceding base alone => the output here shows T => '-'
for an insertion, REF should contain the base before the insertion, and ALT should contain that base followed by the inserted sequence => the output here shows '-' => TTA
#CHROM POS ID REF ALT QUAL FILTER INFO
27 61538 rs1058809248 T - . . TSA=deletion
27 68214 rs1058087975 - TTA . . TSA=insertion
Example of valid VCF4 data
chr1 10305422 rs60414309 CAA C 7851.53 PASS set=Intersect1000GMinusSI
chr1 10316233 . C CA 2137.29 PASS set=Intersect1000GMinusSI
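The difference between the two examples can be checked mechanically. Below is a minimal sketch, not part of RSAT, that flags data lines whose REF or ALT uses the `-` placeholder. The function name `placeholder_alleles` is hypothetical, and the code assumes whitespace-separated columns in the standard order (CHROM POS ID REF ALT ...).

```python
def placeholder_alleles(vcf_line):
    """Return the REF/ALT alleles of a VCF data line that are '-' or empty.

    Hypothetical helper: an empty result means no placeholder allele was found.
    """
    if vcf_line.startswith("#"):
        return []  # header or meta line, nothing to check
    fields = vcf_line.split()
    if len(fields) < 5:
        raise ValueError("expected at least CHROM POS ID REF ALT")
    ref, alt = fields[3], fields[4]
    # ALT may list several alleles separated by commas.
    alleles = [ref] + alt.split(",")
    return [allele for allele in alleles if allele in ("-", "")]


if __name__ == "__main__":
    lines = [
        "27 61538 rs1058809248 T - . . TSA=deletion",
        "27 68214 rs1058087975 - TTA . . TSA=insertion",
        "chr1 10305422 rs60414309 CAA C 7851.53 PASS set=Intersect1000GMinusSI",
    ]
    for line in lines:
        status = "placeholder allele" if placeholder_alleles(line) else "ok"
        print(status, "|", line)
```

This only detects the `-` placeholder; it does not check that the first base of REF and ALT matches the reference genome.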
So basically, the error is not in RSAT but in the Ensembl data.
Would it be possible to fix this? It would require looking at the reference sequence to find the preceding (anchor) base and adding it to both fields.
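A rough sketch of that fix, as an illustration only and not RSAT code. The function `anchor_indel` and the `fetch_base(chrom, pos)` callback are hypothetical names. The sketch assumes 1-based coordinates where, for a deletion, `pos` is the first deleted base and, for an insertion, `pos` is the base after which the sequence is inserted. That convention is an assumption and would need to be checked against the coordinates actually found in the GVF file.

```python
def anchor_indel(chrom, pos, ref, alt, fetch_base):
    """Return (pos, ref, alt) with a VCF-style anchor base added.

    fetch_base(chrom, pos) is a hypothetical callback returning the
    reference base at a 1-based position.
    """
    if ref == "-" and alt == "-":
        raise ValueError("REF and ALT cannot both be '-'")
    if ref == "-":
        # Insertion (assumed convention: sequence inserted after `pos`).
        anchor = fetch_base(chrom, pos)
        return pos, anchor, anchor + alt
    if alt == "-":
        # Deletion (assumed convention: `pos` is the first deleted base).
        if pos <= 1:
            raise ValueError("no preceding base; not handled in this sketch")
        anchor = fetch_base(chrom, pos - 1)
        return pos - 1, anchor + ref, anchor
    # Not an indel placeholder: leave the record unchanged.
    return pos, ref, alt


if __name__ == "__main__":
    # Toy reference sequence, made up for the demonstration.
    toy_reference = {"27": "ACGTACGT"}

    def fetch_base(chrom, pos):
        return toy_reference[chrom][pos - 1]

    print(anchor_indel("27", 4, "T", "-", fetch_base))    # deletion of T at 4
    print(anchor_indel("27", 4, "-", "TTA", fetch_base))  # insertion after 4
```

In practice the callback would read from the indexed genome FASTA for the relevant assembly. Note that the deletion case shifts POS one base to the left, so the converted records may need re-sorting.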
As said above, this was not your fault, as the GVF format does not return the anchor base before insertions, deletions, and substitutions.
I am closing this issue here.
Thanks in advance, Stephane