madprime / cgivar2gvcf

Lossy conversion of Complete Genomics var file to VCF
MIT License

Invalid gVCF lines #5

Open abeconnelly opened 8 years ago

abeconnelly commented 8 years ago

Here is a snippet of a CGI-Var file:

1265    2       all     chr1    68316   68543   ref     =       =                                                       
1266    2       all     chr1    68543   68550   no-call =       ?                                                       
1267    2       all     chr1    68550   68640   ref     =       =                                                       
1268    2       all     chr1    68640   68640   no-call =       ?                                                       
1269    2       all     chr1    68640   68893   ref     =       =                                                       
1270    2       1       chr1    68893   68896   no-call TAG     ?                                                       
1270    2       2       chr1    68893   68896   snp     TAG     TAA     96      96      PASS            dbsnp.100:rs2854683             

which, after running cgivar2gvcf, produces:

chr1    68317   .       T       .       .       PASS    END=68543       GT      0/0
chr1    68544   .       T       .       .       NOCALL  END=68550       GT      ./.
chr1    68551   .       C       .       .       PASS    END=68640       GT      0/0
chr1    68641   .       T       .       .       NOCALL  END=68640       GT      ./.
chr1    68641   .       T       .       .       PASS    END=68893       GT      0/0
chr1    68894   rs2854683       TAG     TAA     .       NOCALL  .       GT      1/.

As you can see, two lines begin at different start points (68551 and 68641) but end at the same endpoint (68640), so the second one ends before it starts. I'm not sure whether this is actually an error in the CGI-Var file, as the problem appears to stem from the zero-length 'no-call' line in the originating CGI-Var file.
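For anyone wanting to spot such lines automatically, here is a rough sketch of a check, not part of cgivar2gvcf. The function name `find_invalid_blocks` and the two rules it applies (END must not precede POS, and a line must not start inside the previous line's span) are my own assumptions about what counts as invalid here; lines without an END tag are assumed to span the length of REF.

```python
def find_invalid_blocks(gvcf_lines):
    """Return (line_index, reason) pairs for suspicious gVCF lines.

    Hypothetical helper: assumes whitespace-separated VCF columns with
    INFO in the eighth column, and lines sorted by position.
    """
    problems = []
    prev_end = {}  # chromosome -> last covered 1-based position
    for i, line in enumerate(gvcf_lines):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and headers
        fields = line.split()
        chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
        # Without an END tag, the line covers len(REF) bases.
        end = pos + len(ref) - 1
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        if end < pos:
            problems.append((i, "END before POS"))
        elif pos <= prev_end.get(chrom, 0):
            problems.append((i, "overlaps previous line"))
        prev_end[chrom] = max(prev_end.get(chrom, 0), end)
    return problems


sample = [
    "chr1 68317 . T . . PASS END=68543 GT 0/0",
    "chr1 68544 . T . . NOCALL END=68550 GT ./.",
    "chr1 68551 . C . . PASS END=68640 GT 0/0",
    "chr1 68641 . T . . NOCALL END=68640 GT ./.",
    "chr1 68641 . T . . PASS END=68893 GT 0/0",
    "chr1 68894 rs2854683 TAG TAA . NOCALL . GT 1/.",
]
print(find_invalid_blocks(sample))  # [(3, 'END before POS')]
```

Run on the output above, it flags only the zero-width NOCALL line (index 3, counting from 0).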

I've attached a small test CGI-Var file that produces the above gVCF when run through cgivar2gvcf: indel_nstar.cgivar.txt

madprime commented 8 years ago

It looks like this is a broader issue: zero-width positions are generally not handled well in the current gVCF translation.

The Complete Genomics format represents some types of variation with a zero-width reference length, but VCF needs a width of at least one for the reference position. For insertion variants, the solution was to back up one position, use that base as the reference, and prepend it to the variant allele. That was fine for VCF, but the addition of reference and no-call lines in gVCF means more needs to be done (e.g. for an insertion, the preceding reference line should also be edited to shift its end backwards).
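A minimal sketch of the insertion adjustment described above, not the current cgivar2gvcf code. The function name `anchor_insertion`, the tuple layouts, and the example bases are all hypothetical. It assumes CGI coordinates are 0-based half-open, that gVCF blocks are 1-based inclusive, and that the caller supplies the reference base.

```python
def anchor_insertion(prev_block, begin, inserted, ref_base):
    """Anchor a zero-width CGI insertion on the preceding reference base.

    prev_block: (pos, end) of the preceding gVCF reference block,
                1-based inclusive, expected to end at `begin`.
    begin:      CGI coordinate of the insertion (begin == end, zero-width).
    inserted:   the inserted sequence.
    ref_base:   reference base at 1-based position `begin` (caller-supplied).

    Returns (trimmed_block_or_None, (pos, ref, alt)).
    """
    pos, end = prev_block
    if end != begin:
        raise ValueError("preceding block does not end at the insertion point")
    # Shift the preceding block's end back by one so it no longer covers
    # the anchor base; drop the block entirely if nothing is left.
    trimmed = (pos, begin - 1) if begin - 1 >= pos else None
    # Back up one position: the anchor base becomes REF and is prepended
    # to the inserted sequence to form ALT.
    record = (begin, ref_base, ref_base + inserted)
    return trimmed, record


# Illustrative values only: the bases "G" and "AC" are made up.
block, variant = anchor_insertion((68551, 68640), 68640, "AC", "G")
print(block)    # (68551, 68639)
print(variant)  # (68640, 'G', 'GAC')
```

This covers only the insertion case. What should happen to a zero-width no-call line, like the one in this report, is left open here.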