opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

Load inconsistent gVCF data #244

Open mh11 opened 8 years ago

mh11 commented 8 years ago

In a gVCF file for a single individual, the variant calls can be internally inconsistent. An extract showing this can be found below:

1       10403   .       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  A  CIGAR=1M37D ...       0/1
1       10427   .       A       AC  CIGAR=1M1I  ... 0/1
1       10428   .       C       .       .       END=10435  ...       0/0

The first row is a deletion of 37 bases starting from position 10403 (end would be 10441) with genotype 0/1. This is followed by an insertion of one base at 10427 with genotype 0/1. The third row is a reference region spanning from 10428 to 10435 with genotype 0/0.

This would mean that from 10428 to 10435 the same individual has a deletion on one allele as well as the reference on both alleles. At 10427, the calls imply a deletion on one allele plus one reference allele, and at the same time an insertion on one allele plus one reference allele.

We consider this situation a conflict and an error in the input data produced by the variant caller.

Resolution

Overlapping regions are collapsed into one variant and the genotype is reset to no_call (.). For the above example, the three rows would be collapsed into one variant:

1       10403   .       A       .       .       END=10441  ...   .  
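
The collapse step described above can be sketched roughly as follows. This is only an illustration in Python, not the OpenCGA implementation: the `Call` type and the `collapse_conflicts` function are hypothetical names, the coordinates are taken as stated in this issue (including the deletion end of 10441), and the insertion is assumed to occupy the single position 10427.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    """One call for a single sample (hypothetical, simplified model)."""
    start: int
    end: int        # inclusive end position
    genotype: str


def collapse_conflicts(calls: List[Call]) -> List[Call]:
    """Merge overlapping calls into one region and reset it to no-call ('.')."""
    merged: List[Call] = []
    for call in sorted(calls, key=lambda c: (c.start, c.end)):
        if merged and call.start <= merged[-1].end:
            # Overlap with the previous region: extend it and drop the genotype.
            prev = merged[-1]
            merged[-1] = Call(prev.start, max(prev.end, call.end), ".")
        else:
            merged.append(call)
    return merged


calls = [
    Call(10403, 10441, "0/1"),  # 37-base deletion, end as stated in the issue
    Call(10427, 10427, "0/1"),  # insertion, assumed to sit at 10427 only
    Call(10428, 10435, "0/0"),  # reference region
    Call(10500, 10510, "0/0"),  # hypothetical non-overlapping region, kept as-is
]
result = collapse_conflicts(calls)
for c in result:
    print(c.start, c.end, c.genotype)
```

Under these assumptions the first three calls end up as a single 10403-10441 no-call region, while the unrelated region keeps its genotype.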

Issues

When a variant with a 'LowGQXHetIns' filter flag overlaps a variant with a 'PASS' flag, both are collapsed, so potentially real variants are removed.

1   123 . A ATATATG .  LowGQXHetIns .... 0/1
1   125 . A G . PASS  .....  0/1

mh11 commented 8 years ago

Issue with conflicting SecondaryAlternate regions. The following VCF variants

1:428:CTT:C,CTTTC  1/2 LowGQXHetDel
1:431:T:TCT  0/1 PASS

are getting normalised to

1:329:TT:-  {secAlt=1:331:-:TC} 1/2
1:331:-:TC  0/1

The SecondaryAlternate and the second variant are then exactly the same (1:331:-:TC). This causes problems further down the line (during merging); see https://github.com/opencb/opencga/issues/345
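
One way to spot this situation before merging is to look for a secondary alternate that is identical to another variant of the same sample. The sketch below is only an illustration: the dictionary layout (`id`, `sec_alts`, `gt`) and the function name `find_secalt_collisions` are made up for this example and are not the OpenCGA or biodata data model.

```python
from typing import Dict, List, Tuple


def find_secalt_collisions(variants: List[Dict]) -> List[Tuple[str, str]]:
    """Return (variant id, secondary alternate) pairs where the secondary
    alternate is identical to another variant in the same list."""
    ids = {v["id"] for v in variants}
    collisions = []
    for v in variants:
        for sec in v.get("sec_alts", []):
            if sec in ids and sec != v["id"]:
                collisions.append((v["id"], sec))
    return collisions


# The two normalised variants from the example above.
normalised = [
    {"id": "1:329:TT:-", "sec_alts": ["1:331:-:TC"], "gt": "1/2"},
    {"id": "1:331:-:TC", "sec_alts": [], "gt": "0/1"},
]
collisions = find_secalt_collisions(normalised)
print(collisions)
```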

mh11 commented 8 years ago

Insertion as a special case

Due to the insertion start-end position change in https://github.com/opencb/biodata/issues/100, the conflict resolver has to be aware of insertions and treat them differently. There should only be a conflict if an overlapping variant (e.g. a deletion or a reference region) covers the insertion gap, or if another insertion sits at the exact same gap.
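
A rough sketch of such an insertion-aware overlap check is shown below. It rests on two assumptions that should be checked against the biodata change: an insertion is represented with `end == start - 1` (the gap between two neighbouring bases), and "covering the gap" means the other region contains both bases flanking it. The `Region` type and the function names are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int  # inclusive; assumed to be start - 1 for an insertion


def is_insertion(r: Region) -> bool:
    # Assumption: an insertion is an empty interval, i.e. end < start.
    return r.end < r.start


def conflicts(a: Region, b: Region) -> bool:
    """Insertion-aware overlap test between two regions of one sample."""
    if is_insertion(a) and is_insertion(b):
        # Two insertions only conflict when they sit in the exact same gap.
        return a.start == b.start
    if is_insertion(a):
        # b must contain both bases flanking the gap (a.end and a.start).
        return b.start <= a.end and a.start <= b.end
    if is_insertion(b):
        return conflicts(b, a)
    # Plain interval overlap for deletions, SNVs and reference regions.
    return a.start <= b.end and b.start <= a.end


insertion = Region(100, 99)                        # gap between bases 99 and 100
print(conflicts(insertion, Region(95, 105)))       # deletion spanning the gap
print(conflicts(insertion, Region(100, 110)))      # region that starts right after the gap
print(conflicts(insertion, Region(100, 99)))       # another insertion in the same gap
```

With this definition a reference block that merely starts or ends next to the insertion is not treated as a conflict; whether that matches the intended behaviour of the resolver is an open design choice.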