opencb / biodata

Java library that models biological entities and their equivalents in different file formats typically used in bioinformatics
Apache License 2.0
29 stars 34 forks source link

Variant end must be a position in the reference genome #100

Open j-coll opened 8 years ago

j-coll commented 8 years ago

At this moment, the field "end" of a variant is the start + length - 1 where length = max(ref.length, alt.length).

In order to make the variant object more consistent, we should use this rule:

Variant coordinates (start, end) should always refer to the reference genome.

The current "end" definitions does not satisfy this rule in the case of variants where the alternate is bigger than the reference, because the end is not referring to a position in the reference genome.

The variants where the alternate is bigger than the reference are the insertions. For example, chr1 100 A ACCT, with the current definition, the length is 4 and the end is 103.

Changing the definition of end to end = start + referenceLength - 1, the end will refer always to the reference genome. For the previous example, end will be 100.

There is a situation that have to be taken in mind, where end is less than the start. If the reference is empty. chr1 101 - CCT, will end in position 100 (because referenceLength is 1).

mh11 commented 8 years ago

Overlapping variants

Since an insertions fall between two base positions, an insertion does not overlap with any SNP calls. The issue is that for a VCF files, the insertions have to be anchored to a genomic position, which is traditionally done to the base before.

A-T-G-T-T-C
A-T-GxT-T-C
    /\ insertion

Any merging of variants should take this into account and merge the insertion with any variants in the previous position.

For two subjects A and B
A 1:99:A:T  0/1  (SNP)
B 1:100::AGG  0/1  (Insertion)
=> Merge to 0/1 (A) and 0/2 (B) where the Insertion is a Secondary Alternate

It can also happen that the same individual has both (a SNP and an Insertion), in which case both variants should be recorded as a list of genotypes. The export to VCF would have to 'merge' them into one variant call e.g. 1:99:A:TAGG TODO

One subject A
A 1:99:A:T  0/1  (SNP)
A 1:100::AGG  0/1  (Insertion)
=> Merge to 0/1,0/2 (A) where the Insertion is a Secondary Alternate