Improve the secondary alternate representation

j-coll commented 8 years ago

Multi-allelic variants were introduced in #17 by adding a new List<String> with the secondary alternates. This approach has some problems for mixtures of SNPs and INDELs, where the normalization changes the start position of the variant or the length of the reference.

Example:

Chr	Start	Ref	Alternates	Genotypes
1	1000	C	CA,T	0|1 1|2
2	2000	TACC	TATC,T	0/1 1/2
3	3000	GTACC	GCC,G	0/1 1/2

Will be transformed into:

Chr	Start	Ref	Main alt	Secondary alts	Genotypes
1	1001	-	A	T	0|1 1|2
1	1000	C	T	A	0|2 2|1
2	2002	C	T	-	0/1 1/2
2	2001	ACC	-	T	0/2 1/2
3	3001	TA	-	-	0/1 1/2
3	3001	TACC	-	-	0/2 1/2

A more complex structure is needed to represent the position mismatch, and in the future, other more complex variants.
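To make the position shift in the tables above concrete, here is a minimal Java sketch of per-alternate trimming. It assumes a simplified rule (strip the shared suffix, then the shared prefix, and advance the start by the prefix length) and does no left-alignment. The names `TrimSketch`, `Trimmed`, `trim` and `trimAll` are hypothetical and are not the biodata normalizer.

```java
import java.util.ArrayList;
import java.util.List;

class TrimSketch {

    /** One trimmed allele; an empty string stands for the "-" shown in the tables. */
    static final class Trimmed {
        final int start;
        final String ref;
        final String alt;

        Trimmed(int start, String ref, String alt) {
            this.start = start;
            this.ref = ref;
            this.alt = alt;
        }

        @Override
        public String toString() {
            return start + ":" + (ref.isEmpty() ? "-" : ref) + ":" + (alt.isEmpty() ? "-" : alt);
        }
    }

    /** Removes the bases shared by ref and alt: first the common suffix, then the common prefix. */
    static Trimmed trim(int start, String ref, String alt) {
        int minLen = Math.min(ref.length(), alt.length());
        int suffix = 0;
        while (suffix < minLen
                && ref.charAt(ref.length() - 1 - suffix) == alt.charAt(alt.length() - 1 - suffix)) {
            suffix++;
        }
        int prefix = 0;
        while (prefix < minLen - suffix && ref.charAt(prefix) == alt.charAt(prefix)) {
            prefix++;
        }
        // Every trimmed prefix base moves the start one position to the right.
        return new Trimmed(start + prefix,
                ref.substring(prefix, ref.length() - suffix),
                alt.substring(prefix, alt.length() - suffix));
    }

    /** Trims each alternate independently, which is why the starts can diverge. */
    static List<Trimmed> trimAll(int start, String ref, String... alts) {
        List<Trimmed> out = new ArrayList<>();
        for (String alt : alts) {
            out.add(trim(start, ref, alt));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trimAll(1000, "C", "CA", "T"));       // [1001:-:A, 1000:C:T]
        System.out.println(trimAll(2000, "TACC", "TATC", "T"));  // [2002:C:T, 2001:ACC:-]
        System.out.println(trimAll(3000, "GTACC", "GCC", "G"));  // [3001:TA:-, 3001:TACC:-]
    }
}
```

Because each alternate is trimmed on its own, the two alleles of one input line can end up with different starts and different reference lengths, which a plain alternate string cannot record.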

The proposal is to replace the String of the secondary alternate with an object similar to VariantKeyFields, with position, reference and alternate. The example above would then be represented like this:

Chr	Start	Ref	Main alt	Secondary alts	Genotypes
1	1001	-	A	1000:C:T	0|1 1|2
1	1000	C	T	1001:-:A	0|2 2|1
2	2002	C	T	2001:ACC:-	0/1 1/2
2	2001	ACC	-	2002:C:T	0/2 1/2
3	3001	TA	-	3001:TACC:-	0/1 1/2
3	3001	TACC	-	3001:TA:-	0/2 1/2
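As a rough illustration of the proposed object, the following Java sketch holds the three fields and reads or writes the `start:ref:alt` notation used in the table. The class name `SecondaryAlternateSketch` and its `parse` method are hypothetical; the real class in biodata may look different.

```java
import java.util.Objects;

class SecondaryAlternateSketch {
    final int start;
    final String reference;  // "" for an empty allele, printed as "-"
    final String alternate;  // "" for an empty allele, printed as "-"

    SecondaryAlternateSketch(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }

    /** Parses the "start:ref:alt" notation from the table, e.g. "3001:TACC:-". */
    static SecondaryAlternateSketch parse(String text) {
        String[] parts = text.split(":", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected start:ref:alt but got " + text);
        }
        return new SecondaryAlternateSketch(
                Integer.parseInt(parts[0]), fromDash(parts[1]), fromDash(parts[2]));
    }

    private static String fromDash(String s) {
        return s.equals("-") ? "" : s;
    }

    private static String toDash(String s) {
        return s.isEmpty() ? "-" : s;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SecondaryAlternateSketch)) {
            return false;
        }
        SecondaryAlternateSketch other = (SecondaryAlternateSketch) o;
        return start == other.start
                && reference.equals(other.reference)
                && alternate.equals(other.alternate);
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, reference, alternate);
    }

    @Override
    public String toString() {
        return start + ":" + toDash(reference) + ":" + toDash(alternate);
    }

    public static void main(String[] args) {
        SecondaryAlternateSketch a = parse("3001:TACC:-");
        SecondaryAlternateSketch b = parse("3001:TA:-");
        // As plain strings both secondary alternates would be "-"; as objects they differ.
        System.out.println(a.alternate.equals(b.alternate)); // true
        System.out.println(a.equals(b));                     // false
    }
}
```

The chromosome 3 rows show why this matters: with a plain String both secondary alternates collapse to `-`, while the structured form keeps the two deletions apart.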

cyenyxe commented 8 years ago

Sounds like a reasonable approach. Have you considered how a conversion between allele numbers and nucleotide strings would work?

j-coll commented 8 years ago

I suppose the question is related to OpenCGA#17. Using nucleotides instead of allele codes is only possible when all the alternates start at the same position. To achieve that, the exporter tool will need to add extra nucleotides to the alternates so that they start at the same position. A similar operation is done when exporting to VCF format, where the REF and ALT columns cannot be empty and need extra nucleotides.

That means that, for the moment, strange genotype compositions that mix positions and nucleotides are discarded. For example, the variant chr1 99 GTC GTA,G normalizes into chr1 101 C A,(100:TC:-) and chr1 100 TC -,(101:C:A), where the genotype 1/2 would be converted into A/100:TC:-.
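The padding step mentioned above could look roughly like the Java sketch below. It assumes the caller already knows the reference bases of a region that covers every allele (the hypothetical `regionStart` and `regionRef` parameters) and has extended that region to the left where needed so that no padded allele ends up empty. `PadSketch`, `Allele` and `padAlternates` are illustrative names, not part of the exporter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PadSketch {

    /** A normalized allele; "" stands for an empty reference or alternate. */
    static final class Allele {
        final int start;
        final String ref;
        final String alt;

        Allele(int start, String ref, String alt) {
            this.start = start;
            this.ref = ref;
            this.alt = alt;
        }
    }

    /**
     * Rewrites every alternate so that it starts at regionStart and replaces the whole regionRef.
     * regionRef holds the reference bases from regionStart onwards (an assumption of this sketch).
     */
    static List<String> padAlternates(int regionStart, String regionRef, List<Allele> alleles) {
        List<String> padded = new ArrayList<>();
        for (Allele a : alleles) {
            int offset = a.start - regionStart;
            if (offset < 0
                    || offset + a.ref.length() > regionRef.length()
                    || !regionRef.startsWith(a.ref, offset)) {
                throw new IllegalArgumentException("Allele at " + a.start + " does not fit the region");
            }
            // Reference bases before the allele + the alternate + reference bases after it.
            padded.add(regionRef.substring(0, offset)
                    + a.alt
                    + regionRef.substring(offset + a.ref.length()));
        }
        return padded;
    }

    public static void main(String[] args) {
        // chr1 101 C A with secondary 100:TC:-, padded back to the region 99 GTC.
        List<Allele> alleles = Arrays.asList(new Allele(101, "C", "A"), new Allele(100, "TC", ""));
        System.out.println(padAlternates(99, "GTC", alleles)); // [GTA, G]
    }
}
```

With the region `99 GTC`, the two normalized alleles are padded back to `GTA` and `G`, so the genotype 1/2 can be written with nucleotides again. Choosing the region, including the extra leading base that VCF needs for pure insertions or deletions, is left to the caller in this sketch.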

opencb / biodata

Improve the secondary alternate representation #86