opencb / biodata

Java library that models biological entities and their equivalents in different file formats typically used in bioinformatics
Apache License 2.0

Improve the secondary alternate representation #86

Open j-coll opened 8 years ago

j-coll commented 8 years ago

Multi-allelic variants were introduced in #17 by adding a new List<String> that holds the secondary alternates. This approach has problems with mixtures of SNPs and INDELs, where normalization changes the start position of the variant or the length of the reference.

Example:

Chr  Start  Ref    Alternates  Genotypes
1    1000   C      CA,T        0|1 1|2
2    2000   TACC   TATC,T      0/1 1/2
3    3000   GTACC  GCC,G       0/1 1/2

After normalization, this will be transformed into:

Chr  Start  Ref   Main alt  Secondary alts  Genotypes
1    1001   -     A         T               0|1 1|2
1    1000   C     T         A               0|2 2|1
2    2002   C     T         -               0/1 1/2
2    2001   ACC   -         T               0/2 1/2
3    3001   TA    -         -               0/1 1/2
3    3001   TACC  -         -               0/2 1/2

A more complex structure is needed to represent the position mismatch, and in the future, other more complex variants.
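The position shift shown in the tables above can be reproduced with a minimal sketch, assuming that normalization amounts to trimming the nucleotides shared by REF and ALT (suffix first, then prefix). The class name `AlleleTrimSketch` and its members are hypothetical and are not biodata's actual normalizer; an empty string stands for the `-` allele.

```java
/** Hypothetical sketch of per-alternate trimming; not the biodata implementation. */
final class AlleleTrimSketch {
    final int start;          // 1-based start after trimming
    final String reference;   // "" stands for the empty allele shown as "-"
    final String alternate;

    AlleleTrimSketch(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }

    /** Trims the shared suffix, then the shared prefix, of one REF/ALT pair. */
    static AlleleTrimSketch trim(int start, String ref, String alt) {
        // Count trailing nucleotides common to both alleles.
        int suffix = 0;
        while (suffix < ref.length() && suffix < alt.length()
                && ref.charAt(ref.length() - 1 - suffix) == alt.charAt(alt.length() - 1 - suffix)) {
            suffix++;
        }
        ref = ref.substring(0, ref.length() - suffix);
        alt = alt.substring(0, alt.length() - suffix);

        // Count leading nucleotides common to both alleles; each one shifts the start.
        int prefix = 0;
        while (prefix < ref.length() && prefix < alt.length()
                && ref.charAt(prefix) == alt.charAt(prefix)) {
            prefix++;
        }
        return new AlleleTrimSketch(start + prefix, ref.substring(prefix), alt.substring(prefix));
    }
}
```

Under these assumptions, `trim(1000, "C", "CA")` yields start 1001 with an empty reference, while `trim(1000, "C", "T")` stays at 1000, so the two alternates of the same multi-allelic variant no longer share a start position.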

The proposal is to replace the String of each secondary alternate with an object similar to VariantKeyFields, holding position, reference and alternate. The example above would be represented like this:

Chr  Start  Ref   Main alt  Secondary alts  Genotypes
1    1001   -     A         1000:C:T        0|1 1|2
1    1000   C     T         1001:-:A        0|2 2|1
2    2002   C     T         2001:ACC:-      0/1 1/2
2    2001   ACC   -         2002:C:T        0/2 1/2
3    3001   TA    -         3001:TACC:-     0/1 1/2
3    3001   TACC  -         3001:TA:-       0/2 1/2
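
As a rough illustration of the proposed object, the sketch below models one secondary alternate with position, reference and alternate, and reads or writes the `position:reference:alternate` notation used in the table. The class name `AlternateCoordinateSketch` and its methods are assumptions made for this example; they are not the existing VariantKeyFields class nor a final API.

```java
/** Hypothetical value object for one secondary alternate; names are illustrative only. */
final class AlternateCoordinateSketch {
    final int start;          // 1-based start of this alternate after normalization
    final String reference;   // "" stands for the empty allele written as "-"
    final String alternate;

    AlternateCoordinateSketch(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }

    /** Parses the "start:reference:alternate" notation, e.g. "1001:-:A". */
    static AlternateCoordinateSketch parse(String text) {
        String[] parts = text.split(":", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected start:reference:alternate, got: " + text);
        }
        return new AlternateCoordinateSketch(
                Integer.parseInt(parts[0]), fromDash(parts[1]), fromDash(parts[2]));
    }

    private static String fromDash(String allele) {
        return "-".equals(allele) ? "" : allele;
    }

    private static String toDash(String allele) {
        return allele.isEmpty() ? "-" : allele;
    }

    /** Formats back to the same notation, writing empty alleles as "-". */
    @Override
    public String toString() {
        return start + ":" + toDash(reference) + ":" + toDash(alternate);
    }
}
```

With such an object, the secondary alternates of a variant would become a list of these coordinates instead of a List<String>, so each one can carry its own start and reference.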

cyenyxe commented 8 years ago

Sounds like a reasonable approach. Have you considered how a conversion between allele numbers and nucleotide strings would work?

j-coll commented 8 years ago

I suppose the question is related to OpenCGA#17. Using nucleotides instead of allele codes is only possible when all the alternates start at the same position. In that case, the exporter tool will need to add extra nucleotides to the alternates so that they start at the same position. A similar operation is done when exporting to VCF, where the REF and ALT columns cannot be empty and need extra nucleotides.
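
The padding step described above could look roughly like the following sketch. It assumes the exporter already knows the reference nucleotides of a region that covers every alternate (for example, taken from the original VCF REF column); the class `AlternatePaddingSketch` and the method `pad` are hypothetical names, and an empty string stands for the `-` allele.

```java
/** Hypothetical sketch of padding a normalized alternate back onto a shared region. */
final class AlternatePaddingSketch {

    /**
     * Rebuilds the full alternate string over a shared reference region.
     *
     * @param regionStart 1-based start of the shared region
     * @param regionRef   reference nucleotides of the shared region
     * @param alleleStart 1-based start of the normalized alternate
     * @param alleleRef   normalized reference ("" for an insertion)
     * @param alleleAlt   normalized alternate ("" for a deletion)
     */
    static String pad(int regionStart, String regionRef,
                      int alleleStart, String alleleRef, String alleleAlt) {
        int offset = alleleStart - regionStart;
        // The normalized reference must lie inside the region and match it.
        if (offset < 0 || offset + alleleRef.length() > regionRef.length()
                || !regionRef.startsWith(alleleRef, offset)) {
            throw new IllegalArgumentException("Allele does not fit the given reference region");
        }
        // Keep the flanking reference nucleotides and swap in the alternate.
        return regionRef.substring(0, offset)
                + alleleAlt
                + regionRef.substring(offset + alleleRef.length());
    }
}
```

For the region starting at 99 with reference GTC, `pad(99, "GTC", 101, "C", "A")` gives GTA and `pad(99, "GTC", 100, "TC", "")` gives G, so both alternates start at the same position again and neither is empty.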

That means that, for the moment, unusual genotype compositions that mix positions and nucleotides are discarded. For example, the variant chr1 99 GTC GTA,G normalizes into chr1 101 C A,(100:TC:-) and chr1 100 TC -,(101:C:A), so the genotype 1/2 would be converted into A/100:TC:-.