Open j-coll opened 8 years ago
Sounds like a reasonable approach. Have you considered how a conversion between allele numbers and nucleotide strings would work?
I suppose that the question is related to OpenCGA#17. Using nucleotides instead of allele codes is only possible when all the alternates start at the same position. To achieve that, the exporter tool will need to add extra nucleotides to the alternates so that they all start at the same position. A similar operation is done when exporting to VCF format, where the REF and ALT columns cannot be empty and need extra nucleotides.
That means that, for the moment, unusual genotype compositions that mix positions and nucleotides are discarded. For example, the variant `chr1 99 GTC GTA,G` normalizes into `chr1 101 C A,(100:TC:-)` and `chr1 100 TC -,(101:C:A)`, where the genotype `1/2` is converted into `A/100:TC:-`.
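The padding step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exporter's actual code: `NormalizedAllele`, `AllelePadder` and `padToCommonStart` are hypothetical names, the reference bases of the region are assumed to be available as a plain string, and the allele list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one normalized allele (not a biodata class). */
final class NormalizedAllele {
    final int start;          // 1-based start position of the allele
    final String reference;   // reference bases, "" when empty
    final String alternate;   // alternate bases, "" when empty

    NormalizedAllele(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }
}

final class AllelePadder {

    /** Reference bases in [from, to), given a context whose first base sits at contextStart. */
    private static String slice(String context, int contextStart, int from, int to) {
        return context.substring(from - contextStart, to - contextStart);
    }

    /** Builds [REF, ALT1, ALT2, ...] with every allele rewritten to span [start, end). */
    private static List<String> build(List<NormalizedAllele> alleles, int start, int end,
                                      String context, int contextStart) {
        List<String> columns = new ArrayList<>();
        columns.add(slice(context, contextStart, start, end));
        for (NormalizedAllele a : alleles) {
            // Reference bases before and after the allele are copied around its alternate.
            String prefix = slice(context, contextStart, start, a.start);
            String suffix = slice(context, contextStart, a.start + a.reference.length(), end);
            columns.add(prefix + a.alternate + suffix);
        }
        return columns;
    }

    /** Returns "position REF ALT1,ALT2,..." with all alleles sharing one start position. */
    static String padToCommonStart(List<NormalizedAllele> alleles, String context, int contextStart) {
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (NormalizedAllele a : alleles) {
            start = Math.min(start, a.start);
            end = Math.max(end, a.start + a.reference.length());
        }
        List<String> columns = build(alleles, start, end, context, contextStart);
        if (columns.contains("")) {
            // VCF-style rule: no column may be empty, so include one more leading base.
            start--;
            columns = build(alleles, start, end, context, contextStart);
        }
        return start + " " + columns.get(0) + " "
                + String.join(",", columns.subList(1, columns.size()));
    }
}
```

With the alleles `101:C:A` and `100:TC:-` and the reference context `GTC` starting at position 99, `padToCommonStart` returns `99 GTC GTA,G`, which matches the original variant quoted in the comment.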
Multi-allelic variants were introduced in #17 by adding a new `List<String>` with the secondary alternates. This approach has some problems for mixtures of SNPs and INDELs when the normalization changes the starting position of the variant or the length of the reference. Example:
Will be transformed into:
A more complex structure is needed to represent the position mismatch, and in the future, other more complex variants.
The proposal is to replace the String of each secondary alternate with an object similar to `VariantKeyFields`, with position, reference and alternate. The example above will be represented like this:
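As a rough illustration of the proposal, and not the actual biodata model, each secondary alternate could be held in a small object carrying its own position, reference and alternate. The class names `AlternateCoordinate` and `VariantSketch` and the method `renderGenotype` are assumptions made for this sketch. The values used in the usage note come from the `chr1 99 GTC GTA,G` case mentioned in the comments.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical replacement for the plain String secondary alternate. */
final class AlternateCoordinate {
    final int start;
    final String reference;
    final String alternate;

    AlternateCoordinate(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }

    /** Empty alleles are shown as "-", as in the examples in this thread. */
    static String display(String allele) {
        return allele.isEmpty() ? "-" : allele;
    }

    @Override
    public String toString() {
        return start + ":" + display(reference) + ":" + display(alternate);
    }
}

/** Minimal stand-in for a variant with one main and several secondary alternates. */
final class VariantSketch {
    final String chromosome;
    final int start;
    final String reference;
    final String alternate;
    final List<AlternateCoordinate> secondaryAlternates;

    VariantSketch(String chromosome, int start, String reference, String alternate,
                  List<AlternateCoordinate> secondaryAlternates) {
        this.chromosome = chromosome;
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
        this.secondaryAlternates = secondaryAlternates;
    }

    /** Maps one allele code: 0 = reference, 1 = main alternate, 2+ = secondary alternates. */
    private String alleleLabel(String code) {
        if (code.equals(".")) {
            return code; // missing allele is passed through unchanged
        }
        int index = Integer.parseInt(code);
        if (index == 0) {
            return AlternateCoordinate.display(reference);
        }
        if (index == 1) {
            return AlternateCoordinate.display(alternate);
        }
        return secondaryAlternates.get(index - 2).toString();
    }

    /** Converts a genotype of allele codes (e.g. "1/2") into nucleotides and coordinates. */
    String renderGenotype(String genotype) {
        String separator = genotype.contains("|") ? "|" : "/";
        String[] codes = genotype.split(Pattern.quote(separator));
        StringBuilder rendered = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            if (i > 0) {
                rendered.append(separator);
            }
            rendered.append(alleleLabel(codes[i]));
        }
        return rendered.toString();
    }
}
```

For a `VariantSketch` built as `chr1 101 C A` with the single secondary alternate `new AlternateCoordinate(100, "TC", "")`, calling `renderGenotype("1/2")` gives `A/100:TC:-`. Because the secondary alternate keeps its own position and reference, the position mismatch stays explicit and does not have to be encoded inside a string.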