projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Unexpected behaviour when using the SplitMultiallelics Transformer with unbounded INFO fields #537

Closed: simojoe closed this issue 3 months ago

simojoe commented 8 months ago

Use case: I am parsing a VCF file with Glow and using the SplitMultiallelics transformer to obtain biallelic entries. One of my INFO fields is unbounded, meaning that it is declared with Number=. in the VCF header.

Unexpected behaviour: When splitting, the unbounded INFO field is sometimes left intact and sometimes split across the alternate alleles.

Because it is unbounded, this INFO field has a variable number of elements from one VCF entry to the next. When that number happens to equal the number of alternate alleles, the SplitMultiallelics transformer splits the field into one part per allele; when it does not, the field is kept intact.

It seems to me that this is an edge case, because one would expect an INFO field to be either split or intact throughout a single file, not both.
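
To make the inconsistency concrete, here is a minimal pure-Python sketch of the behaviour described above. It is a toy model only, not Glow's actual implementation: the function name split_unbounded_info, the dictionary layout, and the sample values are all hypothetical and chosen just for illustration.

```python
from typing import Dict, List, Sequence


def split_unbounded_info(
    alts: Sequence[str], info_values: Sequence[str]
) -> List[Dict[str, object]]:
    """Toy model of the reported behaviour (hypothetical, not Glow code).

    Produces one biallelic record per alternate allele. The unbounded
    INFO field is split only when its length happens to match the
    number of alternate alleles; otherwise every record keeps the
    whole list.
    """
    if len(info_values) == len(alts):
        # Lengths coincide: the field is treated as if it were per-allele.
        return [
            {"alt": alt, "info": [value]}
            for alt, value in zip(alts, info_values)
        ]
    # Lengths differ: the field is copied intact to every record.
    return [{"alt": alt, "info": list(info_values)} for alt in alts]


# Two entries from the same file, same INFO field, two alternate alleles each.
same_length = split_unbounded_info(["C", "G"], ["x", "y"])
different_length = split_unbounded_info(["C", "G"], ["x", "y", "z"])

print(same_length)       # INFO is split per allele
print(different_length)  # INFO is left intact on both records
```

Both calls describe the same INFO field within one file, yet the shape of the output depends only on how many elements a given entry happens to carry.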

Possible solutions :

henrydavidge commented 3 months ago

I agree that the current behavior is unintuitive / basically incorrect. Your first proposal seems reasonable to me. In general, would you expect an unbounded INFO field to remain unsplit?

henrydavidge commented 3 months ago

@simojoe I opened a PR to resolve this issue. If you have a moment, let me know if it meets your needs.