molgenis / VaSeBuilder

Validation Set Builder
GNU Lesser General Public License v3.0

Add filter list (e.g. MVL) fields to VCF output INFO field #93

Open TDMedina opened 4 years ago

TDMedina commented 4 years ago

In the latest pull request, varconfile outputs include a list of variants per varcon, but no information about those variants, such as pathogenicity annotations. These would be difficult to add because of merged contexts. I think it would be really messy to try to cram arbitrary annotations into a single column containing a list of variants.

The most reliable per-variant output we have right now is the set of VCF slices. We could instead add any desired annotations from the filter list to the INFO fields in the VCF. It would be a bit unwieldy to get these annotations back out, but it would be reliable and expandable.

My use case: Figuring out the pathogenicity of the variants we included. Currently, this is only stored in the filter list, originally made from the MVL. So to get this information, I have to:

  1. Get a variant from the VCF slice(s)
  2. Look up that variant locus in the variant context file to find out which sample (hash) it comes from (because, surprise surprise, occasionally the same variant from two different samples has different classifications in the MVL)
    • Have to check both the varcon IDs AND the included variants in case the variant got merged into another varcon
    • OR check every varcon for an overlapping window with the variant locus
  3. Look up the hashed sample ID in the hash table.
  4. Look up the sample ID + variant locus in the MVL.
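
To make the cost of that chain concrete, here is a rough sketch of the four steps in Python. Everything in it is hypothetical: the `Varcon` record, the `chrom_pos` ID format, and the dictionary shapes for the hash table and the MVL are stand-ins for illustration, not VaSeBuilder's actual data structures or file formats.

```python
from typing import NamedTuple, Optional


class Varcon(NamedTuple):
    """Hypothetical in-memory form of one variant context file row."""
    context_id: str      # assumed to look like "chrom_pos"
    sample_hash: str     # hashed sample ID
    chrom: str
    start: int           # context window start
    end: int             # context window end
    variant_ids: tuple   # variants included in this (possibly merged) context


def find_sample_hash(chrom: str, pos: int, varcons: list) -> Optional[str]:
    """Step 2: find which sample (hash) a variant locus comes from."""
    variant_id = f"{chrom}_{pos}"
    # Check both the varcon ID AND the included variants, in case the
    # variant got merged into another varcon.
    for vc in varcons:
        if vc.context_id == variant_id or variant_id in vc.variant_ids:
            return vc.sample_hash
    # OR fall back to checking every varcon for an overlapping window.
    for vc in varcons:
        if vc.chrom == chrom and vc.start <= pos <= vc.end:
            return vc.sample_hash
    return None


def lookup_classification(chrom, pos, varcons, hash_table, mvl):
    """Steps 2-4: locus -> sample hash -> sample ID -> MVL classification."""
    sample_hash = find_sample_hash(chrom, pos, varcons)
    if sample_hash is None:
        return None
    sample_id = hash_table.get(sample_hash)        # step 3
    return mvl.get((sample_id, chrom, pos))        # step 4
```

Even in this simplified form, every variant from step 1 needs a scan of the varcon list plus two more lookups in separate files.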

That's a ton of work to check a single annotation. We could also make life a little easier, at the cost of a little more clutter, by adding the ID hashes directly to the VCF as well. In any case, the fields to add as INFO annotations could be specified with a new command line option in the new version. We would also need to add these to the INFO definitions in the VCF slice headers, check for conflicts between custom MVL headers and existing VCF INFO definitions, etc., but I think it would be easy.
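
As a rough illustration of the idea, the sketch below works on plain VCF text lines: it checks the requested fields against the existing `##INFO` definitions, adds new definitions before the `#CHROM` line, and appends the values to each matching record. The function name, the `annotations` and `field_descriptions` shapes, and the `MVL_CLASS` field are all made up for this example, and `Number=1,Type=String` is an assumed default. It is not how VaSeBuilder currently writes its VCF slices.

```python
import re


def annotate_vcf_lines(vcf_lines, annotations, field_descriptions):
    """Return new VCF lines with filter-list fields added to INFO.

    vcf_lines:          VCF lines without trailing newlines
    annotations:        {(chrom, pos, ref, alt): {field: value}} (hypothetical)
    field_descriptions: {field: description} for the fields to add
    """
    # Collect INFO IDs already defined in the header and refuse to clobber them.
    existing = set()
    for line in vcf_lines:
        match = re.match(r"##INFO=<ID=([^,>]+)", line)
        if match:
            existing.add(match.group(1))
    conflicts = existing & set(field_descriptions)
    if conflicts:
        raise ValueError(f"INFO ID(s) already defined: {sorted(conflicts)}")

    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # New definitions go just before the column header line.
            for field, desc in field_descriptions.items():
                out.append(
                    f'##INFO=<ID={field},Number=1,Type=String,Description="{desc}">'
                )
            out.append(line)
        elif line.startswith("#"):
            out.append(line)
        else:
            cols = line.split("\t")
            key = (cols[0], int(cols[1]), cols[3], cols[4])
            extra = annotations.get(key, {})
            # Note: values must not contain whitespace, ';' or '=';
            # this sketch does not sanitize them.
            new = [f"{k}={v}" for k, v in extra.items() if k in field_descriptions]
            if new:
                joined = ";".join(new)
                cols[7] = joined if cols[7] in (".", "") else f"{cols[7]};{joined}"
            out.append("\t".join(cols))
    return out
```

The hashed sample ID could be treated as just one more field here, which would also remove the need to go through the variant context file when looking a variant back up in the MVL.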