populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Move SV ID updates to post annotation #761

Closed MattWellie closed 1 month ago

MattWellie commented 1 month ago

AnnotateVcf reorders the variants by ID. I think it splits the VCF to annotate each variant type separately, then re-joins the pieces, and at some point during that process there's a sort on ID.

The process we currently have:

The process we need:

Basically, update the TRUTH_VID variant annotations before passing through annotation, but only update the actual VCF ID field with that information post-annotation.
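The post-annotation half of that split could look roughly like the sketch below. It is not the pipeline's actual implementation: the function name `apply_truth_vids` is hypothetical, and it assumes TRUTH_VID is stored as a key in the INFO column and that the VCF is read as plain, uncompressed text lines.

```python
def apply_truth_vids(vcf_lines):
    """Yield VCF lines, replacing the ID column with INFO/TRUTH_VID when present.

    Assumption: TRUTH_VID lives in the INFO column (8th field) as KEY=VALUE.
    Header lines and records without a usable TRUTH_VID pass through unchanged.
    """
    for line in vcf_lines:
        if line.startswith('#'):
            yield line
            continue
        fields = line.rstrip('\n').split('\t')
        # INFO entries are ';'-separated; flags have no '=' and map to True
        info = dict(
            entry.split('=', 1) if '=' in entry else (entry, True)
            for entry in fields[7].split(';')
        )
        truth_vid = info.get('TRUTH_VID')
        if isinstance(truth_vid, str) and truth_vid != '.':
            fields[2] = truth_vid  # ID is the 3rd VCF column
        yield '\t'.join(fields) + '\n'
```

Because the ID column is only rewritten here, after annotation, any ID-based sorting done inside AnnotateVcf sees the original IDs.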

Note to future self: an optimised version of this process could be: run the whole SV workflow through to annotation as normal, then run the SV concordance check (on a sites-only version of the spicy VCF for super optimisation), then update the VCF IDs.

This version, which updates the TRUTH_VID attributes pre-annotation but doesn't apply them until post-annotation, is disjointed, but it runs each stage on the smallest data possible. 99% of the VCF 'weight' is in the genotypes rather than the annotations, so I don't think there's much benefit to this PR's implementation, other than ensuring that we don't interfere in any way with the VCF annotations until the consequence annotation has taken place.

MattWellie commented 1 month ago

So what you are telling me is that the spicy needs to go on top, not in the middle?

The ingredients need to be seasoned earlier, but only cooked once the guests arrive.

I was thinking about this a bit more, and it would be fine to do both the concordance and the ID updates post-annotation; I just thought running SVConcordance on a smaller VCF (pre-annotation) would be logical.

The process as it stands uses compressed data consistently, following the earlier SpicyVCF failures, so the size of the inputs is not currently a limitation. We could generate sites-only versions of the VCFs for concordance, which would make that whole process faster, but... I might stay complacent for now and only tweak that if there are failures.
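For reference, a sites-only VCF just drops the FORMAT and per-sample genotype columns. A minimal sketch of the idea follows, assuming plain-text VCF lines; the function name `to_sites_only` is hypothetical and nothing like this exists in the pipeline yet. In practice an existing tool such as `bcftools view -G` would be the more likely choice.

```python
def to_sites_only(vcf_lines):
    """Yield VCF lines with the FORMAT and sample columns removed.

    Keeps only the first 8 fixed columns (CHROM..INFO) of the #CHROM header
    and of every record, and drops ##FORMAT header definitions.
    """
    for line in vcf_lines:
        if line.startswith('##'):
            # FORMAT definitions are meaningless without genotype columns
            if not line.startswith('##FORMAT'):
                yield line
            continue
        fields = line.rstrip('\n').split('\t')
        yield '\t'.join(fields[:8]) + '\n'
```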