Move SV ID updates to post annotation

AnnotateVcf reorders the variants by ID. I think it splits the VCF to annotate each variant type separately, then re-joins the pieces, and at some point during that process there's a sort on ID.

The process we currently have:

1. FilterGenotypes produces the final VCF
2. If a prior VCF is available, UpdateStructuralVariantIDs runs on that result; otherwise it does not run
3. SpiceUpSVIDs runs, updating IDs to match a corresponding variant in a prior dataset - it runs on the output of either FilterGenotypes or UpdateStructuralVariantIDs, depending on whether the latter was invoked
4. Annotation, etc.

The process we need:

1. FilterGenotypes produces the final VCF
2. If a prior VCF is available, UpdateStructuralVariantIDs runs on that result; otherwise it does not run
3. Annotation, etc. - runs on the output of either FilterGenotypes or UpdateStructuralVariantIDs, depending on whether the latter was invoked
4. SpiceUpSVIDs runs on the output of AnnotateVcfWithStrvctvre, updating IDs to match a corresponding variant in a prior dataset. It either uses the 'truth IDs' embedded in the VCF INFO field (if UpdateStructuralVariantIDs was run) or generates new IDs based on variant attributes. If UpdateStructuralVariantIDs ran but found no 'Truth ID' for a variant, an ID is generated for that variant deterministically from its attributes, using the same logic as if there had been no update/concordance test
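For illustration, the ID choice in the last step could look roughly like the sketch below. This is not the code in this PR: the helper names (`parse_info`, `generated_id`, `final_id`) and the generated ID format are hypothetical, and the only things taken from the description are the `TRUTH_VID` INFO key and the rule "use the truth ID if present, otherwise build one deterministically from variant attributes".

```python
def parse_info(info_field: str) -> dict[str, str]:
    """Parse a VCF INFO string into a dict; flag entries map to an empty string."""
    if info_field in ('', '.'):
        return {}
    parsed = {}
    for entry in info_field.split(';'):
        key, _, value = entry.partition('=')
        parsed[key] = value
    return parsed


def generated_id(chrom: str, pos: int, info: dict[str, str]) -> str:
    """Hypothetical deterministic ID, built only from variant attributes."""
    svtype = info.get('SVTYPE', 'SV')
    end = info.get('END', str(pos))
    return f'{svtype}_{chrom}_{pos}_{end}'


def final_id(chrom: str, pos: int, info_field: str) -> str:
    """Prefer the TRUTH_VID set before annotation; otherwise fall back to a generated ID."""
    info = parse_info(info_field)
    # An absent or empty TRUTH_VID takes the same path as "no concordance test was run"
    return info.get('TRUTH_VID') or generated_id(chrom, pos, info)
```

Because the fallback depends only on the variant's own attributes, a variant gets the same ID whether or not UpdateStructuralVariantIDs ran, which is the property the description asks for.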

Basically, update the TRUTH_VID variant annotations before passing through annotation, but only update the actual VCF ID field with that information post-annotation.

Note to future self: an optimised version of this process could be - run the whole SV workflow through to annotation as normal, then (based on a sites-only version of the spicy VCF for super optimisation) run the SV concordance check, then update the VCF IDs.
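As a rough picture of what "sites-only" means here: keep the eight fixed VCF columns and drop FORMAT plus every per-sample genotype column. The sketch below does that on uncompressed text lines; `to_sites_only` is a hypothetical helper, not something in this repo, and in practice an existing tool (e.g. `bcftools view -G`) would be the more likely choice.

```python
from typing import Iterable, Iterator


def to_sites_only(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines with FORMAT and all sample columns removed."""
    for line in vcf_lines:
        line = line.rstrip('\n')
        if line.startswith('##'):
            # Meta-information lines pass through unchanged
            yield line
        else:
            # The #CHROM header and data records both keep only CHROM..INFO
            yield '\t'.join(line.split('\t')[:8])


example = [
    '##fileformat=VCFv4.2',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2',
    'chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2000\tGT\t0/1\t0/0',
]
sites_only = list(to_sites_only(example))
```

The record keeps its ID and INFO (including any TRUTH_VID), which is all the concordance and ID update steps would need under this assumption.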

This version, which updates the TRUTH_VID attributes pre-annotation but doesn't apply them until post-annotation, is disjointed, but each stage operates on the smallest data possible. 99% of the VCF 'weight' is in the genotypes rather than the annotation, so I don't think there's much benefit to this PR's implementation, other than ensuring that we don't interfere with the VCF annotations in any way until the consequence annotation has taken place.

So what you are telling me is that the spicy needs to go on top not in the middle?

The ingredients need to be seasoned earlier, but only cooked once the guests arrive.

I was thinking about this a bit more, and it would be fine to do both the concordance and ID updates post-annotation. I just thought running SVConcordance on a smaller VCF (pre-annotation) would be logical.

The process as it stands uses compressed data consistently, following the earlier SpicyVCF failures, so the size of the inputs is not currently a limitation. We could generate sites-only versions of the VCFs for concordance, which would make the whole process faster, but... I might stay complacent for now and only tweak that if there are failures.

populationgenomics / production-pipelines

Move SV ID updates to post annotation #761