Closed MattWellie closed 1 month ago
So what you are telling me is that the spicy needs to go on top not in the middle?
The ingredients need to be seasoned earlier, but only cooked once the guests arrive.
I was thinking about this a bit more, and it would be fine to do both the concordance and ID updates post-annotation. I just thought running SVConcordance on a smaller (pre-annotation) VCF would be logical.
The process as it stands uses compressed data consistently, following the earlier SpicyVCF failures, so the size of the inputs is not currently a limitation. We could generate sites-only versions of the VCFs for concordance, which would make that whole process faster, but... I might stay complacent for now and only tweak that if there are failures.
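For reference, a minimal sketch of what "sites-only" means here: each record keeps only the eight fixed VCF columns, and the FORMAT and per-sample genotype columns are dropped. The function name `to_sites_only` is hypothetical and is not part of this workflow; in practice an existing tool (for example `bcftools view -G`) would likely do this job on the compressed files.

```python
def to_sites_only(vcf_lines):
    """Yield VCF lines with the FORMAT and per-sample columns removed."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged
            # (FORMAT header lines are kept here for simplicity)
            yield line
        else:
            # The #CHROM header and every record keep only the 8 fixed columns:
            # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
            yield "\t".join(line.split("\t")[:8])


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL\tGT\t0/1\t1/1",
]
sites_only = list(to_sites_only(example))
```

Since the genotype columns carry most of the file size, the sites-only output is much smaller than the input while still holding everything a site-level comparison needs.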
AnnotateVcf rearranges the variants by ID. I think it does some splitting weirdness to annotate each different variant type separately, then re-join, and at some point during that process there's a sort on ID.
The process we currently have:

- `UpdateStructuralVariantIDs` runs on that result, otherwise it does not run
- `SpiceUpSVIDs` runs, updating IDs to match a corresponding variant in a prior dataset - it either runs on the output of `FilterGenotypes` or `UpdateStructuralVariantIDs`, depending if the latter was invoked

The process we need:

- `UpdateStructuralVariantIDs` runs on that result, otherwise it does not run
- `FilterGenotypes` or `UpdateStructuralVariantIDs`, depending if the latter was invoked
- `SpiceUpSVIDs` runs, updating IDs to match a corresponding variant in a prior dataset - it runs on the output of `AnnotateVcfWithStrvctvre`, and either uses the 'truth IDs' embedded in the VCF info (if `UpdateStructuralVariantIDs` was run), or it generates new IDs based on variant attributes. If `UpdateStructuralVariantIDs` runs and there was no 'Truth ID' found for a variant, an ID is generated for it deterministically based on variant attributes, following the same logic for that variant as if there had been no update/concordance test

Basically, update the `TRUTH_VID` variant annotations before passing through annotation, but only update the actual VCF ID field with that information post-annotation.

Note to future self: an optimised version of this process could be - run the whole SV workflow through to annotation as normal, then (based on a sites-only version of the spicy VCF for *super* optimisation) run the SV concordance check, then update the VCF IDs.

This version which updates the `TRUTH_VID` attributes pre-annotation but doesn't apply them until post-annotation is disjointed, but operates each stage on the smallest data possible. 99% of the VCF 'weight' is in the genotypes rather than the annotation, so I don't think there's much benefit to this PR's implementation, other than it ensures that we don't in any way interfere with VCF annotations until the consequence annotation has taken place.
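
To make the post-annotation ID step concrete, here is a rough sketch of the resolution rule described above: use the `TRUTH_VID` from the INFO field when one is present, otherwise build an ID deterministically from variant attributes. The helper names and the fallback ID format (`SVTYPE_CHROM_POS_END`) are assumptions for illustration only, and are not what `SpiceUpSVIDs` actually emits.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string into a dict; flag-style keys map to ''."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def resolve_variant_id(chrom: str, pos: int, info: str) -> str:
    """Return the TRUTH_VID if the INFO field has one, else a deterministic ID."""
    fields = parse_info(info)
    truth_id = fields.get("TRUTH_VID")
    if truth_id:
        # A matching variant was found in the prior dataset: reuse its ID
        return truth_id
    # Hypothetical fallback format, built only from variant attributes so the
    # same variant gets the same ID whether or not concordance was run
    svtype = fields.get("SVTYPE", "SV")
    end = fields.get("END", str(pos))
    return f"{svtype}_{chrom}_{pos}_{end}"


matched = resolve_variant_id("chr1", 1000, "END=2000;SVTYPE=DEL;TRUTH_VID=old_DEL_1")
unmatched = resolve_variant_id("chr1", 1000, "END=2000;SVTYPE=DEL")
```

Because the fallback depends only on the variant's own attributes, a variant with no truth match gets the same ID regardless of whether `UpdateStructuralVariantIDs` ran, which is the property the list above asks for.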