Use unmasked reference in MergeBam step

lgruen commented 1 year ago

CRAMs aligned during the NAGIM project (corresponding Terra workspace) were using dragen_functional_equivalence_mode and hence used the masked DRAGEN reference at gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta: https://github.com/broadinstitute/warp/blob/a3b438422691f447032038be52c74639dafe914f/tasks/broad/DragmapAlignment.wdl#L58

However, the subsequent MergeBam alignment step points to the unmasked reference at gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta: https://github.com/broadinstitute/warp/blob/a3b438422691f447032038be52c74639dafe914f/tasks/broad/DragmapAlignment.wdl#L70

Effectively, this means that read alignment used the masked reference, but the resulting CRAM header points to the unmasked reference. Ostensibly this was done to increase compatibility with CRAMs that were generated using non-DRAGEN workflows (to prevent MD5 reference mismatches as explained in https://github.com/populationgenomics/production-pipelines/pull/146), but we're still waiting for feedback from the Broad on the rationale: https://github.com/broadinstitute/warp/issues/871

While our production pipelines were switched over to use DRAGEN mode alignment and variant calling as well, the alignment steps used the masked reference consistently (including the merging step). Consequently, some of our CRAMs (the ones resulting from NAGIM, e.g. TOB-WGS) have the unmasked reference set in the header, while others (resulting from running our own pipeline) point to the masked reference. This inconsistency creates confusion (e.g. https://centrepopgen.slack.com/archives/C030X7WGFCL/p1669180382708679) and is likely to also create downstream problems when e.g. trying to federate CRAMs with gnomAD down the track.

Note that variant calling results (i.e. GVCFs) are not affected by this. In either case, the actual alignment of reads has been performed on the masked reference.

Proposal of steps to result in a consistent set of CRAMs:

We'd need to make sure to change the alignment step of the production pipeline to avoid the same issue for new alignments going forward.
To "backfill fix" the existing CRAMs that point to the masked reference, we’d need to rewrite all the CRAM blobs, but that should be very cheap thanks to @vladsavelyev's header-only fix.
In order not to double our storage costs, we should make sure to delete the old CRAM versions manually as soon as we're convinced that the new versions look good.

cassimons commented 1 year ago

Thanks Leo, this is very helpful. Fundamentally, the idea of mixing reference genomes is flawed to me. In my mind, a pipeline should use one reference file and that should be the reference listed in the file headers. I can appreciate the logic behind making the alternative decision that masking is not really changing the genome so generating a final cram with the unmasked cram in the header is 'ok', but it is not a tradeoff I would choose myself.

Having said that, if Broad has consciously made this decision and are willing to share their logic, then I agree there is a benefit to us following suit. However, before we make that call I really want to see this in their own words and know that we really do have an acceptable level of equivalence with their process.

dgmacarthur commented 1 year ago

[replicating a conversation from a Slack DM exchange from mid-December so we have a formal record; sorry this took so long]

This question got escalated to me as there was no obvious solution that had reached consensus. I've had several discussions with @lgruen, @katiedelange, and @cassimons, a Slack conversation with Eric Banks, and read through the various issues. There's clearly no easy answer here, as both sides of the discussion are based on perfectly reasonable principles, and the question is just how much we weight the relevant principles, as well as pragmatic considerations about possible downstream effects - but tl;dr I think we should avoid changing the CRAM headers to point to a reference that wasn't actually used in the alignment.

As far as I can tell the main principles in conflict here are:

the principle of retaining consistency with the Broad model, which would favour reheadering following their pipeline;
the principle of maintaining the accuracy of the information in our files, which @cassimons also refers to as the "principle of least surprise", which favours maintaining the header information as it is to avoid downstream issues to us or other users of the files due to a clash between the header information and the actual approach used to process the files.

In terms of maintaining consistency with the Broad: this is a good thing in general, as it avoids us needing to reinvent every wheel, ensures we're able to federate our data more easily down the line, and has some other advantages for data reuse (e.g. consistency with the standard resource bundle). But we obviously need to adopt this approach cautiously, as there are things the Broad does that we disagree with, or that are not well-suited to our purposes. We shouldn't deviate from the Broad model without good cause, but we should do so if we have good reason to do so and have thought through the downstream consequences. Here we need to acknowledge that the Broad appears to have no principled rationale behind this particular decision, as far as we can tell. It is also the case that if we need to federate our data downstream we could choose to reheader at that stage. Alignment with the resource bundle is another argument, but users should be able to deal with the use of a reference different from the one in the bundle; this doesn't strike me as compelling.

In terms of the risks of reheadering causing downstream issues, I think this is generally low (as Leo argued to me, the reference in the headers is primarily intended to facilitate accurate decompression of the reads rather than reproducibility) but it certainly isn't zero. We don't have full visibility of the way downstream tools will interact with the header info, certainly not for possible external users, but also (as I understand it) even for our own internal processes. If we keep the correct reference in the header and it's in conflict with a downstream use that will generally produce an informative error. If we reheader the CRAM, we have altered the file in a way that seems more likely to create silent errors downstream. It's true that we don't know of an example where this would happen, and from first principles such things shouldn't happen often, but it's also true that we have created a file that's more likely to silently trigger errors due to a mismatch between the header info and the actual properties of the file.

Again, no easy decision here, but overall, on reflection, I think the principle of maintaining the integrity of our files outweighs the arguments in favour of maintaining consistency with the Broad. I'm also inclined to weigh the arguments from Cas and Katie more strongly here, as this is a decision that falls more heavily (though certainly not entirely!) in the analysis domain. So overall, my call here is to keep the headers in our canonical, sharable CRAMs as pointing to the exact reference used in their alignment, and to only alter copies of these if it proves necessary for a specific analysis.

populationgenomics / production-pipelines

Use unmasked reference in MergeBam step #208