Open lgruen opened 1 year ago
Thanks Leo, this is very helpful. Fundamentally, the idea of mixing reference genomes is flawed to me. In my mind, a pipeline should use one reference file and that should be the reference listed in the file headers. I can appreciate the logic behind making the alternative decision that masking is not really changing the genome so generating a final cram with the unmasked cram in the header is 'ok', but it is not a tradeoff I would choose myself.
Having said that, if Broad has consciously made this decision and are willing to share their logic, then I agree there is a benefit to us following suit. However, before we make that call I really want to see this in their own words and know that we really do have an acceptable level of equivalence with their process.
[replicating a conversation from a Slack DM exchange from mid-December so we have a formal record; sorry this took so long]
This question got escalated to me as there was no obvious solution that had reached consensus. I've had several discussions with @lgruen, @katiedelange, and @cassimons, a Slack conversation with Eric Banks, and read through the various issues. There's clearly no easy answer here, as both sides of the discussion are based on perfectly reasonable principles, and the question is just how much we weight the relevant principles, as well as pragmatic considerations about possible downstream effects - but tl;dr I think we should avoid changing the CRAM headers to point to a reference that wasn't actually used in the alignment.
As far as I can tell the main principles in conflict here are:
In terms of maintaining consistency with the Broad: this is a good thing in general, as it avoids us needing to reinvent every wheel, ensures we're able to federate our data more easily down the line, and has some other advantages for data reuse (e.g. consistency with the standard resource bundle). But we obviously need to adopt this approach cautiously, as there are things the Broad does that we disagree with, or that are not well-suited to our purposes. We shouldn't deviate from the Broad model without good cause, but we should do so if we have good reason to do so and have thought through the downstream consequences. Here we need to acknowledge that the Broad appears to have no principled rationale behind this particular decision, as far as we can tell. It is also the case that if we need to federate our data downstream we could choose to reheader at that stage. Alignment with the resource bundle is another argument, but users should be able to deal with the use of a reference different from the one in the bundle; this doesn't strike me as compelling.
In terms of the risks of reheadering causing downstream issues, I think this is generally low (as Leo argued to me, the reference in the headers is primarily intended to facilitate accurate decompression of the reads rather than reproducibility) but it certainly isn't zero. We don't have full visibility of the way downstream tools will interact with the header info, certainly not for possible external users, but also (as I understand it) even for our own internal processes. If we keep the correct reference in the header and it's in conflict with a downstream use that will generally produce an informative error. If we reheader the CRAM, we have altered the file in a way that seems more likely to create silent errors downstream. It's true that we don't know of an example where this would happen, and from first principles such things shouldn't happen often, but it's also true that we have created a file that's more likely to silently trigger errors due to a mismatch between the header info and the actual properties of the file.
Again, no easy decision here, but overall, on reflection, I think the principle of maintaining the integrity of our files outweighs the arguments in favour of maintaining consistency with the Broad. I'm also inclined to weigh the arguments from Cas and Katie more strongly here, as this is a decision that falls more heavily (though certainly not entirely!) in the analysis domain. So overall, my call here is to keep the headers in our canonical, sharable CRAMs as pointing to the exact reference used in their alignment, and to only alter copies of these if it proves necessary for a specific analysis.
CRAMs aligned during the NAGIM project (corresponding Terra workspace) were using
dragen_functional_equivalence_mode
and hence used the masked DRAGEN reference atgs://gcp-public-data--broad-references/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta
: https://github.com/broadinstitute/warp/blob/a3b438422691f447032038be52c74639dafe914f/tasks/broad/DragmapAlignment.wdl#L58However, the subsequent
MergeBam
alignment step points to the unmasked reference atgs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
: https://github.com/broadinstitute/warp/blob/a3b438422691f447032038be52c74639dafe914f/tasks/broad/DragmapAlignment.wdl#L70Effectively, this means that read alignment used the masked reference, but the resulting CRAM header points to the unmasked reference. Ostensibly this was done to increase compatibility with CRAMs that were generated using non-DRAGEN workflows (to prevent MD5 reference mismatches as explained in https://github.com/populationgenomics/production-pipelines/pull/146), but we're still waiting for feedback from the Broad on the rationale: https://github.com/broadinstitute/warp/issues/871
While our production pipelines were switched over to use DRAGEN mode alignment and variant calling as well, the alignment steps used the masked reference consistently (including the merging step). Consequently, some of our CRAMs (the ones resulting from NAGIM, e.g. TOB-WGS) have the unmasked reference set in the header, while others (resulting from running our own pipeline) point to the masked reference. This inconsistency creates confusion (e.g. https://centrepopgen.slack.com/archives/C030X7WGFCL/p1669180382708679) and is likely to also create downstream problems when e.g. trying to federate CRAMs with gnomAD down the track.
Note that variant calling results (i.e. GVCFs) are not affected by this. In either case, the actual alignment of reads has been performed on the masked reference.
Proposal of steps to result in a consistent set of CRAMs: