**Closed** — nwiltsie closed this 11 months ago
The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the `@PG` lines in the header, so sorting always has the header in the same order, which is good!

It does feel a little odd to be comparing checksums of checksum files, though it may be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.
> It does feel a little odd to be comparing checksums of checksum files, though it may be clearer in terms of newer users working on the pipeline if the checksums of the actual BAM files were compared? The file should be small enough to not require too much time to compute the checksums
Yeah, there's weirdness either way... I think the most correct thing to do would be to test all of the files. That way we're explicitly checking that all of the files are correctly created. I'll push those changes.
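The fix under discussion can be sketched in a few lines of shell: sort the BAM list before handing it to Picard MergeSamFiles, so the `@PG` lines it copies into the merged header always land in the same order. File names here are illustrative, not the pipeline's actual inputs.

```shell
# Hypothetical input list, in whatever order the channel happened to emit it.
inputs="sampleB.bam sampleA.bam sampleC.bam"

# Sort the list deterministically (C locale, byte-wise) before merging.
sorted=$(printf '%s\n' $inputs | LC_ALL=C sort | tr '\n' ' ')
echo "$sorted"
# The merge itself would then consume the sorted list, e.g.:
# picard MergeSamFiles $(printf -- '-INPUT %s ' $sorted) -OUTPUT merged.bam
```

Pinning `LC_ALL=C` matters: without it, `sort` order can vary between environments, which would reintroduce the nondeterminism.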
> sorting always has the header in the same order, which is good!
Ah, yes, I forgot to say that the files are lexicographically sorted, so we get 1, 10, 11, ..., 19, 2, 20, ..., 22, 3, ..., 9, M, X, Y. As you said, all that matters is that we get the same order.
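The lexicographic ordering described above is easy to reproduce: strings are compared byte by byte, so "10" sorts before "2", and letters sort after digits in the C locale.

```shell
# Chromosome names sorted as strings, not numbers.
printf '%s\n' 1 2 3 9 10 11 19 20 22 M X Y | LC_ALL=C sort
# → 1 10 11 19 2 20 22 3 9 M X Y (one per line)
```

Surprising to read, but perfectly fine for this purpose: the order only needs to be stable, not natural.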
> The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the `@PG` lines in the header so sorting always has the header in the same order, which is good!
>
> It does feel a little odd to be comparing checksums of checksum files, though it may be clearer in terms of newer users working on the pipeline if the checksums of the actual BAM files were compared? The file should be small enough to not require too much time to compute the checksums
I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?
Since we use `-SORT_ORDER coordinate`, the command itself shouldn't affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e., the BAM without its header) eventually. `samtools stats` can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.
> I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?
Generally, at least with the test samples, the IR and BQSR steps seem to be deterministic in terms of the recalibration table generated and the IR targets identified; I'm not sure how that would scale to larger or different samples, though.
> Since we use `-SORT_ORDER coordinate`, the command itself shouldn't affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e. BAM without its header) eventually. `samtools stats` can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.
That would make sense; the alignment itself should remain the same, and having a separate checksum for just the alignment would be useful to have.
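The header-independent checksum idea can be sketched with a toy example. The SAM-style lines below are illustrative; with a real BAM one could pipe `samtools view input.bam` (which emits records only, no header) into `sha512sum`, rather than the `grep` stand-in used here.

```shell
# Two files with identical alignment records but differently ordered @PG
# header lines (the exact situation that was breaking the whole-file checksum).
printf '@PG\tID:bwa\n@PG\tID:picard\nread1\t0\tchr1\t100\n' > a.sam
printf '@PG\tID:picard\n@PG\tID:bwa\nread1\t0\tchr1\t100\n' > b.sam

# Hash only the records (lines not starting with '@'): both hashes match.
grep -v '^@' a.sam | sha512sum | cut -d ' ' -f 1
grep -v '^@' b.sam | sha512sum | cut -d ' ' -f 1   # same hash as above
```

This is only a sketch of the concept; the built-in checksums from `samtools stats` mentioned above would be the more robust route for the SQC pipeline.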
This PR makes one change to the execution logic of this pipeline: the `run_MergeSamFiles_Picard` process now sorts the input BAMs before merging them. This ensures that the output is byte-for-byte deterministic and easier to test.

In addition to that change, I added one NFTest case to run an A-mini BAM through the pipeline. @yashpatel6, I duplicated your test inputs from `/hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/` - please let me know if there'd be something more appropriate.

I did copy `/hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/input/yaml/single_test_input.yaml` into this repository (although the referenced BAM remains on `/hot/`); logistically that feels less fragile to me than pointing to an unversioned text file.

Infrastructure-wise, I created the following expected output directory, which is referenced in `nftest.yml`:
~~As the BAMs are now deterministic, I am also only comparing the `.sha512` files in the NFTest case. That feels a little weird, but as we want to test that the hash files are created successfully and they depend upon the BAMs/BAIs, it seems reasonable.~~ As of c7a86b6 the test verifies that all four of these files are created successfully.

Testing Results
Checklist
- [x] I have read the code review guidelines and the code review best practice on GitHub check-list.
- [x] I have reviewed the Nextflow pipeline standards.
- [x] The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
- [x] I have set up or verified the branch protection rule following the GitHub standards before opening this pull request.
- [ ] I have added my name to the contributors listings in the `manifest` block in the `nextflow.config` as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
- [x] I have added the changes included in this pull request to the `CHANGELOG.md` under the next release version or unreleased, and updated the date.
- [ ] I have updated the version number in the `metadata.yaml` and `manifest` block of the `nextflow.config` file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about the new version number and discuss it with the infrastructure team in this PR.)
- [x] I have tested the pipeline on at least one A-mini sample.