uclahs-cds / pipeline-recalibrate-BAM

Nextflow pipeline to perform Indel Realignment and Base Quality Score Recalibration
https://uclahs-cds.github.io/pipeline-recalibrate-BAM/
GNU General Public License v2.0

Make output deterministic, add NFTest case #36

Closed · nwiltsie closed this 11 months ago

nwiltsie commented 11 months ago

This PR makes one change to the execution logic of this pipeline: the run_MergeSamFiles_Picard process now sorts the input BAMs before merging them. This ensures that the output is byte-for-byte deterministic and easier to test.
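In sketch form (simplified names and command, not the actual process definition from this pipeline), the idea is just to sort the staged input paths before building the Picard `--INPUT` arguments:

```nextflow
// Simplified sketch of the sorting idea; not the pipeline's actual
// run_MergeSamFiles_Picard definition.
process run_MergeSamFiles_Picard {
    input:
    path(bams)

    output:
    path('merged.bam')

    script:
    // Sort the staged BAM paths lexicographically so every run builds the
    // --INPUT arguments, and therefore the merged header, in the same order.
    def sorted_inputs = bams.collect { it.toString() }.sort().collect { "--INPUT ${it}" }.join(' ')
    """
    picard MergeSamFiles \\
        ${sorted_inputs} \\
        --SORT_ORDER coordinate \\
        --OUTPUT merged.bam
    """
}
```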

In addition to that change I added one NFTest case to run an A-mini BAM through the pipeline. @yashpatel6, I duplicated your test inputs from /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/ - please let me know if there'd be something more appropriate.

I did copy /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/input/yaml/single_test_input.yaml into this repository (although the referenced BAM remains on /hot/); logistically that feels less fragile to me than pointing to an unversioned text file.

Infrastructure-wise, I created the following expected output directory, which is referenced in nftest.yml:

/hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512
└── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512

~As the BAMs are now deterministic, I am also only comparing the .sha512 files in the NFTest case. That feels a little weird, but since we want to test that the hashfiles are created successfully and they depend upon the BAMs/BAIs, it seems reasonable.~ As of c7a86b6 the test verifies that all four of these files are created successfully.
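Roughly, the test wiring looks like the sketch below; treat the key names as illustrative, since they are reconstructed from memory rather than copied from this repository's nftest.yml:

```yaml
# Hypothetical sketch only: key names may not match the actual NFTest
# configuration used by this repository.
cases:
  - name: a_mini_n2
    nf_script: ./main.nf
    params_file: test/single_test_input.yaml
    asserts:
      - actual: <pipeline output directory>/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
        expect: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
        method: md5
      # ...plus matching assertions for the .bam.bai and the two .sha512 files
```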

Testing Results

$ tail /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/log-nftest-20231129T184234Z.log
2023-11-29 18:52:11,234 - NextFlow - INFO -
2023-11-29 18:52:11,343 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
2023-11-29 18:52:23,732 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,735 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai
2023-11-29 18:52:23,824 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,827 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512
2023-11-29 18:52:23,832 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,834 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512
2023-11-29 18:52:23,837 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,837 - NFTest - INFO -  [ succeed ]


yashpatel6 commented 11 months ago

The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the @PG lines in the header, so sorting keeps the header in the same order every time, which is good!
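(For reference, one easy way to see that ordering is to dump the merged header and look at the @PG lines:)

```bash
# The @PG line order in the merged header mirrors the order in which the
# input BAMs were passed to MergeSamFiles.
samtools view -H BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam | grep '^@PG'
```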

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

nwiltsie commented 11 months ago

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

Yeah, there's weirdness either way... I think the most correct thing to do would be to test all of the files. That way we're explicitly checking that all of the files are correctly created. I'll push those changes.

nwiltsie commented 11 months ago

sorting always has the header in the same order, which is good!

Ah, yes, I forgot to say that the files are lexicographically sorted - so we get 1, 10, 11, ..., 19, 2, 20, ..., 22, 3, ..., 9, M, X, Y. As you said, all that matters is that we get the same order.
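A tiny Groovy illustration of that string ordering:

```groovy
// Illustration only: plain string (lexicographic) sort of contig-style names.
def names = (1..22).collect { it.toString() } + ['M', 'X', 'Y']
println names.sort(false)
// [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, M, X, Y]
```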

tyamaguchi-ucla commented 11 months ago

The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the @PG lines in the header, so sorting keeps the header in the same order every time, which is good!

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?

Since we use -SORT_ORDER coordinate, the command itself shouldn’t affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e. BAM without its header) eventually. samtools stats can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.
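As a rough illustration of that idea (not part of this pipeline): samtools stats already emits a CHK line containing CRC32 checksums over read names, sequences, and base qualities, which is independent of the header content:

```bash
# Illustrative only: the CHK line checksums read names, sequences, and base
# qualities, so it is unaffected by @HD/@PG header differences.
samtools stats BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam | grep '^CHK'
```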

yashpatel6 commented 11 months ago

I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?

Generally, at least with the test samples, the IR and BQSR steps seem to be deterministic in terms of the recalibration table generated and the IR targets identified; I'm not sure how that would scale up to larger or different samples, though.

Since we use -SORT_ORDER coordinate, the command itself shouldn’t affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e. BAM without its header) eventually. samtools stats can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.

That would make sense; the alignment itself should remain the same, and having a separate checksum for just the alignment would be useful to have.