uclahs-cds / pipeline-recalibrate-BAM

Nextflow pipeline to perform Indel Realignment and Base Quality Score Recalibration
https://uclahs-cds.github.io/pipeline-recalibrate-BAM/
GNU General Public License v2.0

Make output deterministic, add NFTest case #36

Closed · nwiltsie closed this 11 months ago

nwiltsie commented 11 months ago

This PR makes one change to the execution logic of this pipeline: the run_MergeSamFiles_Picard process now sorts the input BAMs before merging them. This ensures that the output is byte-for-byte deterministic and easier to test.
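In sketch form (simplified names and command, not the actual process definition from this pipeline), the idea is just to sort the staged input paths before building the Picard `--INPUT` arguments:

```nextflow
// Simplified sketch of the sorting idea; not the pipeline's actual
// run_MergeSamFiles_Picard definition.
process run_MergeSamFiles_Picard {
    input:
    path(bams)

    output:
    path('merged.bam')

    script:
    // Sort the staged BAM paths lexicographically so every run builds the
    // --INPUT arguments, and therefore the merged header, in the same order.
    def sorted_inputs = bams.collect { it.toString() }.sort().collect { "--INPUT ${it}" }.join(' ')
    """
    picard MergeSamFiles \\
        ${sorted_inputs} \\
        --SORT_ORDER coordinate \\
        --OUTPUT merged.bam
    """
}
```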

In addition to that change I added one NFTest case to run an A-mini BAM through the pipeline. @yashpatel6, I duplicated your test inputs from /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/ - please let me know if there'd be something more appropriate.

I did copy /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/input/yaml/single_test_input.yaml into this repository (although the referenced BAM remains on /hot/); logistically that feels less fragile to me than pointing to an unversioned text file.

Infrastructure-wise, I created the following expected output directory, which is referenced in nftest.yml:

/hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512
└── BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512

~As the BAMs are now deterministic, I am also only comparing the .sha512 files in the NFTest case. That feels a little weird, but since we want to test that the hashfiles are created successfully and they depend upon the BAMs/BAIs, it seems reasonable.~ As of c7a86b6 the test verifies that all four of these files are created successfully.
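Roughly, the test wiring looks like the sketch below; treat the key names as illustrative, since they are reconstructed from memory rather than copied from this repository's nftest.yml:

```yaml
# Hypothetical sketch only: key names may not match the actual NFTest
# configuration used by this repository.
cases:
  - name: a_mini_n2
    nf_script: ./main.nf
    params_file: test/single_test_input.yaml
    asserts:
      - actual: <pipeline output directory>/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
        expect: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
        method: md5
      # ...plus matching assertions for the .bam.bai and the two .sha512 files
```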

Testing Results

$ tail /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/log-nftest-20231129T184234Z.log
2023-11-29 18:52:11,234 - NextFlow - INFO -
2023-11-29 18:52:11,343 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam
2023-11-29 18:52:23,732 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,735 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai
2023-11-29 18:52:23,824 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,827 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.sha512
2023-11-29 18:52:23,832 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,834 - NFTest - DEBUG - md5 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/nwiltsie_add_nftest/A-mini-n2/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512 /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/output/BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam.bai.sha512
2023-11-29 18:52:23,837 - NFTest - DEBUG - Assertion passed
2023-11-29 18:52:23,837 - NFTest - INFO -  [ succeed ]


yashpatel6 commented 11 months ago

The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the @PG lines in the header, so sorting keeps the header in the same order every time, which is good!
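(For reference, one easy way to see that ordering is to dump the merged header and look at the @PG lines:)

```bash
# The @PG line order in the merged header mirrors the order in which the
# input BAMs were passed to MergeSamFiles.
samtools view -H BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam | grep '^@PG'
```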

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

nwiltsie commented 11 months ago

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

Yeah, there's weirdness either way... I think the most correct thing to do would be to test all of the files. That way we're explicitly checking that all of the files are correctly created. I'll push those changes.

nwiltsie commented 11 months ago

sorting always has the header in the same order, which is good!

Ah, yes, I forgot to say that the files are lexicographically sorted - so we get 1, 10, 11, ..., 19, 2, 20, ..., 22, 3, ..., 9, M, X, Y. As you said, all that matters is that we get the same order.
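A tiny Groovy illustration of that string ordering:

```groovy
// Illustration only: plain string (lexicographic) sort of contig-style names.
def names = (1..22).collect { it.toString() } + ['M', 'X', 'Y']
println names.sort(false)
// [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, M, X, Y]
```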

tyamaguchi-ucla commented 11 months ago

The test sample is totally fine! It's interesting that the order is the only thing that was throwing off the checksum; it looks like the order of the files to be merged determines the order of the @PG lines in the header, so sorting keeps the header in the same order every time, which is good!

It does feel a little odd to be comparing checksums of checksum files, though it might be clearer for newer users working on the pipeline if the checksums of the actual BAM files were compared. The files should be small enough that computing their checksums doesn't take too much time.

I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?

Since we use -SORT_ORDER coordinate, the command itself shouldn’t affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e. BAM without its header) eventually. samtools stats can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.
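As a rough illustration of that idea (not part of this pipeline): samtools stats already emits a CHK line containing CRC32 checksums over read names, sequences, and base qualities, which is independent of the header content:

```bash
# Illustrative only: the CHK line checksums read names, sequences, and base
# qualities, so it is unaffected by @HD/@PG header differences.
samtools stats BWA-MEM2-2.2.1_GATK-4.2.4.1_A-mini_S2-v1.1.5.bam | grep '^CHK'
```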

yashpatel6 commented 11 months ago

I was thinking about this too. So, is there no stochasticity in IR and BQSR, or does it depend on the sample?

Generally, at least with the test samples, the IR and BQSR steps seem to be deterministic in terms of the recalibration table generated and the IR targets identified; I'm not sure how that would scale up to larger or different samples, though.

Since we use -SORT_ORDER coordinate, the command itself shouldn’t affect the alignment. On a related note, we might want to consider managing checksums for the alignment (i.e. BAM without its header) eventually. samtools stats can generate these checksums, and Sorel's currently incorporating it into the SQC pipeline.

That would make sense; the alignment itself should remain the same, and having a separate checksum for just the alignment would be useful to have.