seqeralabs / nf-sentieon

POC Nextflow pipeline to run Sentieon software
Mozilla Public License 2.0

Name collision in the SENTIEON_BWAMEM process #7

Closed by DonFreed 2 years ago

DonFreed commented 2 years ago

Description of the bug

Running the example pipeline on a local machine results in the following error:

-[nf-sentieon] Pipeline completed with errors-
Error executing process > 'NF_SENTIEON:SENTIEON:SENTIEON_BWAMEM (NA12878.H88WKADXX.1.CGATGT-2)'

Caused by:
  Process `NF_SENTIEON:SENTIEON:SENTIEON_BWAMEM` input file name collision -- There are multiple input files for each of the following file names: genome.fa

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Command used and terminal output

$ nextflow run -resume ./main.nf --input assets/samplesheet_test_illumina.csv --genome GRCh37

Relevant files

nextflow.log

System information

DonFreed commented 2 years ago

I was finally able to track this down.

The igenomes GRCh37 WholeGenomeFasta folder provides the following files:

$ aws s3 ls s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/
2017-04-13 04:01:50       4237 GenomeSize.xml
2017-04-13 04:01:51       3950 genome.dict
2017-04-13 04:01:52 3147288982 genome.fa
2017-04-13 04:02:00        714 genome.fa.fai
2017-04-13 04:02:00      49152 genome.fa.index

While the BWAIndex provides the following:

$ aws s3 ls s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/

                           PRE version0.5.x/
                           PRE version0.6.0/
2017-04-13 03:42:51 3147288982 genome.fa
2017-04-13 03:45:40       6563 genome.fa.amb
2017-04-13 03:45:41        870 genome.fa.ann
2017-04-13 03:45:41 3095694072 genome.fa.bwt
2017-04-13 03:45:58  773923497 genome.fa.pac
2017-04-13 03:46:11 1547847040 genome.fa.sa

The igenomes config has:

params {
    // illumina iGenomes reference file paths
    genomes {
        'GRCh37' {
            fasta       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
            bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"

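Both the fasta input and the bwa index input of the SENTIEON_BWAMEM process resolve to a file named "genome.fa" (one from WholeGenomeFasta, one from BWAIndex). Nextflow stages input files into the task work directory under their own file names, so two different inputs with the same name cause the name collision.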

The bwa key in the igenomes config can be updated to stage the directory containing the reference index, rather than the genome.fa file inside it. This also makes the igenomes setting consistent with the SENTIEON_BWAINDEX process: in both cases the index is then provided as a directory containing the bwa index files for the reference genome.
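
As a rough sketch of that change, the GRCh37 entry might look like the fragment below. Only the two keys quoted above are shown. The trailing-slash directory path for bwa is my assumption about how the fix would be written, not a copy of the final commit. It also assumes that SENTIEON_BWAMEM locates the index prefix inside the staged directory.

```groovy
params {
    // illumina iGenomes reference file paths
    genomes {
        'GRCh37' {
            // Unchanged: the reference FASTA is staged as a single file named genome.fa
            fasta = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
            // Assumed fix: point at the BWAIndex directory instead of BWAIndex/genome.fa,
            // so the input is staged as "BWAIndex" and no second "genome.fa" is staged
            bwa   = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/"
        }
    }
}
```

With this layout the work directory would contain genome.fa (from the fasta key) next to a BWAIndex directory, so the two inputs no longer share a file name.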