nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!
https://nf-co.re/modules
MIT License

new File breaks cloud execution in bcl_demultiplex #5612

Closed by k1sauce 4 months ago

k1sauce commented 6 months ago

Have you checked the docs?

Description of the bug

This bug relates to the bcl_demultiplex sub workflow.

new File() will not work with cloud storage such as S3 (see https://github.com/nf-core/tools/pull/354 for reference); file() needs to be used instead.
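
For illustration, a minimal sketch of the difference, assuming the file being written is a skipped-FASTQs log under params.outdir (the path is illustrative, not the actual subworkflow code):

// Breaks on object storage: java.io.File only understands local filesystem paths
// def skipped_log = new File("${params.outdir}/skipped_fastqs.log")

// Works with s3://, az://, gs:// and local paths alike: file() returns a Nextflow Path
// backed by the matching filesystem provider
def skipped_log = file("${params.outdir}/skipped_fastqs.log")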

Command used and terminal output

No response

Relevant files

No response

System information

No response

glichtenstein commented 6 months ago

@k1sauce @Aratz @edmundmiller Hey, how are you doing?

An idea to solve this issue is to have a parameter called --log_skipped_fastqs so that the user can decide if the file gets created or not.

nextflow run glichtenstein/demultiplex -r 3806b6a \
  --input nf-core-samplesheet.csv \
  --outdir /scratch/output \
  -profile docker \
  -work-dir /scratch/workdir \
  --log_skipped_fastqs=false

If set to false, the file named skipped_fastqs.log won't be created; if set to true (the default), the file is created in the outdir path. The invalid FASTQs will be skipped regardless, so that the workflow can continue with Falco.
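
A minimal sketch of how such a flag could gate the log inside the subworkflow (parameter and channel names are illustrative, not the actual PR code):

// Only write the log of skipped FASTQs when the user asks for it
if (params.log_skipped_fastqs) {
    ch_skipped_fastqs
        .map { meta, fastq -> "${meta.id}\t${fastq.name}" }
        .collectFile(name: 'skipped_fastqs.log', storeDir: params.outdir, newLine: true)
}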

What do you think? I tested it and opened two pull requests, one for demultiplex.nf and one for the module counterpart:

nf-core/modules/pull/5638

nf-core/demultiplex/pull/190

glichtenstein commented 6 months ago

@k1sauce how are you doing? Would it be possible to ask you to run a test with the branch I am working on, to check if it fixes the S3 issue? The idea behind it is that one can set a flag, --log_skipped_fastqs=false, so that the log file is not created. The empty FASTQs will still be skipped so the workflow can complete, but the log file containing a list of invalid FASTQs won't be created in the --outdir path. Currently, I cannot test in a cloud AWS S3 environment. I've left you an example in the PR https://github.com/nf-core/demultiplex/pull/190; you may run it using the latest commit revision id (fffb21d). I.e.,

Testing

--log_empty_fastqs=true:

nextflow run glichtenstein/demultiplex -r 7d9538e --input nf-core-samplesheet.csv --outdir /data/scratch/iseq-DI/output -profile docker -work-dir /data/scratch/iseq-DI/workdir -resume --skip_tools fastp --log_empty_fastqs=true

--log_empty_fastqs=false:

nextflow run glichtenstein/demultiplex -r 7d9538e --input nf-core-samplesheet.csv --outdir /data/scratch/iseq-DI/output -profile docker -work-dir /data/scratch/iseq-DI/workdir -resume --skip_tools fastp --log_empty_fastqs=false

k1sauce commented 6 months ago

@glichtenstein Thanks for working on this. I spent some time thinking more about this too, and I think I would like to tackle it a different way. Here are my thoughts:

  1. I do not think we should be writing a log file in the sub workflow; it does not fit the idea of a sub workflow. As a user, you expect everything written by a sub workflow to come from a process publishDir, and anything else the sub workflow generates to be available via the workflow emit. That way the user can decide what to do with those emits.
  2. I think the bool in the meta variable to signal to downstream applications that the file size is 0 is not necessary either (sorry for suggesting it). We should probably just pass along all files and let the downstream consumers of the channel implement their own filters. This also makes sense when you mix/merge the downstream FASTQ channels and want to filter on file size. The alternative of calculating the file size in the sub workflow would lead to code duplication across the demux sub workflows.
  3. Perhaps this is a different issue, but I also do not approve of the way the read group parsing is done, as it stages all the files on the head node. If read group parsing is done at all, it should be done in a process (a minimal sketch follows this list). Running things on the head node means that all the FASTQ files need to be staged there, which becomes a bottleneck when there are thousands of FASTQ files. (This code is also duplicated across the demux sub workflows.)
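
A minimal sketch of what moving the header parsing into a process could look like (process name and channel shape are illustrative, not the actual module code):

// Reads only the first FASTQ header line on a worker node,
// so the files are staged there instead of on the head node
process FASTQ_FIRST_HEADER {
    input:
    tuple val(meta), path(fastq)

    output:
    tuple val(meta), stdout

    script:
    """
    zcat ${fastq} | head -n 1
    """
}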

That being said, I also have a branch where I am testing some changes that would address this. If you are in agreement, I can open a PR after a bit more development and we can review that one instead?

k1sauce commented 6 months ago

@glichtenstein I don't want to get ahead of myself but I am thinking of something simple like this


//
// Demultiplex Illumina BCL data using bcl-convert or bcl2fastq
//

include { BCLCONVERT } from "../../../modules/nf-core/bclconvert/main"
include { BCL2FASTQ  } from "../../../modules/nf-core/bcl2fastq/main"

workflow BCL_DEMULTIPLEX {
    take:
        ch_flowcell     // [[id:"", lane:""],samplesheet.csv, path/to/bcl/files]
        demultiplexer   // bclconvert or bcl2fastq

    main:
        ch_versions      = Channel.empty()
        ch_fastq         = Channel.empty()
        ch_reports       = Channel.empty()
        ch_stats         = Channel.empty()
        ch_interop       = Channel.empty()

        // Split flowcells into separate channels containing run as tar and run as path
        // https://nextflow.slack.com/archives/C02T98A23U7/p1650963988498929
        ch_flowcell
            .branch { meta, samplesheet, run ->
                tar: run.toString().endsWith(".tar.gz")
                dir: true
            }.set { ch_flowcells }

        ch_flowcells.tar
            .multiMap { meta, samplesheet, run ->
                samplesheets: [ meta, samplesheet ]
                run_dirs: [ meta, run ]
            }.set { ch_flowcells_tar }

        // Runs when run_dir is a tar archive
        // Re-join the metadata and the untarred run directory with the samplesheet
        ch_flowcells_tar_merged = ch_flowcells_tar
                                    .samplesheets
                                    .join( ch_flowcells_tar.run_dirs )

        // Merge the two channels back together
        ch_flowcells = ch_flowcells.dir.mix(ch_flowcells_tar_merged)

        // MODULE: bclconvert
        // Demultiplex the bcl files
        if (demultiplexer == "bclconvert") {
            BCLCONVERT( ch_flowcells )
            ch_fastq    = ch_fastq.mix(BCLCONVERT.out.fastq)
            ch_interop  = ch_interop.mix(BCLCONVERT.out.interop)
            ch_reports  = ch_reports.mix(BCLCONVERT.out.reports)
            ch_versions = ch_versions.mix(BCLCONVERT.out.versions)
        }

        // MODULE: bcl2fastq
        // Demultiplex the bcl files
        if (demultiplexer == "bcl2fastq") {
            BCL2FASTQ( ch_flowcells )
            ch_fastq    = ch_fastq.mix(BCL2FASTQ.out.fastq)
            ch_interop  = ch_interop.mix(BCL2FASTQ.out.interop)
            ch_reports  = ch_reports.mix(BCL2FASTQ.out.reports)
            ch_stats    = ch_stats.mix(BCL2FASTQ.out.stats)
            ch_versions = ch_versions.mix(BCL2FASTQ.out.versions)
        }

        // Generate meta for each fastq
        ch_fastq_with_meta = ch_fastq
            .map{ fc_meta, fastq -> 
                def meta = [
                    "id": fastq.getSimpleName().toString() - ~/_R[0-9]_001.*$/,
                    "samplename": fastq.getSimpleName().toString() - ~/_S[0-9]+.*$/,
                    "fcid": fc_meta.id,
                    "lane": fc_meta.lane,
                    "empty": (fastq.size() == 0)
                ]
                [meta, fastq] 
            }
            .groupTuple(by: [0])
            .map { meta, fastq -> // Add meta.single_end
                meta.single_end = fastq.size() == 1
                return [meta, fastq.flatten()]
            }

    emit:
        fastq    = ch_fastq_with_meta
        reports  = ch_reports
        stats    = ch_stats
        interop  = ch_interop
        versions = ch_versions
}

k1sauce commented 6 months ago

@glichtenstein Ok how does this look https://github.com/nf-core/modules/pull/5720

If it's ok with you I think this may be the right approach. Then we can add a filter on file size before Falco in the demultiplex workflow.
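
For instance, a minimal sketch of such a filter in the demultiplex workflow, assuming the subworkflow's fastq emit and the existing FALCO process (exact names are illustrative):

// Drop FASTQ sets that only contain empty files before running QC
ch_fastq_for_qc = BCL_DEMULTIPLEX.out.fastq
    .filter { meta, reads -> reads.any { it.size() > 0 } }

FALCO ( ch_fastq_for_qc )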

glichtenstein commented 5 months ago

@k1sauce, how are you? I've worked on this proposal to fix the new File() issue in PR 5781, using file() as suggested. If possible it needs to be tested on S3, as I do not have access to that environment: https://github.com/nf-core/modules/pull/5781. If this fixes it, we can look at refactoring demultiplex.nf and at the file staging issues in another branch if you would like.

glichtenstein commented 5 months ago

@matthdsm @k1sauce @SPPearce How are you guys? Can we decide on which route to take to fix the reported issue? I've been using my branch so far when possible, but if it's going to be dropped and closed I want to know, so that I can deliver reproducible results to the core customers; my supervisor is not keen on using dev branches. In brief, it may be fixed by https://github.com/nf-core/modules/pull/5720 or https://github.com/nf-core/modules/pull/5781

I am getting this error quite often when empty fastqs appear while running nf-core/demultiplex 1.4.1:

executor >  local (1)
[3d/f5bc64] NFC…CLCONVERT (22F37MLT3.null) | 0 of 1
[-        ] NFC…ULTIPLEX:DEMULTIPLEX:FALCO -
[-        ] NFC…LTIPLEX:DEMULTIPLEX:MD5SUM -
[-        ] NFC…USTOM_DUMPSOFTWAREVERSIONS -
[-        ] NFC…TIPLEX:DEMULTIPLEX:MULTIQC -
ERROR ~ Cannot invoke method startsWith() on null object

 -- Check script '/home/hw-m-sylvesterportal/.nextflow/assets/nf-core/demultiplex/./workflows/../subworkflows/nf-core/bcl_demultiplex/main.nf' at line: 125 or see '.nextflow.log' file for more details

For the sake of time in production, when nf-core fails on us we usually end up running bclconvert in BaseSpace to get the fastqs to the customer, without Falco or MultiQC. So if we could know which files have wrong barcodes after the demux, it would help us a lot. I see the point: if you don't like a new log file being created, we could rely on the MultiQC and bclconvert reports to track the bad fastqs instead, but the workflow needs to complete so we can get that information straightforwardly; right now one has to dig into the workDir and search for the files manually.

k1sauce commented 5 months ago

@glichtenstein Hey, I have updated the test but now I need to debug the GitHub Actions workflow. I am having trouble with the test in Docker even though the Singularity test passes. You can take a look here and let me know if you have any ideas: https://github.com/nf-core/modules/actions/runs/9782057279/job/27007581785?pr=5720

Also, when I run the Docker test locally it passes, so I'm inferring it has something to do with the GitHub Action.

k1sauce commented 4 months ago

@glichtenstein It should be ready to go now; see my comment here on why the test was failing: https://github.com/nf-core/modules/pull/5720#issuecomment-2215033408