Closed k1sauce closed 4 months ago
@k1sauce @Aratz @edmundmiller Hey, how are you doing?
An idea to solve this issue is to have a parameter called --log_skipped_fastqs so that the user can decide if the file gets created or not.
nextflow run glichtenstein/demultiplex -r 3806b6a \
--input nf-core-samplesheet.csv \
--outdir /scratch/output \
-profile docker \
-work-dir /scratch/workdir \
--log_skipped_fastqs=false
If set to false, the file named skipped_fastqs.log won't be created; if set to true (the default), the file is created in the --outdir path.
The invalid fastqs will be skipped regardless, so that the workflow can continue with Falco.
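A minimal sketch of how such a flag could work (the channel name `ch_fastq` and the exact parameter wiring are hypothetical illustrations, not the actual PR code):

```nextflow
// Hypothetical sketch: separate valid and empty fastqs, and only write the
// skipped-fastqs log when the proposed --log_skipped_fastqs flag is true.
ch_fastq
    .branch { meta, fastq ->
        valid:   fastq.size() > 0
        skipped: true
    }
    .set { ch_checked }

if (params.log_skipped_fastqs) {
    ch_checked.skipped
        .collectFile(name: 'skipped_fastqs.log', newLine: true, storeDir: params.outdir) {
            meta, fastq -> fastq.name
        }
}

// Only ch_checked.valid would continue downstream (e.g. into Falco).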
What do you think? I tested it and opened two pull requests: one for demultiplex.nf and one for the module counterpart:
@k1sauce how are you doing? Would it be possible to ask you to run a test with the branch I am working on, to check if it fixes the S3 issue? The idea behind it is that one can set a flag, --log_skipped_fastqs=false,
so that the log file is not created. The empty fastqs will still be skipped so the workflow can complete, but the log file containing a list of invalid fastqs won't be created in the --outdir
path. Currently, I cannot test in a cloud AWS S3 environment. I've left you an example in the PR https://github.com/nf-core/demultiplex/pull/190; you may run it using the latest commit revision id (fffb21d). I.e.,
With the flag set to true:
nextflow run glichtenstein/demultiplex -r 7d9538e --input nf-core-samplesheet.csv --outdir /data/scratch/iseq-DI/output -profile docker -work-dir /data/scratch/iseq-DI/workdir -resume --skip_tools fastp --log_empty_fastqs=true
With the flag set to false:
nextflow run glichtenstein/demultiplex -r 7d9538e --input nf-core-samplesheet.csv --outdir /data/scratch/iseq-DI/output -profile docker -work-dir /data/scratch/iseq-DI/workdir -resume --skip_tools fastp --log_empty_fastqs=false
@glichtenstein Thanks for working on this. I spent some time thinking more about this too, and I think I would like to tackle it a different way. Here are my thoughts:
So that being said, I also have a branch where I am testing out some changes that would address this. If you are in agreement, I can open a PR after a bit more development and we can review that one instead?
@glichtenstein I don't want to get ahead of myself, but I am thinking of something simple like this:
//
// Demultiplex Illumina BCL data using bcl-convert or bcl2fastq
//
include { BCLCONVERT } from "../../../modules/nf-core/bclconvert/main"
include { BCL2FASTQ  } from "../../../modules/nf-core/bcl2fastq/main"

workflow BCL_DEMULTIPLEX {
    take:
        ch_flowcell    // [[id:"", lane:""], samplesheet.csv, path/to/bcl/files]
        demultiplexer  // bclconvert or bcl2fastq

    main:
        ch_versions = Channel.empty()
        ch_fastq    = Channel.empty()
        ch_reports  = Channel.empty()
        ch_stats    = Channel.empty()
        ch_interop  = Channel.empty()

        // Split flowcells into separate channels containing run as tar and run as path
        // https://nextflow.slack.com/archives/C02T98A23U7/p1650963988498929
        ch_flowcell
            .branch { meta, samplesheet, run ->
                tar: run.toString().endsWith(".tar.gz")
                dir: true
            }.set { ch_flowcells }

        ch_flowcells.tar
            .multiMap { meta, samplesheet, run ->
                samplesheets: [ meta, samplesheet ]
                run_dirs: [ meta, run ]
            }.set { ch_flowcells_tar }

        // Runs when run_dir is a tar archive
        // Re-join the metadata and the untarred run directory with the samplesheet
        ch_flowcells_tar_merged = ch_flowcells_tar
            .samplesheets
            .join( ch_flowcells_tar.run_dirs )

        // Merge the two channels back together
        ch_flowcells = ch_flowcells.dir.mix(ch_flowcells_tar_merged)

        // MODULE: bclconvert
        // Demultiplex the bcl files
        if (demultiplexer == "bclconvert") {
            BCLCONVERT( ch_flowcells )
            ch_fastq    = ch_fastq.mix(BCLCONVERT.out.fastq)
            ch_interop  = ch_interop.mix(BCLCONVERT.out.interop)
            ch_reports  = ch_reports.mix(BCLCONVERT.out.reports)
            ch_versions = ch_versions.mix(BCLCONVERT.out.versions)
        }

        // MODULE: bcl2fastq
        // Demultiplex the bcl files
        if (demultiplexer == "bcl2fastq") {
            BCL2FASTQ( ch_flowcells )
            ch_fastq    = ch_fastq.mix(BCL2FASTQ.out.fastq)
            ch_interop  = ch_interop.mix(BCL2FASTQ.out.interop)
            ch_reports  = ch_reports.mix(BCL2FASTQ.out.reports)
            ch_stats    = ch_stats.mix(BCL2FASTQ.out.stats)
            ch_versions = ch_versions.mix(BCL2FASTQ.out.versions)
        }

        // Generate meta for each fastq
        ch_fastq_with_meta = ch_fastq
            .map { fc_meta, fastq ->
                def meta = [
                    "id": fastq.getSimpleName().toString() - ~/_R[0-9]_001.*$/,
                    "samplename": fastq.getSimpleName().toString() - ~/_S[0-9]+.*$/,
                    "fcid": fc_meta.id,
                    "lane": fc_meta.lane,
                    "empty": (fastq.size() == 0)
                ]
                [ meta, fastq ]
            }
            .groupTuple(by: [0])
            .map { meta, fastq -> // Add meta.single_end
                meta.single_end = fastq.size() == 1
                return [ meta, fastq.flatten() ]
            }

    emit:
        fastq    = ch_fastq_with_meta
        reports  = ch_reports
        stats    = ch_stats
        interop  = ch_interop
        versions = ch_versions
}
@glichtenstein Ok how does this look https://github.com/nf-core/modules/pull/5720
If it's ok with you I think this may be the right approach. Then we can add a filter on file size before Falco in the demultiplex workflow.
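The file-size filter before Falco could look something like this in the demultiplex workflow, reusing the meta.empty flag set in the subworkflow; a sketch only, not the actual PR code:

```nextflow
// Drop empty fastqs before QC, using the meta.empty flag set in BCL_DEMULTIPLEX
ch_fastq_ok = BCL_DEMULTIPLEX.out.fastq
    .filter { meta, fastq -> !meta.empty }

FALCO ( ch_fastq_ok )
```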
@k1sauce, how are you? I've worked on this proposal to fix the new File()
issue in PR 5781, using file()
as suggested. If possible, it needs to be tested on S3, as I do not have access to that environment: https://github.com/nf-core/modules/pull/5781. If this fixes it, we can look to refactor demultiplex.nf and look at the file staging issues in another branch if you would like.
@matthdsm @k1sauce @SPPearce How are you guys? Can we decide on which route to take to fix the reported issue? I've been using my branch so far when possible, but if it's going to be dropped and closed I want to know, so that I can deliver reproducible results to the core customers; my supervisor is not keen on using dev branches. So in brief, it may be fixed by https://github.com/nf-core/modules/pull/5720 or https://github.com/nf-core/modules/pull/5781
I am getting this error quite often when empty fastqs appear while running nf-core/demultiplex 1.4.1:
executor >  local (1)
[3d/f5bc64] NFC…CLCONVERT (22F37MLT3.null) | 0 of 1
[-        ] NFC…ULTIPLEX:DEMULTIPLEX:FALCO -
[-        ] NFC…LTIPLEX:DEMULTIPLEX:MD5SUM -
[-        ] NFC…USTOM_DUMPSOFTWAREVERSIONS -
[-        ] NFC…TIPLEX:DEMULTIPLEX:MULTIQC -
ERROR ~ Cannot invoke method startsWith() on null object
 -- Check script '/home/hw-m-sylvesterportal/.nextflow/assets/nf-core/demultiplex/./workflows/../subworkflows/nf-core/bcl_demultiplex/main.nf' at line: 125 or see '.nextflow.log' file for more details
For the sake of time in production, when nf-core fails on us we usually end up running bclconvert in BaseSpace to get the fastqs to the customer, without Falco or MultiQC. So knowing which files have wrong barcodes after the demux would help us a lot. I see the point: if you don't like a new log file being created, we could also read the MultiQC and bclconvert reports to track the bad fastqs, but the workflow needs to complete for us to get that info straightforwardly; right now one has to delve into the workDir and search for the files manually.
@glichtenstein Hey, I have updated the test but now I need to debug the GitHub Actions workflow. I am having trouble with the test in Docker even though the Singularity test passes. You can take a look here and let me know if you have any ideas: https://github.com/nf-core/modules/actions/runs/9782057279/job/27007581785?pr=5720
Also, when I run the Docker test locally it passes, so I'm inferring it has something to do with the GitHub Action.
@glichtenstein Should be ready to go now; see my comment here on why the test was failing: https://github.com/nf-core/modules/pull/5720#issuecomment-2215033408
Have you checked the docs?
Description of the bug
This bug relates to the bcl_demultiplex subworkflow. new File() will not work with cloud storage like S3 (see https://github.com/nf-core/tools/pull/354 for reference); file() needs to be used instead.
Command used and terminal output
No response
Relevant files
No response
System information
No response
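The distinction the report describes can be illustrated with a minimal sketch (the bucket path is a hypothetical placeholder):

```nextflow
// java.io.File only understands the local filesystem, so this breaks on S3 paths:
// def samplesheet = new File('s3://my-bucket/run/SampleSheet.csv')

// Nextflow's file() returns a java.nio.file.Path backed by the matching
// filesystem provider (local, s3, gs, ...), so cloud URIs work transparently:
def samplesheet = file('s3://my-bucket/run/SampleSheet.csv')
```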