RFC: new subworkflow illumina_demultiplex

matthdsm commented 2 years ago

I'd like to suggest a new subworkflow which demultiplexes a sequencer output directory to fastq's and performs basic QC.

flowchart LR
    RUN_DIR([Run Directory]) & SAMPLESHEET([SampleSheet.csv])  --"per lane"-->      BCLCONVERT(bcl-convert)
    BCLCONVERT                                                  -->                 FASTQ([Raw fastq]) & DEMUX_STATS([Per lane demultiplexing stats])
    META([Metadata.csv]) & FASTQ                           -->                 FQ_META([Meta map with fastq paths])
    DEMUX_STATS                           -->                 MULTIQC(MultiQC) --> MQC_REPORT([MultiQC Report])

[x] This module does not exist yet with the nf-core modules list command
[x] There is no open pull request for this module
[x] There is no open issue for this module
[x] If I'm planning to work on this module, I added myself to the Assignees to facilitate tracking who is working on the module

Related to #1484 and #1485

EDIT: Updated flowchart

Midnighter commented 2 years ago

I would suggest splitting the two. There are many sources of FASTQ files (public archives, for example), so it makes sense to have a separate subworkflow for QC that takes FASTQ as input, IMO. Take a look at what we've done in taxprofiler: https://github.com/nf-core/taxprofiler/blob/dev/subworkflows/local/shortread_preprocessing.nf

matthdsm commented 2 years ago

Makes sense to me. I could open another RFC for a fastq_qc and fastq_postprocessing subworkflow

edmundmiller commented 2 years ago

Might also see what nanoseq is doing. But I agree with the fastq_qc and fastq_postprocessing. I'm going to add demultiplexing for Element soon as well so one qc workflow to rule them all is the way.

matthdsm commented 2 years ago

Since bclconvert got merged, I can get cracking on this. There are still a few details I'd like to iron out.

1. Do we take 1 samplesheet per lane and extrapolate the number of lanes from the number of available samplesheets?
2. Do we take 1 samplesheet for the whole run, containing all samples on the flowcell, and use this for all lanes?
3. How do we handle metadata? I'd suggest adding a second meta.csv and merging the correct fastq's with the records in the meta file.
4. How do we handle the metadata with multiple fastq's (e.g. per lane) for one sample? All examples I've seen assume a single fastq per read.
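To make the difference between options 1 and 2 concrete, here is a minimal sketch of how a single run-wide sample sheet could be split into per-lane groups. It assumes a v1-style sheet with a `[Data]` section whose header row contains a `Lane` column; the function name `split_data_by_lane` and the example sheet contents are hypothetical and not part of any nf-core module.

```python
import csv
import io
from collections import defaultdict


def split_data_by_lane(samplesheet_text):
    """Group the rows of the [Data] section by their Lane column.

    Assumption: v1-style sheet with a [Data] section and a 'Lane' column.
    """
    lines = samplesheet_text.splitlines()
    # Locate the [Data] section header (it may carry trailing commas).
    start = next(
        i for i, line in enumerate(lines) if line.strip().startswith("[Data]")
    )
    data_lines = []
    for line in lines[start + 1:]:
        if line.strip().startswith("["):  # a following section begins
            break
        if line.strip():  # skip blank lines
            data_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)))
    by_lane = defaultdict(list)
    for row in reader:
        by_lane[row["Lane"]].append(row)
    return dict(by_lane)


# Hypothetical run-wide sheet: sampleA is sequenced on both lanes.
EXAMPLE = """[Header]
IEMFileVersion,4
[Data]
Lane,Sample_ID,index
1,sampleA,ATCACG
1,sampleB,CGATGT
2,sampleA,ATCACG
"""

lanes = split_data_by_lane(EXAMPLE)
for lane, rows in sorted(lanes.items()):
    print(lane, [row["Sample_ID"] for row in rows])
```

With a split like this, option 2 (one sheet for the whole run) can still feed a per-lane process, at the cost of requiring a `Lane` column in the sheet.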

edmundmiller commented 2 years ago

I've been doing 2 with our bclconvert equivalent (called bases2fastq).

I could see 1 being better on a cluster and letting each node process a lane. But I'm personally afraid we'll get really hacky and possibly break stuff.

3. So a column inside the (nf-core) samplesheet.csv that then points to two CSVs per run? That's going to be a lot of inputs to manage. I'd vote we keep most of it with the sample name, and add extra columns if there's anything else we need.
4. I think we just spit them out and let the downstream pipeline handle them how they want. I.e. nf-core/rnaseq will cat them together.
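As an illustration of what "cat them together" means downstream: gzip files can be concatenated byte for byte, and the result is still a valid gzip stream. The sketch below is not taken from nf-core/rnaseq; `concat_lanes` and the file names are hypothetical, and the demo writes tiny made-up reads to a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_lanes(lane_files, merged_path):
    """Append each per-lane .fastq.gz to merged_path, in the given order."""
    with open(merged_path, "wb") as out:
        for path in lane_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny, made-up per-lane files.
tmp = Path(tempfile.mkdtemp())
lane1 = tmp / "sampleA_S1_L001_R1_001.fastq.gz"
lane2 = tmp / "sampleA_S1_L002_R1_001.fastq.gz"
with gzip.open(lane1, "wb") as fh:
    fh.write(b"@read1\nACGT\n+\nIIII\n")
with gzip.open(lane2, "wb") as fh:
    fh.write(b"@read2\nTTGA\n+\nIIII\n")

merged = tmp / "sampleA_R1.fastq.gz"
concat_lanes([lane1, lane2], merged)

# Python's gzip module reads multi-member files transparently.
with gzip.open(merged, "rb") as fh:
    merged_bytes = fh.read()
print(merged_bytes.decode())
```

For paired-end data, R1 and R2 files need to be concatenated in the same lane order so that read pairs stay in sync.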

This is going to go in nf-core/demultiplex correct? I see the fastp and fastqc being a subworkflow, but I was imagining we'll just feed bcl2fastq, bclconvert and bases2fastq into that subworkflow. I think those are probably fine just being modules by themselves.

matthdsm commented 2 years ago

Hi @Emiller88, Thanks for the comment.

1. Currently we have a WDL workflow that demultiplexes our NovaSeq data following suggestion one. It requires some setup initially, but once it's there it's pretty much compatible with most (if not all) Illumina sequencer output.
2. Seems the most hacky way to me, since we minimise efficiency, and it doesn't account for the case where different lanes have separate pools, which we have known to cause problems in this situation.
3. I'd keep the (nf-core) metadata completely separate from the samplesheet, since the metadata is related to samples and not the flowcell. Perhaps we can introduce 2 (nf-core) metadata samplesheets: one with data per lane and one with sample metadata, which is then "merged" with the resulting fastq's.
4. Agreed, this might be the best way to go.
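The "merge" of sample metadata with the resulting fastq's could look roughly like the sketch below. It relies on the default Illumina output naming (`<name>_S<n>_L<lane>_<R1|R2>_001.fastq.gz`) and emits one row per sample and lane, which also covers the case of several fastq's per sample. The column names `id`, `lane`, `fastq_1` and `fastq_2`, the function names, and the example paths are assumptions for illustration only.

```python
import csv
import io
import re
from collections import defaultdict

# Default bcl2fastq / bcl-convert naming: <name>_S<n>_L<lane>_<R1|R2>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def group_fastqs(paths):
    """Return {sample: {lane: {"R1": path, "R2": path}}}."""
    grouped = defaultdict(lambda: defaultdict(dict))
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = FASTQ_RE.match(name)
        if match is None:
            continue  # file does not follow the pattern (e.g. index reads)
        grouped[match["sample"]][match["lane"]][match["read"]] = path
    return grouped


def merge_with_metadata(meta_csv_text, paths):
    """One output row per sample and lane, carrying the sample's metadata.

    Assumption: the metadata CSV has an 'id' column matching the sample name.
    Fastq's without a metadata record (e.g. Undetermined) are dropped.
    """
    grouped = group_fastqs(paths)
    rows = []
    for record in csv.DictReader(io.StringIO(meta_csv_text)):
        lanes = grouped.get(record["id"], {})
        for lane in sorted(lanes):
            reads = lanes[lane]
            rows.append({
                **record,
                "lane": lane,
                "fastq_1": reads.get("R1", ""),
                "fastq_2": reads.get("R2", ""),
            })
    return rows


# Hypothetical sample metadata and demultiplexing output.
META = "id,tags\nsampleA,tumor\nsampleB,normal\n"
PATHS = [
    "out/sampleA_S1_L001_R1_001.fastq.gz",
    "out/sampleA_S1_L001_R2_001.fastq.gz",
    "out/sampleA_S1_L002_R1_001.fastq.gz",
    "out/sampleB_S2_L001_R1_001.fastq.gz",
    "out/Undetermined_S0_L001_R1_001.fastq.gz",
]

rows = merge_with_metadata(META, PATHS)
for row in rows:
    print(row)
```

Keeping one row per lane leaves the decision to merge or keep lanes separate with the downstream pipeline, in line with point 4.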

As for the QC steps, that was just the "dream" scenario. We've sketched out where we would like to go with this and the schema was a small part of it. In an ideal world we can fast track this into nf-core/demultiplex so we're able to use it downstream. If not I might have to focus on getting a functional prototype in house first before trying to get everything merged into nf-core

matthdsm commented 2 years ago

flowchart LR
LANE_META(["Lane metadata (number + samplesheet)"]) --"per lane"--> bcl-convert
Flowcell --"per lane"--> bcl-convert
bcl-convert --> fastq([fastq])
SAMPLE_META(["Sample metadata (id, basename, rg, tags, ...)"]) & fastq --> NF_CORE_META([nf-core samplesheet with metadata & fastq paths for downstream wf's ])

edmundmiller commented 2 years ago

> In an ideal world we can fast track this into nf-core/demultiplex so we're able to use it downstream

I think this is what we need to do! I think we can come back and add the subworkflows after they've come up naturally from redoing nf-core/demultiplex.

I did the template update https://github.com/nf-core/demultiplex/commit/474239ef8d5f3e24cfbcc965a1eb7e026c7f1aad And I've got a PR that just pulled everything out from the v1 into DSL2. We can toss them out and refactor from there. https://github.com/nf-core/demultiplex/pull/28

So feel free to open a PR using the bcl-convert module, and we can see where the chips fall.

The main thing we'll need as well is test data for bcl-convert that can be run by GitHub actions. I've got test data ready to release for bases2fastq that runs on minimal resources.

matthdsm commented 2 years ago

Great to see there's already a base to work with! Do you want me to push more stuff into your dsl2 PR or do I open a new one against dev? Personally, I'd go for the "move fast and break things" approach for the dev branch, as suggested on slack, so we can start "fresh" from a dsl2-ish base.

I'll check out your PR Monday (or earlier if I find the time) and drop a review so we can start merging.

matthdsm commented 2 years ago

Is there somewhere we can host the test data other than github? is gh LFS an option here? I suppose I can source the raw data from a miseq run somewhere to test bcl2fastq/bcl-convert

edmundmiller commented 2 years ago

> Do you want me to push more stuff into your dsl2 PR or do I open a new one against dev?

Open a new one against dev! That was just to get rid of the random processes. I'm also 100% sure it won't run, so we can delete all the sections of the workflow and start fresh. With DSL2 we may not have to do so many workarounds with the sample sheets.

> I'd go for the "move fast and break things" approach for the dev branch

Agreed! Luckily, it doesn't seem like anyone is depending on it right now, or the beauty of nextflow is they can easily use v1.

> Is there somewhere we can host the test data other than github? is gh LFS an option here?

https://github.com/nf-core/test-datasets is usually the preferred place. We can just toss out some data and cut it down to a few tiles. So we shouldn't need LFS, since we don't really care about the speed of git on that repo.

matthdsm commented 2 years ago

If there's still interest in this, I've made a (small) subworkflow here which we're currently using in testing. All required modules are up to date in nf-core, so it might be useful for others.

nf-core / modules

RFC: new subworkflow illumina_demultiplex #1515