Closed by matthdsm 2 years ago
I would suggest splitting the two. There are many sources for FASTQ files, for example public archives, so it makes sense to have a separate subworkflow for QC that takes FASTQ as input, IMO. Take a look at what we've done in taxprofiler: https://github.com/nf-core/taxprofiler/blob/dev/subworkflows/local/shortread_preprocessing.nf
Makes sense to me. I could open another RFC for a `fastq_qc` and `fastq_postprocessing` subworkflow.
Might also be worth seeing what nanoseq is doing. But I agree with the `fastq_qc` and `fastq_postprocessing` split. I'm going to add demultiplexing for Element soon as well, so one QC workflow to rule them all is the way.
Since `bclconvert` got merged, I can get cracking on this. There are still a few details I'd like to iron out: the `meta.csv`, and merging the correct fastq's with the records in the meta file. I've been doing 2 with our `bclconvert` equivalent (called `bases2fastq`).
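To make that merge concrete, here's a rough Python sketch of how demultiplexed fastq's could be paired with the records in a `meta.csv` by sample id, assuming Illumina-style `<Sample>_S<idx>_L<lane>_R<read>_001.fastq.gz` filenames. The `sample` column name and the helper itself are illustrative, not a fixed schema:

```python
import csv
from pathlib import Path

def match_fastqs_to_meta(meta_csv: str, fastq_dir: str) -> list[dict]:
    """Pair each meta.csv record with the fastq files whose names start
    with the record's sample id (Illumina-style names assumed).
    The 'sample' column name is hypothetical, not a fixed schema."""
    records = []
    with open(meta_csv) as fh:
        for row in csv.DictReader(fh):
            sample = row["sample"]
            fastqs = sorted(
                str(p)
                for p in Path(fastq_dir).glob(f"{sample}_S*_L*_R*_001.fastq.gz")
            )
            if not fastqs:
                raise ValueError(f"no fastqs found for sample {sample!r}")
            records.append({**row, "fastqs": fastqs})
    return records
```

Failing loudly when a sample has no fastqs seems safer than silently dropping records from the meta file.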
I could see 1 being better on a cluster and letting each node process a lane. But I'm personally afraid we'll get really hacky and possibly break stuff.
So a column inside the `samplesheet.csv` (the nf-core one) that then points to two CSVs per run? That's going to be a lot of inputs to manage. I'd vote we keep most of it with the sample name, and add extra columns if there's anything else we need.
I think we just spit them out and let the downstream pipeline handle them how they want, i.e. nf-core/rnaseq will `cat` them together.
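For the "let downstream `cat` them" route, the grouping a downstream pipeline has to do could look like this rough Python sketch (the function name is made up; the filename pattern is the usual bcl2fastq/bcl-convert one):

```python
import re
from collections import defaultdict

# bcl2fastq / bcl-convert name output files like
#   <Sample>_S<idx>_L<lane>_R<read>_001.fastq.gz
FASTQ_RE = re.compile(
    r"(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)

def group_for_cat(filenames):
    """Group per-lane fastqs by (sample, read), ordered by lane -- the
    order a downstream `cat` (as nf-core/rnaseq does it) would want."""
    groups = defaultdict(list)
    for name in filenames:
        m = FASTQ_RE.match(name)
        if not m or m["sample"] == "Undetermined":
            continue  # skip undetermined reads and non-matching files
        groups[(m["sample"], m["read"])].append((m["lane"], name))
    return {key: [n for _, n in sorted(lanes)] for key, lanes in groups.items()}
```

Sorting on the zero-padded lane field keeps the concatenation order deterministic across runs.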
This is going to go in nf-core/demultiplex, correct? I see the `fastp` and `fastqc` steps being a subworkflow, but I was imagining we'll just feed `bcl2fastq`, `bclconvert` and `bases2fastq` into that subworkflow. I think those are probably fine just being modules by themselves.
Hi @Emiller88, Thanks for the comment.
Currently we have a WDL workflow that demultiplexes our NovaSeq data following suggestion one. It requires some setup initially, but once it's there it's pretty much compatible with most (if not all) Illumina sequencer output.
That seems the most hacky way to me, since we'd sacrifice efficiency, and it doesn't account for the case where different lanes carry separate pools, which we have seen cause problems in this situation.
I'd keep the (nf-core) metadata completely separate from the samplesheet, since the metadata is related to samples and not the flowcell. Perhaps we can introduce two (nf-core) metadata samplesheets: one with data per lane and one with sample metadata, which is then "merged" with the resulting fastq's.
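That "merge" step could then be little more than a join on sample id. A hypothetical Python sketch (the `sample`/`fastq_1`/`fastq_2` column names are just for illustration, not a settled schema):

```python
import csv
from io import StringIO

def build_nfcore_samplesheet(sample_meta_rows, fastqs_by_sample):
    """Join per-sample metadata rows with the demultiplexed fastq paths
    into a downstream-style samplesheet. Column names are illustrative."""
    out = StringIO()
    writer = csv.writer(out)
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    for row in sample_meta_rows:
        fastq_1, fastq_2 = fastqs_by_sample[row["sample"]]
        writer.writerow([row["sample"], fastq_1, fastq_2])
    return out.getvalue()
```

Keeping the lane-level samplesheet out of this join entirely is the point: only the demultiplexer needs it, and the downstream sheet stays per-sample.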
Agreed, this might be the best way to go.
As for the QC steps, that was just the "dream" scenario. We've sketched out where we would like to go with this, and the schema was a small part of it. In an ideal world we can fast track this into nf-core/demultiplex so we're able to use it downstream. If not, I might have to focus on getting a functional prototype in house first before trying to get everything merged into nf-core.
```mermaid
flowchart LR
    LANE_META(["Lane metadata (number + samplesheet)"]) --"per lane"--> bcl-convert
    Flowcell --"per lane"--> bcl-convert
    bcl-convert --> fastq([fastq])
    SAMPLE_META(["Sample metadata (id, basename, rg, tags, ...)"]) & fastq --> NF_CORE_META([nf-core samplesheet with metadata & fastq paths for downstream wf's])
```
> In an ideal world we can fast track this into nf-core/demultiplex so we're able to use it downstream

I think this is what we need to do! I think we can come back and add the subworkflows after they've come up naturally from redoing nf-core/demultiplex.
I did the template update (https://github.com/nf-core/demultiplex/commit/474239ef8d5f3e24cfbcc965a1eb7e026c7f1aad), and I've got a PR that just pulled everything out from the v1 into DSL2. We can toss them out and refactor from there: https://github.com/nf-core/demultiplex/pull/28
So feel free to open a PR using the `bcl-convert` module, and we can see where the chips fall.
The main thing we'll need as well is test data for `bcl-convert` that can be run by GitHub Actions. I've got test data ready to release for `bases2fastq` that runs on minimal resources.
Great to see there's already a base to work with! Do you want me to push more stuff into your DSL2 PR, or do I open a new one against `dev`?
Personally, I'd go for the "move fast and break things" approach for the dev branch, as suggested on slack, so we can start "fresh" from a dsl2-ish base.
I'll check out your PR Monday (or earlier if I find the time) and drop a review so we can start merging.
Is there somewhere we can host the test data other than GitHub? Is GitHub LFS an option here? I suppose I can source the raw data from a MiSeq run somewhere to test bcl2fastq/bcl-convert.
> Do you want me to push more stuff into your dsl2 PR or do I open a new one against dev?
Open a new one against dev! That was just to get rid of the random processes. I'm also 100% sure it won't run, so we can delete all the sections of the workflow and start fresh. With DSL2 we may not have to do so many workarounds with the sample sheets.
> I'd go for the "move fast and break things" approach for the dev branch
Agreed! Luckily, it doesn't seem like anyone is depending on it right now, and the beauty of Nextflow is that they can easily keep using v1.
> Is there somewhere we can host the test data other than github? is gh LFS an option here?
https://github.com/nf-core/test-datasets is usually the preferred place. We can just put up some data and cut it down to a few tiles, so we shouldn't need LFS, since we don't really care about the speed of git on that repo.
I'd like to suggest a new subworkflow which demultiplexes a sequencer output directory to fastq's and performs basic QC.
Related to #1484 and #1485
EDIT: Updated flowchart