Open adamrtalbot opened 2 months ago
I think this is a great training module plan. hello-gatk has a lot of content and this feels like a logical grouping to split out.
Objective: Know how to view the contents of a channel
.view() to debug a channel:
// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
    .splitText()
    .view()
Objective: Understand how sample information can be associated with a sample
Would break this down into multiple steps, using .view() to inspect channel contents.
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
.map() to modify items in a channel:
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
    .map { row -> [row.id, file(row.bam)] }
input:
tuple val(id), path(bam){, path(bai)}
etc.
Objective: Understand how sample information can be used to make Nextflow extremely scalable
Support more than one family per run by adding a family (cohort) ID to the sample sheet
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
    .map { row ->
        [
            [
                id: row.id,
                family: row.family,
            ],
            file(row.bam)
        ]
    }
Note: nf-schema can do this for you.
input:
tuple val(meta), path(bam){, path(bai)}
Output of haplotyper:
output:
tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")
Collect families using groupTuple:
GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Return a list keyed by family so groupTuple groups on the first element
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()
Add a fake family 2 to the input CSV:
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
Thoughts? Does this cover sufficient objectives? Should we be extending it further and including more operators? If so, which? Or is it too much, requiring too much re-wiring of the whole pipeline?
This is perfect. I really like the progression that you've designed here. I was trying to explain this concept to someone fresh out of hello-gatk using our existing training materials (in Advanced) and I quickly ran into the issue of needing to explain things that hadn't been covered yet.
After discussion with @maxulysse, we think we can make it better.
collect()
Then we could have another module afterwards which includes more advanced concepts like map and groupTuple.
Objective: Understand how to collect a channel into 1 item.
Start from hello-gatk as is. Run it and see that every VCF is being run separately.
.view() to inspect the contents of a channel:
// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
    .splitText()
    .view()
all_vcfs = GATK_HAPLOTYPECALLER.out[0].collect()
all_tbis = GATK_HAPLOTYPECALLER.out[1].collect()
all_vcfs.view()
all_tbis.view()
See that only one process ran.
Objective: Understand how sample information can be associated with a sample
Would break this down into multiple steps, using .view() to inspect channel contents.
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
.map() to modify items in a channel:
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
    .map { row -> [row.id, file(row.bam)] }
input:
tuple val(id), path(bam){, path(bai)}
etc.
or hello-meta?
Objective: Understand how sample information can be used to make Nextflow extremely scalable
Support more than one family per run by adding a family (cohort) ID to the sample sheet
// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
    .splitCsv(header: true)
    .map { row ->
        [
            [
                id: row.id,
                family: row.family,
            ],
            file(row.bam)
        ]
    }
Note: nf-schema can do this for you.
input:
tuple val(meta), path(bam){, path(bai)}
Output of haplotyper:
output:
tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")
Collect families using groupTuple:
GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Return a list keyed by family so groupTuple groups on the first element
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()
Add a fake family 2 to the input CSV:
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
Objective: Understand how to join channel contents together by common element.
We could join just prior to groupTuple above? Unclear where it would fit in best here, but that's outside of the scope of this issue.
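As a rough illustration of what join does with a common element, here is a minimal sketch. The channel names and file names are placeholders invented for the example, not part of the training pipeline, and the values are plain strings rather than real files.

```nextflow
workflow {
    // Hypothetical channels keyed by sample ID (placeholder names only)
    bam_ch = Channel.of(
        ['mother', 'reads_mother.bam'],
        ['father', 'reads_father.bam']
    )
    bai_ch = Channel.of(
        ['father', 'reads_father.bam.bai'],
        ['mother', 'reads_mother.bam.bai']
    )

    // By default, join matches items on their first element (index 0),
    // so each emitted item has the shape [id, bam, bai]
    bam_ch
        .join(bai_ch)
        .view()
}
```

The order in which the two channels emit their items does not matter for the matching, which is the point worth showing to learners: the key, not the position, decides which items are paired.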
- Move the joint genotyping part out of hello-gatk and make it the first introduction to hello-channels
I love the overall plan for Hello-Channels but I think I'd like to keep the joint-genotyping as part of the Hello-GATK module, because it makes for a very satisfying example as it stands now.
However I could be convinced to change my mind because as I type I realize this could be an opportunity to simplify GATK further (the GVCF stuff is a bit of a curve ball). We could change Hello-GATK to emit regular VCFs and have that module show a purely linear example (and also keep the groovy magic mostly out of the 'first bioinfx example' for simplicity). And that way people are already a bit further down their Nextflow journey when they hit the more interesting plumbing options.
Ok I've gone and convinced myself this is the way to go.
Question: should this new Hello-Channels module come before or after the Config/Modules/nf-test ones? (note that I want to move hello-config to before hello-modules)
I implemented part of this in https://github.com/nextflow-io/training/pull/408 with the following caveats:
Added use of .view() to inspect the contents of a channel earlier, in hello-genomics (formerly hello-gatk)
The use of collect() for bringing GVCFs together, plus the closure and join() to generate the concatenated string for the GenomicsDBImport command, ends up taking a lot of explaining (it could probably use more explicit, worked-out use of .view(), but that will have to be for later)
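For anyone following along, a minimal sketch of that pattern outside of a process might help. The file names below are placeholders, and doing the string-building in a map (rather than in the process script block) is an assumption made here only so the result can be inspected with .view().

```nextflow
workflow {
    // Placeholder names standing in for the GVCFs emitted by the variant caller
    gvcf_ch = Channel.of('reads_mother.g.vcf', 'reads_father.g.vcf', 'reads_son.g.vcf')

    gvcf_ch
        .collect()    // channel operator: gathers all items into a single list
        .map { gvcfs ->
            // Groovy list methods, not channel operators:
            // collect applies the closure to each element,
            // join concatenates the resulting strings with a separator
            gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')
        }
        .view()
}
```

This should print one string made of a "-V <file>" entry per GVCF. Part of what makes this hard to explain is that collect appears twice with two different meanings: once as a channel operator and once as a Groovy list method. The join() here is likewise the Groovy list method, not the join channel operator.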
The introduction of the samplesheet and metamap is very compelling, but I think it needs to be its own "Hello Meta" module, which I don't have the time to put together now. It's the logical next expansion to do.
hello-channels
An additional module that would fit between hello-gatk and hello-modules.
Aims:
Proposal:
Subject to change, this part might need further discussion.
From the hello-gatk pipeline, add the following features stepwise:
- splitCsv
- map
- groupTuple
Key targets:
- view for debugging
- map for manipulating channel contents
- collectFile, groupTuple, join for demonstrating how channels can be manipulated with built-in methods.
To do:
Related issues:
https://github.com/nextflow-io/training/issues/361
https://github.com/nextflow-io/training/issues/359