adamrtalbot commented 2 months ago

hello-channels

An additional module that would fit between hello-gatk and hello-modules.

Aims:

Teach users about the concepts of channels and functional programming with Nextflow
Teach users about data structure within channels
Teach users practical examples of operators to manipulate channels

Proposal:

Subject to change; this part might need further discussion.

Starting from the hello-gatk pipeline, add the following features stepwise:

Use a samplesheet to read in the BAM files (splitCsv)
Add a sample ID to each BAM file (tuples)
Pass the tuple between all processes with a manipulation (map)
Group per family ID (groupTuple)
Create samplesheet output

Key targets:

view for debugging
map for manipulating channel contents
One to three more advanced operators, such as collectFile, groupTuple, or join, to demonstrate how channels can be manipulated with built-in methods.
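
As a rough illustration of these targets, here is a minimal sketch that runs `view`, `map` and `groupTuple` on a small in-memory channel. The family and sample names are placeholder values chosen for the example, not the training data.

```nextflow
workflow {
    // Placeholder records shaped as [family, id]
    samples_ch = Channel.of(
        ['family1', 'mother'],
        ['family1', 'father'],
        ['family2', 'son']
    )

    // map reshapes each item; view prints items as they pass through
    samples_ch
        .map { family, id -> [family, id.toUpperCase()] }
        .view { item -> "after map: ${item}" }

    // groupTuple gathers items that share the first element (the key)
    samples_ch
        .groupTuple()
        .view { item -> "after groupTuple: ${item}" }
}
```

The `groupTuple` branch should emit one item per family, with the sample IDs gathered into a list.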

To do:

Write the final endpoint pipeline to aim for
Write intermediate steps as tutorial
Add any changes to hello-modules and hello-nf-test that need to be included

kenibrewer commented 2 months ago

I think this is a great training module plan. hello-gatk has a lot of content and this feels like a logical grouping to split out.

adamrtalbot commented 1 month ago

hello-channels

1 Debugging

Objective: Know how to view the contents of a channel

1.1. Use `.view()` to debug a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

This would be broken down into multiple steps, using .view() to inspect the channel contents at each one.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)

2.2. Use `.map()` to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }
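
To make the two steps above concrete, here is a small sketch that can run without a samplesheet on disk. It assumes a two-column layout (`id`, `bam`) and uses inline CSV text with placeholder paths in place of `params.reads_bam`.

```nextflow
workflow {
    // Inline CSV text stands in for the samplesheet file; the paths are placeholders
    Channel.of('id,bam\nmother,reads_mother.bam\nfather,reads_father.bam')
        .splitCsv(header: true)
        // With header: true, each row is a map keyed by the column names
        .view { row -> "row: ${row}" }
        // Turn each row into a [sample ID, file] tuple
        .map { row -> [row.id, file(row.bam)] }
        .view { id, bam -> "tuple: ${id}, ${bam.name}" }
}
```

Viewing the channel before and after `map` shows the change from a row map to a two-element tuple.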

2.3. Carry sample ID through the pipeline

input:
    tuple val(id), path(bam)    // plus path(bai) in processes that also take the BAM index

etc.
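
A sketch of what carrying the ID through could look like. `DESCRIBE_SAMPLE` is a hypothetical stand-in process, not part of hello-gatk, and the workflow assumes `params.reads_bam` points to a CSV with `id` and `bam` columns whose BAM paths exist.

```nextflow
// Hypothetical stand-in for a real tool: records which BAM belongs to which sample
process DESCRIBE_SAMPLE {
    input:
    tuple val(id), path(bam)

    output:
    // Emit the ID alongside the result so downstream steps keep the association
    tuple val(id), path("${id}.txt")

    script:
    """
    echo "${id} ${bam}" > ${id}.txt
    """
}

workflow {
    reads_ch = Channel.fromPath(params.reads_bam)
        .splitCsv(header: true)
        .map { row -> [row.id, file(row.bam)] }

    DESCRIBE_SAMPLE(reads_ch)

    DESCRIBE_SAMPLE.out.view { id, txt -> "${id} -> ${txt.name}" }
}
```

Because every output tuple starts with the same `id`, results can still be matched to their sample even when tasks finish out of order.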

3 Maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support more than one family per run by adding a family (cohort) ID to the samplesheet.

3.1. Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.
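
For reference, a self-contained sketch of the meta map pattern. It assumes a three-column samplesheet (`family`, `id`, `bam`) and uses inline CSV text with placeholder paths.

```nextflow
workflow {
    // Inline CSV text stands in for the samplesheet; the values are placeholders
    Channel.of('family,id,bam\nfamily1,mother,reads_mother.bam\nfamily1,father,reads_father.bam')
        .splitCsv(header: true)
        // The first element is a map of sample metadata, the second is the file
        .map { row -> [[id: row.id, family: row.family], file(row.bam)] }
        .view { meta, bam -> "${meta.family}/${meta.id}: ${bam.name}" }
}
```

The appeal of this design is that new metadata fields can be added to the map later without changing the shape of the tuple that processes receive.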

3.2. Replace sample ID with meta map

input:
    tuple val(meta), path(bam)    // plus path(bai) in processes that also take the BAM index

3.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    // Put the family ID first so groupTuple can use it as the grouping key
    .map { meta, vcf, idx ->
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()
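
To see what the grouped channel looks like, here is a minimal sketch with placeholder items shaped like the haplotyper output. Plain strings stand in for the VCF and index files.

```nextflow
workflow {
    // Placeholder items shaped as [meta, vcf, idx]
    Channel.of(
            [[id: 'mother', family: 'family1'], 'mother.g.vcf', 'mother.g.vcf.idx'],
            [[id: 'father', family: 'family1'], 'father.g.vcf', 'father.g.vcf.idx'],
            [[id: 'son',    family: 'family2'], 'son.g.vcf',    'son.g.vcf.idx']
        )
        // Put the grouping key first; groupTuple groups on the first element by default
        .map { meta, vcf, idx -> [meta.family, meta, vcf, idx] }
        .groupTuple()
        .view { family, metas, vcfs, idxs -> "${family}: ${vcfs}" }
}
```

Each emitted item holds the family ID followed by parallel lists of meta maps, VCFs and indexes, which is the shape a per-family joint genotyping step would take as input.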

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

Thoughts? Does this cover sufficient objectives? Should we extend it further and include more operators? If so, which? Or is it too much, requiring too much rewiring of the whole pipeline?

kenibrewer commented 1 month ago

This is perfect. I really like the progression that you've designed here. I was trying to explain this concept to someone fresh out of hello-gatk using our existing training materials (in Advanced) and I quickly ran into the issue of needing to explain things that hadn't been covered yet.

adamrtalbot commented 1 month ago

After discussion with @maxulysse, we think we can make it better.

Move the jointgenotyping part out of hello-gatk and make it the first introduction to hello-channels
Use view to inspect the contents of the channel before and after collect()
Add sample ID to samplesheet as above

Then we could have another module afterwards which includes more advanced concepts like map and groupTuple.

adamrtalbot commented 1 month ago

hello-channels

1 Collect

Objective: Understand how to collect a channel into a single item.

1.1. Add jointgenotyping process

As hello-gatk. Run and see that every VCF is being run separately.

1.2. Use `.view()` to inspect the contents of a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()

1.3. Collect results of haplotyper process

all_vcfs = GATK_HAPLOTYPECALLER.out[0].collect()
all_tbis = GATK_HAPLOTYPECALLER.out[1].collect()

1.4. View to see the contents of the collection

all_vcfs.view()
all_tbis.view()

1.5. Run with jointgenotyping again

See that only one process ran.
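
A minimal sketch of the before/after comparison, using placeholder file names in an in-memory channel instead of the haplotyper output:

```nextflow
workflow {
    // Placeholder names standing in for per-sample GVCFs
    vcfs_ch = Channel.of('mother.g.vcf', 'father.g.vcf', 'son.g.vcf')

    // Before collect: three separate items, so a downstream process would run three times
    vcfs_ch.view { vcf -> "before collect: ${vcf}" }

    // After collect: a single item holding all three, so a downstream process runs once
    vcfs_ch.collect().view { vcfs -> "after collect: ${vcfs}" }
}
```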

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

This would be broken down into multiple steps, using .view() to inspect the channel contents at each one.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)

2.2. Use `.map()` to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }

2.3. Carry sample ID through the pipeline

input:
    tuple val(id), path(bam)    // plus path(bai) in processes that also take the BAM index

etc.

hello-operator

or hello-meta?

1 Maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support more than one family per run by adding a family (cohort) ID to the samplesheet.

1.1. Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.

1.2. Replace sample ID with meta map

input:
    tuple val(meta), path(bam)    // plus path(bai) in processes that also take the BAM index

1.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    // Put the family ID first so groupTuple can use it as the grouping key
    .map { meta, vcf, idx ->
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

2 join?

Objective: Understand how to join channel contents together by common element.

We could join just prior to groupTuple above? Unclear where it would fit in best here, but that's outside of the scope of this issue.
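
Wherever it ends up, a minimal sketch of what `join` does, assuming two channels keyed by sample ID with placeholder file names:

```nextflow
workflow {
    // Two channels that share a key (the sample ID) but arrive in different orders
    vcf_ch = Channel.of(['mother', 'mother.g.vcf'], ['father', 'father.g.vcf'])
    idx_ch = Channel.of(['father', 'father.g.vcf.idx'], ['mother', 'mother.g.vcf.idx'])

    // join matches items on the first element by default
    vcf_ch
        .join(idx_ch)
        .view { id, vcf, idx -> "${id}: ${vcf} + ${idx}" }
}
```

Each emitted item pairs the VCF and index that share an ID, regardless of the order in which the two channels produced them.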

vdauwera commented 1 month ago

> Move the jointgenotyping part out of hello-gatk and make it the first introduction to hello-channels

I love the overall plan for Hello-Channels but I think I'd like to keep the joint-genotyping as part of the Hello-GATK module, because it makes for a very satisfying example as it stands now.

However I could be convinced to change my mind because as I type I realize this could be an opportunity to simplify GATK further (the GVCF stuff is a bit of a curve ball). We could change Hello-GATK to emit regular VCFs and have that module show a purely linear example (and also keep the groovy magic mostly out of the 'first bioinfx example' for simplicity). And that way people are already a bit further down their Nextflow journey when they hit the more interesting plumbing options.

Ok I've gone and convinced myself this is the way to go.

Question: should this new Hello-Channels module come before or after the Config/Modules/nf-test ones? (note that I want to move hello-config to before hello-modules)

vdauwera commented 3 weeks ago

I implemented part of this in https://github.com/nextflow-io/training/pull/408 with the following caveats:

Added use of .view() to inspect the contents of a channel earlier, in hello-genomics (formerly hello-gatk)
The use of collect() to bring the GVCFs together, plus the closure and join() used to generate the concatenated string for the GenomicsDBImport command, ends up taking a lot of explaining (it could probably use more explicit, worked-out .view()ing, but that will have to wait)
The introduction of samplesheet and metamap is very compelling but I think it needs to be its own "Hello Meta" module, which I don't have the time to put together now. But it's the logical next expansion todo.
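
For readers following along, here is a small sketch of the closure-and-join step mentioned above, assuming the GVCF paths are already available as a plain list. The file names and the `gvcfs_line` variable are illustrative. Note that `collect` and `join` here are Groovy list methods, not the channel operators of the same names, which is part of why this step takes some explaining.

```nextflow
workflow {
    // Placeholder GVCF names standing in for the collected haplotyper outputs
    def gvcfs = ['reads_mother.bam.g.vcf', 'reads_father.bam.g.vcf', 'reads_son.bam.g.vcf']

    // Groovy's List.collect transforms each element; List.join concatenates with a separator
    def gvcfs_line = gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')

    println gvcfs_line
    // prints: -V reads_mother.bam.g.vcf -V reads_father.bam.g.vcf -V reads_son.bam.g.vcf
}
```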

nextflow-io / training

Module proposal: hello-channels #367
