nextflow-io / training

Nextflow training material
https://training.nextflow.io/

Module proposal: hello-channels #367

Open adamrtalbot opened 2 months ago

adamrtalbot commented 2 months ago

hello-channels

An additional module that would fit between hello-gatk and hello-modules.

Aims:

Proposal:

Subject to change, this part might need further discussion.

From the hello-gatk pipeline, add the following features stepwise

  1. Use a samplesheet to read in the BAM files (splitCsv)
  2. Add a sample ID to each BAM file (tuples)
  3. Pass the tuple between all processes with a manipulation (map)
  4. Group per family ID (groupTuple)
  5. Create samplesheet output
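Step 5 has no code in this proposal, so here is one possible sketch of a samplesheet output using collectFile. The IDs, file paths, output file name and `results` directory are all hypothetical placeholders, not part of hello-gatk.

```groovy
workflow {
    // Hypothetical stand-in for per-sample results emitted by the pipeline
    Channel.of(
        ['mother', 'results/reads_mother.bam.g.vcf'],
        ['father', 'results/reads_father.bam.g.vcf'],
        ['son',    'results/reads_son.bam.g.vcf']
    )
        // Turn each [id, vcf] pair into one CSV line
        .map { id, vcf -> "${id},${vcf}\n" }
        // Gather all lines into a single file; seed initialises it with a header line
        .collectFile(name: 'samplesheet_out.csv', seed: "id,vcf\n", storeDir: 'results')
        .view { f -> "Wrote ${f}" }
}
```

The resulting CSV could then be fed into a downstream pipeline via splitCsv, which closes the loop with step 1.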

Key targets:

To do:

Related issues

https://github.com/nextflow-io/training/issues/361 https://github.com/nextflow-io/training/issues/359

kenibrewer commented 2 months ago

I think this is a great training module plan. hello-gatk has a lot of content and this feels like a logical grouping to split out.

adamrtalbot commented 1 month ago

hello-channels

1 Debugging

Objective: Know how to view the contents of a channel

1.1. Use .view() to debug a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

Would break down into multiple steps with use of .view() to inspect channel contents.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
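As a rough illustration of the intermediate .view() step, the sketch below parses inline CSV text instead of a file so it can be run on its own. The column names `id` and `bam` are assumptions that match the row.id and row.bam accessors used in this proposal.

```groovy
workflow {
    // Inline CSV text stands in for the samplesheet file (assumed columns: id, bam)
    Channel.of('id,bam\nmother,reads_mother.bam\nfather,reads_father.bam')
        // With header: true, each row is emitted as a map keyed by column name
        .splitCsv(header: true)
        .view { row -> "id=${row.id} bam=${row.bam}" }
}
```

Because each row is a map, the fields can be addressed by name rather than by position, which is what the .map() step relies on.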

2.2. Use .map() to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }

2.3. Carry sample ID through the pipeline

input:
    tuple val(id), path(bam)    // add path(bai) once the index travels with the BAM

etc.
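A minimal sketch of what carrying the ID through could look like, using an indexing process modelled on the one in hello-gatk. The process body and the samplesheet columns (`id`, `bam`) are assumptions, and samtools is assumed to be available in the execution environment.

```groovy
// Sketch only: an indexing process that keeps the sample ID attached to its files
process SAMTOOLS_INDEX {
    input:
    tuple val(id), path(input_bam)

    output:
    // Re-emit the ID so downstream processes know which sample each file belongs to
    tuple val(id), path(input_bam), path("${input_bam}.bai")

    script:
    """
    samtools index '${input_bam}'
    """
}

workflow {
    // Samplesheet in CSV format with assumed columns id and bam
    reads_ch = Channel.fromPath(params.reads_bam)
        .splitCsv(header: true)
        .map { row -> [row.id, file(row.bam)] }

    SAMTOOLS_INDEX(reads_ch)
    SAMTOOLS_INDEX.out.view { id, bam, bai -> "${id}: ${bam.name} + ${bai.name}" }
}
```

Every later process would declare the same leading val(id) in both its input and output tuples.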

3 Maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support > 1 family per run by adding a family (cohort) ID to the sample sheet

3.1 Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.

3.2 Replace sample ID with meta map:

input:
    tuple val(meta), path(bam)    // add path(bai) once the index travels with the BAM

3.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${input_bam}.g.vcf"), path("${input_bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Put the family ID first so groupTuple() uses it as the grouping key
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()
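To make the effect of groupTuple concrete, here is a small self-contained sketch. The channel contents are hypothetical stand-ins for the haplotyper output, with plain strings in place of real files.

```groovy
workflow {
    // Hypothetical stand-in for GATK_HAPLOTYPECALLER.out: [meta, vcf, idx]
    Channel.of(
        [[id: 'mother', family: 'family1'], 'mother.g.vcf', 'mother.g.vcf.idx'],
        [[id: 'father', family: 'family1'], 'father.g.vcf', 'father.g.vcf.idx'],
        [[id: 'son',    family: 'family2'], 'son.g.vcf',    'son.g.vcf.idx']
    )
        // Move the family ID to the front so it becomes the grouping key
        .map { meta, vcf, idx -> [meta.family, meta, vcf, idx] }
        // Emits one item per family: [family, [metas], [vcfs], [idxs]]
        .groupTuple()
        .view()
}
```

Each emitted item then holds everything one joint-genotyping task needs for a single family.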

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

Thoughts? Does this cover sufficient objectives? Should we extend it further and include more operators? If so, which? Or is it too much, requiring too much re-wiring of the whole pipeline?

kenibrewer commented 1 month ago

This is perfect. I really like the progression that you've designed here. I was trying to explain this concept to someone fresh out of hello-gatk using our existing training materials (in Advanced) and I quickly ran into the issue of needing to explain things that hadn't been covered yet.

adamrtalbot commented 1 month ago

After discussion with @maxulysse, we think we can make it better.

Then we could have another module afterwards which includes more advanced concepts like map and groupTuple.

adamrtalbot commented 1 month ago

hello-channels

1 Collect

Objective: Understand how to collect a channel into 1 item.

1.1. Add jointgenotyping process

As in hello-gatk. Run it and see that every VCF is processed separately.

1.2. Use .view() to inspect the contents of a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()

1.3. Collect results of haplotyper process

all_vcfs = GATK_HAPLOTYPECALLER.out[0].collect()
all_tbis = GATK_HAPLOTYPECALLER.out[1].collect()

1.4. View to see the contents of the collection

all_vcfs.view()
all_tbis.view()

1.5. Run with jointgenotyping again

See that only one process ran.
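The difference collect makes can be sketched without running GATK at all. The file names below are hypothetical strings standing in for the haplotyper outputs.

```groovy
workflow {
    // Hypothetical stand-in for the per-sample VCFs
    vcfs_ch = Channel.of('mother.g.vcf', 'father.g.vcf', 'son.g.vcf')

    // Without collect: three separate items, so a downstream process would run three times
    vcfs_ch.view { vcf -> "before collect: ${vcf}" }

    // With collect: a single item holding the full list, so a downstream process runs once
    vcfs_ch.collect().view { all -> "after collect: ${all}" }
}
```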

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

Would break down into multiple steps with use of .view() to inspect channel contents.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)

2.2. Use .map() to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }

2.3. Carry sample ID through the pipeline

input:
    tuple val(id), path(bam)    // add path(bai) once the index travels with the BAM

etc.

hello-operator

or hello-meta?

1 maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support > 1 family per run by adding a family (cohort) ID to the sample sheet

1.1 Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.

1.2 Replace sample ID with meta map:

input:
    tuple val(meta), path(bam)    // add path(bai) once the index travels with the BAM

1.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${input_bam}.g.vcf"), path("${input_bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Put the family ID first so groupTuple() uses it as the grouping key
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

2 join?

Objective: Understand how to join channel contents together by common element.

We could join just prior to groupTuple above? Unclear where it would fit in best here, but that's outside of the scope of this issue.
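For reference, a minimal sketch of what join does, independent of where it ends up in the module. The two channels and their contents are hypothetical; the sample ID in first position is the common element.

```groovy
workflow {
    // Two hypothetical channels keyed by sample ID, emitted in different orders
    vcf_ch = Channel.of(['mother', 'mother.g.vcf'], ['father', 'father.g.vcf'])
    idx_ch = Channel.of(['father', 'father.g.vcf.idx'], ['mother', 'mother.g.vcf.idx'])

    // join matches items on their first element by default,
    // emitting [id, vcf, idx] for each ID present in both channels
    vcf_ch.join(idx_ch).view()
}
```

This is the case where two outputs of the same sample arrive separately and need to be re-paired before a step such as groupTuple.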

vdauwera commented 1 month ago
  • Move the jointgenotyping part out of hello-gatk and make it the first introduction to hello-channels

I love the overall plan for Hello-Channels but I think I'd like to keep the joint-genotyping as part of the Hello-GATK module, because it makes for a very satisfying example as it stands now.

However I could be convinced to change my mind because as I type I realize this could be an opportunity to simplify GATK further (the GVCF stuff is a bit of a curve ball). We could change Hello-GATK to emit regular VCFs and have that module show a purely linear example (and also keep the groovy magic mostly out of the 'first bioinfx example' for simplicity). And that way people are already a bit further down their Nextflow journey when they hit the more interesting plumbing options.

Ok I've gone and convinced myself this is the way to go.

Question: should this new Hello-Channels module come before or after the Config/Modules/nf-test ones? (note that I want to move hello-config to before hello-modules)

vdauwera commented 3 weeks ago

I implemented part of this in https://github.com/nextflow-io/training/pull/408 with the following caveats: