Enable multi-lane support

nf-core / exoseq

Please consider using/contributing to https://github.com/nf-core/sarek

http://nf-co.re

MIT License

16 stars 28 forks source link

Enable multi-lane support #20

Open marchoeppner opened 6 years ago

marchoeppner commented 6 years ago

Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.

Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines:

runBWAOutput_grouped_by_sample = runBWAOutput.groupTuple(by: [0,1])

process mergeBamFiles_bySample {

        tag "${indivID}|${sampleID}"

    input:
        set indivID, sampleID, file(aligned_bam_list) from runBWAOutput_grouped_by_sample

    output:
    set indivID,sampleID,file(merged_bam) into mergedBamFile_by_Sample

    script:
    merged_bam = sampleID + "merged.bam"

    """
        java -jar -Djava.io.tmpdir=tmp/ -jar ${PICARD} MergeSamFiles \
            INPUT=${aligned_bam_list.join(' INPUT=')} \
            OUTPUT=${merged_bam} \
            CREATE_INDEX=false \
            CREATE_MD5_FILE=false \
            SORT_ORDER=coordinate
    """
}

andreas-wilm commented 6 years ago

Hi Marco,

I think this is important (more so for WGS samples). We use an approach where an input yaml file is created, that consists of a samples dictionary, with so called readunits as its members. The only mandatory readunit key is fq1 (fastq R1). The others are fq2, run_id, lane_id, library_id, flowcell_id. This allows to also construct a read group based on these values.

The input channel is then set up as follows:

def GetReadPair = { sk, rk ->
    tuple(file(params.samples[sk].readunits[rk]['fq1']),
          file(params.samples[sk].readunits[rk]['fq2']))
}

def GetReadUnitKeys = { sk ->
    params.samples[sk].readunits.keySet()
}

Channel
    .from(sample_keys)
    .map { sk -> tuple(sk, GetReadUnitKeys(sk).collect{GetReadPair(sk, it)}.flatten()) }
    .set { fastq_ch }

There might be more elegant ways. Probably makes sense to discuss this on Gitter...

maxulysse commented 6 years ago

In Sarek, we have one BWA-mem | samtools sort process cf: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L166-L187 Then we're grouping samples by the read groups: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L199-L205 And then merging the BAMs: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L207-L222

vsmalladi commented 6 years ago

For read pairs we do this:

// Define channel for raw reads
if (pairedEnd) {
  rawReads = designFilePaths
    .splitCsv(sep: '\t', header: true)
    .map { row -> [ row.sample_id, [row.fastq_read1, row.fastq_read2], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
} else {
rawReads = designFilePaths
  .splitCsv(sep: '\t', header: true)
  .map { row -> [ row.sample_id, [row.fastq_read1], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
}

andreas-wilm commented 6 years ago

Thanks @vsmalladi and @MaxUlysse for the references!