Open marchoeppner opened 6 years ago
Hi Marco,
I think this is important (more so for WGS samples). We use an approach where an input yaml file is created, that consists of a samples
dictionary, with so called readunits
as its members. The only mandatory readunit key is fq1
(fastq R1). The others are fq2
, run_id
, lane_id
, library_id
, flowcell_id
. This allows to also construct a read group based on these values.
The input channel is then set up as follows:
def GetReadPair = { sk, rk ->
tuple(file(params.samples[sk].readunits[rk]['fq1']),
file(params.samples[sk].readunits[rk]['fq2']))
}
def GetReadUnitKeys = { sk ->
params.samples[sk].readunits.keySet()
}
Channel
.from(sample_keys)
.map { sk -> tuple(sk, GetReadUnitKeys(sk).collect{GetReadPair(sk, it)}.flatten()) }
.set { fastq_ch }
There might be more elegant ways. Probably makes sense to discuss this on Gitter...
In Sarek, we have one BWA-mem | samtools sort process cf: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L166-L187 Then we're grouping samples by the read groups: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L199-L205 And then merging the BAMs: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L207-L222
For read pairs we do this:
// Define channel for raw reads
if (pairedEnd) {
rawReads = designFilePaths
.splitCsv(sep: '\t', header: true)
.map { row -> [ row.sample_id, [row.fastq_read1, row.fastq_read2], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
} else {
rawReads = designFilePaths
.splitCsv(sep: '\t', header: true)
.map { row -> [ row.sample_id, [row.fastq_read1], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
}
Thanks @vsmalladi and @MaxUlysse for the references!
Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.
Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines: