sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to genome assembly. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/readmapping
MIT License
11 stars 6 forks source link

add input type `fastq.gz` and `fq.gz` for Illumina and HiC reads #96

Closed reichan1998 closed 4 months ago

reichan1998 commented 4 months ago

To solve issue #88

PR checklist

tkchafin commented 4 months ago

For this one since align_illumina_fastq is just copying all of the steps (except for CRAM -> fastq conversion), it would be simpler to using a branch operation to determine if the SAMTOOLS_FASTQ step is needed (true for CRAM but false if input type is FASTQ).

For example you can see further on in align_short.nf where there is a branch operation to determine whether SAMTOOLS_MERGE will be run:

    // Collect all BWAMEM2 output by sample name
    BWAMEM2_MEM.out.bam
    | map { meta, bam -> [['id': meta.id.split('_')[0..-2].join('_'), 'datatype': meta.datatype], meta.read_count, bam] }
    | groupTuple( by: [0] )
    | map { meta, read_counts, bams -> [meta + [read_count: read_counts.sum()], bams] }
    | branch {
        meta, bams ->
            single_bam: bams.size() == 1
            multi_bams: true
    }
    | set { ch_bams }

    // Merge, but only if there is more than 1 file
    SAMTOOLS_MERGE ( ch_bams.multi_bams, [ [], [] ], [ [], [] ] )
    ch_versions = ch_versions.mix ( SAMTOOLS_MERGE.out.versions.first() )

The branch section of this is creating two branches: single_bam if the sample has a single BAM file, or multi_bams, which catches all cases where there is not a single BAM file:

    | branch {
        meta, bams ->
            single_bam: bams.size() == 1
            multi_bams: true
    }

The next step is we run SAMTOOLS_MERGE only on the multi_bams branch, and then combine the outputs with the single_bam branch. The process that you will need to implement for FASTQ vs CRAM inputs will be similar

tkchafin commented 4 months ago

Also note that ALIGN_SHORT is actually used for both Illumina and HiC reads, so any changes in documentation for Illumina also applies to HiC:

include { ALIGN_SHORT as ALIGN_HIC      } from '../subworkflows/local/align_short'
include { ALIGN_SHORT as ALIGN_ILLUMINA } from '../subworkflows/local/align_short'
reichan1998 commented 4 months ago

Thank you very much for your instructions, @tkchafin. I noticed that ALIGN_SHORT is used for both Illumina and HiC reads. Therefore, I decided to add a condition to branch reads with the fastq.gz format in readmapping.nf, so it will leave the ALIGN_HIC unchanged. Your approach is certainly more efficient and structured, and I will adjust as you described.

tkchafin commented 4 months ago

I think we can allow the FASTQ option for both HiC and Illumina, so it might be best to keep the branching inside of align_short (the feature request actually was about HiC https://github.com/sanger-tol/readmapping/issues/88 )

reichan1998 commented 4 months ago

I'm so sorry for the oversight and misunderstanding regarding the FASTQ option for HiC. I've made the changes as per your instructions . Thank you for your review and guidance.

reichan1998 commented 4 months ago

Hi @tkchafin, I've made the changes following your instructions. Could you please help me review them? I really appreciate your guidance. Thank you!

tkchafin commented 4 months ago

Hi @reichan1998 thanks for these changes. I've been off sick the past few days, but will have a look soon.

tkchafin commented 4 months ago

I have added the file to the hosted test_data, the path is https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4%231.subset.fastq so you can put this into the assets/samplesheet_s3.csv for mMelMel2,illumina to test the pipeline with the fastq input option for that sample