kzeglinski / nabseq_nf

Nextflow pipeline for NAb-seq (analyses nanopore sequencing data from hybridomas and single B cells)

Suggested replacement of inefficient part of pipeline (subset_aligned_reads) #10

Open medmaca opened 1 month ago

medmaca commented 1 month ago

Hi, firstly thank you for creating the nabseq_nf pipeline.

I've been running the nabseq pipeline on Seqera/Tower and noticed that the biggest bottleneck is the subset_aligned_reads step, which for larger libraries was taking many days to complete. I have a potential workaround, attached here for your consideration, which significantly speeds up the pipeline and which I believe generates the same output.

It involves 3 main changes.

Modifying the reference files to remove spaces in headers (replace with @ symbols):

modified_references.zip

This is required so that the SAM files produced by minimap2 don't have duplicate sequence names.
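For illustration, a minimal sketch of the kind of header edit involved (assuming a FASTA reference; the file names here are hypothetical, and the actual edited files are in the attached zip):

# replace every space in FASTA header lines with an @ symbol, leaving sequence lines untouched
sed '/^>/ s/ /@/g' original_reference.fasta > modified_reference.fasta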

Modifying minimap2.nf

minimap2.zip

Changing the minimap2 output from PAF to SAM, retaining only the best matches. This relies on the modified reference files (which have no spaces in their headers) so that the SAM files are in the correct format for samtools in the next step.

// adapted from the nf-core module: https://github.com/nf-core/modules/tree/master/modules/nf-core/minimap2/align
// replaced meta.id with just meta, because sample name is our only metadata
// also set --secondary=no since we don't want secondary alignments, and removed the bam output option (& samtools) since it was not needed
// finally, changed to also emit the aligned reads (because we need them in the next step of subsetting)

process minimap2_alignment {
    tag "$prefix"
    label 'process_high'

    conda (params.enable_conda ? 'bioconda::minimap2=2.24' : null)
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/minimap2%3A2.24--h7132678_1' :
        'quay.io/biocontainers/minimap2:2.24--h7132678_1' }"

    input:
    tuple val(meta), path(reads)
    path reference

    output:
    // note: the emit name paf_reads is kept for compatibility, although the file is now SAM
    tuple val(meta), path("*.sam"), path(reads), optional: true, emit: paf_reads
    tuple val(meta), path(reads), path(reference), path("*.sam"), optional: true, emit: for_racon

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''

    // allow for a bunch of metadata (although the first element should be sample name)
    if(meta instanceof Collection) {
        prefix = meta[0]
    } else {
        prefix = meta
    }

    """
    minimap2 \\
        $args \\
        -t $task.cpus \\
        --secondary=no \\
        -x map-ont \\
        "${reference ?: reads}" \\
        "$reads" \\
        -o ${prefix}.sam \\
        --sam-hit-only \\
        -a
    """
}
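As a quick sanity check (not part of the pipeline; the file name sample1.sam is hypothetical), the SAM produced with --secondary=no and --sam-hit-only should contain no secondary and no unmapped records:

samtools flagstat sample1.sam                # summary of record counts
samtools view -c -f 0x100 sample1.sam        # secondary alignments, expected 0
samtools view -c -f 0x4 sample1.sam          # unmapped records, expected 0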

Modifying select_ab_reads.nf

select_ab_reads.zip

Modified to use samtools to convert from SAM to fastq format.

// do minimap2 alignment to the specified reference
// then, use samtools to extract the reads that align

include { minimap2_alignment } from '../modules/local/minimap2' 

process subset_aligned_reads {
    tag {prefix}
    label 'process_high'

    conda (params.enable_conda ? 'bioconda::samtools=1.20' : null)
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/samtools:1.20--h50ea8bc_1' :
        'quay.io/biocontainers/samtools:1.20--h50ea8bc_1' }"

    input:
    tuple val(meta), path(sam_file), path(reads)

    output:
    tuple val(meta), path("*_ab_reads.fastq")

    script:

    // allow for a bunch of metadata (although the first element should be sample name)
    if(meta instanceof Collection) {
        prefix = meta[0]
    } else {
        prefix = meta
    }

    """
    # aligned read names are the first column of the PAF file
    # cut -f1 $sam_file > "${prefix}_ab_read_names.txt"

    # use these names as a pattern for seqkit grep to find
    # seqkit grep --by-name --use-regexp --threads ${task.cpus} \
    #-f "${prefix}_ab_read_names.txt" \
    #$reads \
    # -o "${prefix}_ab_reads.fastq"

    samtools fastq -@ ${task.cpus} $sam_file > ${prefix}_ab_reads.fastq

    """

}

workflow select_ab_reads {
    take:
        ab_selection_input

    main:
        // split input into reads + meta data and then reference file
        ab_selection_input
            .map{it[-1]}
            .set{reference}

        ab_selection_input
            .map{it[0..-2]}
            .map{tuple([it[0], it[2], it[3]], it[1])}
            .set{minimap2_input}

        // align reads using minimap2
        minimap2_alignment(minimap2_input, reference)

        // subset the aligned reads using samtools
        ab_reads = subset_aligned_reads(minimap2_alignment.out.paf_reads)

    emit:
        ab_reads

}
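If it helps, here is a rough way to convince yourself the two approaches select the same reads (a sketch, assuming you still have a PAF from the original pipeline; the file names are hypothetical):

# read IDs selected by the old approach (first column of the PAF)
cut -f1 old_alignment.paf | sort -u > old_ids.txt

# read IDs present in the new samtools-based output
awk 'NR % 4 == 1 { sub(/^@/, ""); print $1 }' sample1_ab_reads.fastq | sort -u > new_ids.txt

diff old_ids.txt new_ids.txt && echo "same reads selected"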

I hope this is helpful.

kzeglinski commented 1 month ago

Hey thank you so much! We usually only run the pipeline with 20-40 hybridomas, each of which has ~100k reads, so I never noticed how horribly this step scales as the number of reads increases. I really appreciate you taking the time to share your improved code and have now incorporated it into the pipeline.

Also, just on the topic of performance, I think one other place where things can be optimised is the consensus calling step. Back in 2021 when we developed this pipeline, the combination of racon + medaka was required to generate good consensus sequences, especially when there were only 5-20 reads per transcript. But now that we have kit 14, dorado v5 basecalling models, etc., I think it's a bit of overkill. I've had good results in some preliminary tests (100% accuracy achieved with 3-5 reads in most cases) just using abPOA, which is super fast. When I have some time in the next few months I'm planning on doing some formal testing of this, and will probably ultimately replace the racon + medaka consensus with abPOA. Just wanted to let you know in case that might be helpful for you!
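In case it's useful for anyone wanting to try this out, a minimal sketch of what an abPOA-based consensus call might look like on the command line (the input file name is hypothetical, and this is not what the pipeline currently does):

# generate a single consensus sequence (FASTA on stdout) from all reads for one transcript
abpoa transcript_reads.fastq > transcript_consensus.fasta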

gitkemp commented 1 month ago

@medmaca @kzeglinski Hi guys, I am trying out the NAb-seq workflow and implementing it in my lab. I am not a programmer or bioinformatician. I have a Dell Precision tower workstation with an RTX 4060. What changes do I need to make for it to work on my computer, and do you have any suggestions regarding the cDNA sequencing kit that comes with the MinION starter pack? I will use this kit for my first sequencing experiment.

medmaca commented 1 month ago


Glad it was useful, and thank you for the heads up regarding abPOA.