nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2
0 stars 2 forks source link

Incorporate logic from chim_ints.R script into dada2-nf pipeline #17

Open dhoogest opened 3 years ago

dhoogest commented 3 years ago

There is a small R script in the current NGS16S pipeline which extracts from the dada2.rda file the sequences which were dropped as chimeras and outputs a csv with columns weight and sequence. This file is not used for any downstream classification pipeline dependencies, however is useful to have for troubleshooting when a larger fraction of reads is lost during the chimera removal phase. The script currently outputs a file for each sample in the run, but it is not necessary to separate per sample if that is less convenient for the nextflow pipeline

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library(argparse, quietly = TRUE))
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE))

do_intermediates <- function(dada.path, id, out.path) {
    load(dada.path)
    pre <- as.data.frame(as.table(seqtab))
    post <- as.data.frame(as.table(seqtab.nochim))
    pre %>% anti_join(post, by=c('Var2')) %>%
        filter(Var1==id) %>% # ex: '624-27'
        arrange(-Freq) %>% rename(sequence=Var2) %>%
        rename(weight=Freq) %>% select(weight, sequence) %>%
        write_csv(out.path)
}

main <- function(arguments) {
    parser <- ArgumentParser()
    parser$add_argument('--rdata', default='dada2.rda')
    parser$add_argument('--id')
    parser$add_argument('--out')

    args <- parser$parse_args(arguments)

    do_intermediates(args$rdata, args$id, args$out)
}

main(commandArgs(trailingOnly=TRUE))
dhoogest commented 2 years ago

We'll need to bump this in response to https://gitlab.labmed.uw.edu/molmicro/NGS16S/-/issues/285

dhoogest commented 2 years ago

I've got changes in #49 to complete this issue. Need to find a sample which demonstrates significant abundance of chimera and add that to the ngs16s test data set. /cc @nhoffman

marykstewart commented 2 years ago

@dhoogest You asked for test specimens, here's one from today where 10% reads were dropped at the chimeras removed filtering step https://share.labmed.uw.edu/molmicro/markergene/22N0238_NGS16S/report/22R192-11/ Seems as good as any for a validation run, we know our current version of dada2 is flagging svs as chimeric here.