Allow analysis of 454 sequencing data

d4straub commented 8 months ago

Description of feature

Idea

454 sequencing is probably not used as much for amplicon sequencing as Illumina MiSeq nowadays, but it was popular some time ago and is still in use. It would be good to allow analysing 454 data with this pipeline as well.

Evaluation of requirements for analysing 454 data

Requirements

In order to allow standardized analysis of 454 data with nf-core/ampliseq, only minor additions would be needed. According to https://benjjneb.github.io/dada2/faq.html#can-i-use-dada2-with-my-454-or-ion-torrent-data, 454 sequencing data should be analysed with:

dada(..., HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32)
filterAndTrim(..., maxLen=XXX) # XXX depends on the chemistry

Already available

The above is quite close to the settings recommended for IonTorrent data (which are implemented in nf-core/ampliseq with --iontorrent):

dada(..., HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32)
filterAndTrim(..., trimLeft=15)

Using --iontorrent currently has these effects:

single-end reads are expected
the forward and reverse primers are expected to be present in each read
filterAndTrim is run with trimLeft = 15
denoising is run with BAND_SIZE = 32, HOMOPOLYMER_GAP_PENALTY = -1
taxonomic classification is also attempted with the reverse complement

I have never had 454 data, but a quick search suggests that it is single-end and that primers are typically expected at the beginning and end of the read, so that all seems fine. However, --iontorrent is similar but not ideal for 454, because it includes trimLeft=15, which isn't recommended for 454 data.

Short term solution for analysing 454 data

Warning: not tested, theoretical solution! Feedback needed!

In the current pipeline (v2.7.1), one could easily override this shortcoming of --iontorrent with -c pyroseq.config, where the config file pyroseq.config contains:

process {
    // Fall back to "Inf" (no upper length limit) when --max_len is not set
    max_len = params.max_len ?: "Inf"
    withName: DADA2_FILTNTRIM {
        // Arguments for filterAndTrim; trimLeft is intentionally not set,
        // unlike the --iontorrent default of trimLeft = 15
        ext.args = [
            'maxN = 0, truncQ = 2, trimRight = 0, minQ = 0, rm.lowcomplex = 0, orient.fwd = NULL, matchIDs = FALSE, id.sep = "\\\\s", id.field = NULL, n = 1e+05, OMP = TRUE, qualityType = "Auto"',
            "maxEE = ${params.max_ee}",
            "minLen = ${params.min_len}, maxLen = $max_len, rm.phix = TRUE"
        ].join(',').replaceAll('(,)*$', "")
        // Publish the recorded argument files
        publishDir = [
            path: { "${params.outdir}/dada2/args" },
            mode: params.publish_dir_mode,
            pattern: "*.args.txt"
        ]
    }
}

In addition, --max_len should be set appropriately.

So to conclude: Currently, for 454 data, use --iontorrent -c pyroseq.config --max_len <int> where the config file is described above and <int> depends on the chemistry.
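
Spelled out as a full command line, the recommendation above might look like the sketch below. It is untested, like the config itself. The samplesheet path, the primer sequences, the docker profile, the output directory and the --max_len value of 500 are placeholders chosen for illustration, not values from this issue; pyroseq.config is the file described above. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder inputs (assumptions for illustration only).
SAMPLESHEET="samplesheet.tsv"
FW_PRIMER="GTGYCAGCMGCCGCGGTAA"
RV_PRIMER="GGACTACNVGGGTWTCTAAT"
MAX_LEN=500   # placeholder; the right value depends on the 454 chemistry

# Assemble the command as positional parameters so every argument stays one word.
set -- nextflow run nf-core/ampliseq -r 2.7.1 \
    -profile docker \
    --input "$SAMPLESHEET" \
    --FW_primer "$FW_PRIMER" \
    --RV_primer "$RV_PRIMER" \
    --iontorrent \
    -c pyroseq.config \
    --max_len "$MAX_LEN" \
    --outdir results

CMD_LINE="$*"
echo "$CMD_LINE"

# Launch only when explicitly requested; the default is a dry run.
if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
fi
```

Keeping the run behind DRY_RUN makes it easy to check the assembled flags before starting a pipeline run.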

Implementation

An additional parameter such as --454 could be added that mirrors the --iontorrent settings, except for filterAndTrim(..., trimLeft=15).

454 test data would need to be added, and the usage documentation would need an update.

erikrikarddaniel commented 8 months ago

As soon as we know that Tobias' group (or someone else) gets this to work, it would be great to add. If someone from the group would like to contribute, this could be a perfect beginner's task.

My only, very slight, comment is that perhaps params can't be all numbers?

d4straub commented 8 months ago

perhaps params can't be all numbers?

Quite possible! Never tested :D
