hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes
27 stars 12 forks source link

PEAR as merging algorithm in PEMA #42

Closed georgiatari closed 2 years ago

georgiatari commented 2 years ago

Choosing PEAR as a merging algorithm option in PANDAseq seems to cause disruption and failing in PEMA v. 2.1.4, while the steps before PANDAseq run fine. I mention the last lines of the output file, after SPADes is finished:

Adjusting sequences using the BayesHammer algorithm of SPAdes has been completed. Fatal error: /home/modules/preprocess.bds, line 334, pos 1. Map 'params' does not have key 'elimination'. pema_latest.bds, line 84 : merging(paramsSpadesMerging, globalVars) preprocess.bds, line 263 : string merging(string{} params, string{} globalVars){ preprocess.bds, line 287 : for ( string correctFile : correct ) { preprocess.bds, line 334 : task $globalVars{'path'}/tools/PANDAseq/bin/pandaseq -f $forwardFile -r $reverseFile -6 \

ProgramCounter.pop(100): Node ID does not match! PC : PC: size 9 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903 -> 4904 -> 4906 Node Id : 4927 bdsNode Id : 4906 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 8 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903 -> 4904 Node Id : 4906 bdsNode Id : 4904 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 7 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 -> 4903 Node Id : 4904 bdsNode Id : 4903 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 6 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 -> 4779 Node Id : 4903 bdsNode Id : 4779 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 5 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 -> 4775 Node Id : 4779 bdsNode Id : 4775 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 4 / 0, nodes: 1422 -> 9771 -> 9772 -> 4554 Node Id : 4775 bdsNode Id : 4554 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 3 / 0, nodes: 1422 -> 9771 -> 9772 Node Id : 4554 bdsNode Id : 9772 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 2 / 0, nodes: 1422 -> 9771 Node Id : 9772 bdsNode Id : 9771 ProgramCounter.pop(100): Node ID does not match! PC : PC: size 1 / 0, nodes: 1422 Node Id : 9771 bdsNode Id : 1422

It is noted that elimination parameter is on as mentioned above, copied from the parameters file used:

################################################################ ################# PANDAseq (v. 2.11) ################### // https://storage.googleapis.com/pandaseq/pandaseq.html ################################################################ #

PANDAseq is the algorithm that PEMA uses in order to merge the paired-end reads.

PANDAseq has more than one merging algorithms.

#

Here, we set the algorithm used for assembly. The most common of them are:

pear --> uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provide

simple_bayesian --> uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.

other options are stich, flash and more that you can fing in the above link.

# pandaseqAlgorithm pear # #

PANDAseq is a I/O bound algorithm. That means that it needs severous time in order to handle the ipnut and output files

while the process is quite fast. However, it does support multithreading and here you can set the number of threads it is going to use.

# pandaseqThreads 20 # #

The 'minlen' parameter sets the minimum length for a sequence, after primers are removed.

By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.

In case you need to use this parameter, be sure you leave a tab after 'minlen' and set it like this: '-l 80'

If you do not want to use this parameter, please remove everything after the 'minlen'

# pandaseqMinlen
# #

The 'minoverlap' parameter sets the minimum overlap between forward and reverse reads.

By default, this is at least one nucleotide of overlap.

Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.

# minoverlap 12 # #

The 'threshold' parameter sets the score, between zero and one, that a sequence must meet to be kept in the output.

Any alignments lower than this will be discarded as low quality.

Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence.

It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.

# threshold 0.6 # #

The '-N' parameter eliminates all sequences with uncalled nucleotides in the output.

Otherwise, during assembly, uncalled bases (Ns) from unpaired regions may be emitted.

If you need -N to be on your analysis, please add '-N' after 'elimination'. Please make sure you leave a tab.

If you do not want the parameter to be on, please make sure there is nothing after the 'elimination' parameter.

# elimination -N # #

PEMA performs the PANDAseq algorithm, with the -a and the -B parameters also on.

That it for striping the primers after assembly, rather than before and allowing input sequences to lack a barcode/tag correspondingly.

# #

Nevertheless, PEAR can run standalone in Zorbas with the following commands:

module load pear/0.9.11 pear -f -r -o

Could this be considered please, in order to be fixed?

hariszaf commented 2 years ago

Hi @georgiatari !

Thanks for mentioning that, however the true error seems to be the

Fatal error: /home/modules/preprocess.bds, line 334, pos 1. Map 'params' does not have key 'elimination'.

Could you please:

It's a bit tricky as if you set this parameter you need to make sure you have a tab and then -N otherwise, you have to remove completely everything from that line.

If none of these is the case, then could you change only the pear option to another one so we seen if that runs properly? Thanks again. :slightly_smiling_face: