hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

Error PEMA ASV inference: Directory swarm does not exists #59

Closed savvas-paragkamian closed 11 months ago

savvas-paragkamian commented 1 year ago

The PEMA ASV inference fails, possibly because of a spelling error in the parameters file.

The following branch was possibly skipped because of that spelling error:

} else if (paramsDereplication{'clusteringAlgo'} == 'algo_Swarm') {

In the parameters file I wrote clusteringAlgo algo_swarm, whereas the hint there reads: write "Swarm" or "vsearch" or "CROP" after algo_.

In the initialize.bds script there is a line that creates the folder Swarm.
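Since the comparison quoted above appears to be case-sensitive, algo_swarm would match none of the branches. A minimal sketch of a check that could be run on the value before starting PEMA; the function name normalize_algo is hypothetical and not part of PEMA, and the three accepted spellings are taken from the hint in the parameters file:

```shell
# Hypothetical helper (not part of PEMA): map a clusteringAlgo value to the
# exact spelling given in the parameters file hint, ignoring letter case.
normalize_algo() {
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lowered" in
        algo_swarm)   printf 'algo_Swarm\n' ;;
        algo_vsearch) printf 'algo_vsearch\n' ;;
        algo_crop)    printf 'algo_CROP\n' ;;
        *)
            # Anything else is reported on stderr and signalled as a failure
            printf 'unknown clusteringAlgo: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

For example, normalize_algo algo_swarm prints algo_Swarm, the spelling the scripts compare against.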

The error:

Fatal error: /home/modules/taxAssignment.bds, line 11. Directory '/mnt/analysis/isd_crete_2016_20230823/7.mainOutput/gene_16S/swarm' does not exists
pema_latest.bds, line 156 :     if ( paramsForTaxAssign{'custom_ref_db'} != 'Yes'){
pema_latest.bds, line 158 :        if ( paramsForTaxAssign{'gene'} == 'gene_16S') {
pema_latest.bds, line 170 :           if (paramsForTaxAssign{'taxonomyAssignmentMethod'} != 'phylogeny') {
pema_latest.bds, line 172 :              crestAssign(paramsForTaxAssign, globalVars)
taxAssignment.bds, line 4 :     string crestAssign(string{} params, string{} globalVars) {
taxAssignment.bds, line 6 :        if ( params{'custom_ref_db'} != 'Yes') {
taxAssignment.bds, line 9 :           if ( (params{'gene'} == 'gene_16S' || params{'gene'} == 'gene_18S') && params{'taxonomyAssignmentMethod'} != 'phylogeny' ) {
taxAssignment.bds, line 11 :             globalVars{'assignmentPath'}.chdir()

The parameters file: parameters0f.isd_crete_2016_20230823.txt

hariszaf commented 1 year ago

Hi @savvas-paragkamian and thanks for sharing.

I am a bit confused though: have you tried using clusteringAlgo algo_Swarm, as suggested?

If yes, do you still get an error?

savvas-paragkamian commented 1 year ago

I wrote clusteringAlgo algo_swarm (lowercase "s"), whereas the suggested spelling is clusteringAlgo algo_Swarm.

The initialize script also contains:

      } else if ( params{'clusteringAlgo'} == 'algo_Swarm' ) {
         string algo = 'Swarm'
         algo.mkdir()
      }

Maybe that's why the folder wasn't created?

Currently, I am running the analysis from the dereplication step checkpoint. I have manually created both folders, Swarm and swarm (777 permissions), in the directory 7.mainOutput/gene_16S.

In addition, I changed clusteringAlgo in the parameters file to algo_Swarm.

My question is: if I change the parameters file and continue the analysis from the checkpoint, does PEMA pick up the new parameter values?

hariszaf commented 1 year ago

Let's break this down into steps! :wink: First, could you please give it a shot with just a few samples (2-3), setting clusteringAlgo to algo_Swarm?

If you have already tried that, did you get an error?

"if I change the parameters file and continue the analysis from the checkpoint, PEMA reads the new changes of the parameters?"

At the beginning of each checkpoint, PEMA reads the parameters file again via the readParameterFile() function.
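Before resuming from a checkpoint, it can help to confirm which value the parameters file currently holds. A minimal sketch, assuming the file stores one whitespace-separated "key value" pair per line, as in clusteringAlgo algo_Swarm; get_param is a hypothetical helper, not a PEMA function, and it does not reproduce what readParameterFile() does internally:

```shell
# Hypothetical helper (not part of PEMA): print the value stored for a key
# in a "key value" parameters file. Usage: get_param FILE KEY
get_param() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1; exit }   # first match wins
        END       { if (!found) exit 1 }          # fail if the key is absent
    ' "$1"
}
```

Running get_param on the parameters file with the key clusteringAlgo should print algo_Swarm once the file has been corrected.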

savvas-paragkamian commented 1 year ago

The new job with 4 samples and with the correct algo_Swarm parameter works!

The job from the checkpoint failed with the following error:

bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/home/scripts/dereplicateSwarm.sh: line 99: 29689 Killed                  awk 'BEGIN {FS = "[>_]"}

     # Parse the sample files
     /^>/ {contingency[$2][FILENAME] = $3
           amplicons[$2] += $3
           if (FNR == 1) {
               samples[++i] = FILENAME
           }
          }

     END {# Create table header
          printf "amplicon"
          s = length(samples)
          for (i = 1; i <= s; i++) {
              printf "\t%s", samples[i]
          }
          printf "\t%s\n", "total"

          # Sort amplicons by decreasing total abundance (use a coprocess)
          command = "LC_ALL=C sort -k1,1nr -k2,2d"
          for (amplicon in amplicons) {
               printf "%d\t%s\n", amplicons[amplicon], amplicon |& command
          }
          close(command, "to")
          FS = "\t"
          while ((command |& getline) > 0) {
              amplicons_sorted[++j] = $2
          }
          close(command)

          # Print the amplicon occurrences in the different samples
          n = length(amplicons_sorted)
          for (i = 1; i <= n; i++) {
               amplicon = amplicons_sorted[i]
               printf "%s", amplicon
               for (j = 1; j <= s; j++) {
                   printf "\t%d", contingency[amplicon][samples[j]]
               }
               printf "\t%d\n", amplicons[amplicon]
          }}' linearized.dereplicate* > ../amplicon_contingency_table.tsv
Fatal error: /home/modules/preprocess.bds, line 449, pos 5. Exec failed.
    Exit value : 137
    Command    :  bash /home/scripts/dereplicateSwarm.sh
pema_latest.bds, line 103 : if ( paramsDereplication{'clusteringAlgo'} == 'algo_Swarm' ) {
pema_latest.bds, line 106 :    swarmDereplicate(paramsDereplication, globalVars)
preprocess.bds, line 445 :  string swarmDereplicate(string{} params, string{} globalVars){
preprocess.bds, line 449 :      sys bash $globalVars{'path'}/scripts/dereplicateSwarm.sh
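
A note on reading the log above: by shell convention an exit value above 128 means the process was terminated by a signal, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" line reported for awk. Running out of memory is a common cause of that, but the log alone does not confirm it. A minimal sketch for decoding such values; describe_exit is a hypothetical helper name:

```shell
# Hypothetical helper: describe a shell exit status. Values above 128
# conventionally mean "terminated by signal (status - 128)".
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l translates a signal number into its name
        printf 'killed by signal %d (%s)\n' "$sig" "$(kill -l "$sig")"
    else
        printf 'exited with status %d\n' "$status"
    fi
}

describe_exit 137
```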

hariszaf commented 1 year ago

Thanks for sharing, @savvas-paragkamian. The issue is with the global parameters: they are set during initialization and are not updated when the analysis resumes from a checkpoint.

I'll fix that as part of PEMA v.2.1.5 and get back to you as soon as it's released.