marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu v.2.2: Failed to generate corrected reads! #2313

Closed: Voseck closed this issue 5 months ago

Voseck commented 6 months ago

Hello!

This is my first time using Canu (and posting on GitHub). I'm assembling a bacterial genome from PacBio whole-genome reads, so I'm running the default pipeline (correction, trimming, and assembly). However, no corrected reads are being generated. I've followed the instructions carefully and, suspecting an error in my command or a problem with my reads, even tried the Canu tutorial with the E. coli K12 genome, but I'm still stuck with the same problem. Please help!

This is the script I'm using to run the code:

#!/bin/bash

#SBATCH --job-name=Canu                 #Job name
#SBATCH -p bigmem                       #Partition to use, Default=short (see partitions and limits in /hpcfs/shared/README/partitions.txt)
#SBATCH -N 1                            #Nodes required, Default=1
#SBATCH -n 1                            #Parallel tasks, recommended for MPI, Default=1
#SBATCH --cpus-per-task=32              #Cores per task, recommended for multi-threading, Default=1
#SBATCH --mem=32000                     #Memory in MB per node, Default=2048
#SBATCH --time=300:00:00                #Maximum run time, Default=2 hours
#SBATCH -o TEST_Canu                    #Output file name

#canu [-haplotype|-correct|-trim] \
#   [-s <assembly-specifications-file>] \
#   -p <assembly-prefix> \
#   -d <assembly-directory> \
#   genomeSize=<number>[g|m|k] \
#   [other-options] \
#   [-trimmed|-untrimmed|-raw|-corrected] \
#   [-pacbio|-nanopore|-pacbio-hifi] *fastq

module load canu/2.2

#curl -L -o pacbio.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq

canu -p ecoli -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq

echo done

These are all the files generated by running Canu:

[Screenshot of the output directory listing, taken 2024-05-04 12:40 PM]

And the report generated upon completing the job:

-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_92' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 0' (from 'gnuplot') and image format 'svg'.
-- Detected 40 CPUs and 565 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  10 hosts with  40 cores and  565 GB memory under Slurm control.
-- Found   2 hosts with  48 cores and  503 GB memory under Slurm control.
-- Found   1 host  with  48 cores and 1007 GB memory under Slurm control.
-- Found   3 hosts with  48 cores and  251 GB memory under Slurm control.
-- Found   2 hosts with  32 cores and  187 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   4 compute threads and   16 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    2 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    1 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   4 compute threads and   20 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    8 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    5 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    8 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   2 compute threads and   10 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 4800000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
-- mhap precompute attempt 1 begins with 0 finished, and 2 to compute.
----------------------------------------
-- Starting command on Sat May  4 09:38:07 2024 with 54076.687 GB free disk space

      sbatch \
        --mem-per-cpu=768m --cpus-per-task=8 \
        -D `pwd` -J "cormhap_ecoli" \
        -a 1-2 \
        -o /hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/03_Tutorialcanu/ecoli_pacbio/correction/1-overlapper/precompute.%A_%a.out \
        /hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/03_Tutorialcanu/ecoli_pacbio/correction/1-overlapper/precompute.sh 0

Submitted batch job 214549

-- Finished on Sat May  4 09:38:07 2024 (lickety-split) with 54076.687 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Sat May  4 09:38:07 2024 with 54076.687 GB free disk space

    sbatch \
      --mem-per-cpu=8g \
      --cpus-per-task=1   \
      --depend=afterany:214549 \
      -D `pwd` \
      -J "canu_ecoli" \
      -o /hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/03_Tutorialcanu/ecoli_pacbio/canu-scripts/canu.04.out \
      /hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/03_Tutorialcanu/ecoli_pacbio/canu-scripts/canu.04.sh
Submitted batch job 214550

-- Finished on Sat May  4 09:38:07 2024 (lickety-split) with 54076.687 GB free disk space
----------------------------------------
done

I'm seeing the same problem when I run Canu on my own PacBio reads. My apologies in advance if this issue has already been addressed in another post. Thanks.

skoren commented 6 months ago

Canu runs by submitting itself to the grid (https://canu.readthedocs.io/en/latest/faq.html#how-do-i-run-canu-on-my-slurm-sge-pbs-lsf-torque-system), so the initial job exits as soon as the submission is done; the pipeline is still running on the grid in the background. However, if you launched it more than once with the same -d directory, the runs would collide and cause issues. If that was the case, make sure any Canu jobs on your cluster are killed, remove the -d folder completely, and run it again. If you only launched it once, it should still be running on the grid, with the latest log in canu.out.
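
For example, something along these lines (just a sketch; adjust the script name and the -d directory to whatever you actually used):

      # check whether any Canu-related jobs are still queued or running
      squeue -u $USER

      # if you need to start over: cancel leftover jobs, delete the assembly directory, resubmit
      scancel <jobid>
      rm -rf ecoli_pacbio
      sbatch your_canu_script.sh

      # otherwise, just follow the latest log Canu writes in the -d directory
      tail -f ecoli_pacbio/canu.out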

Voseck commented 6 months ago

Thank you for your prompt response. I apologize for my lack of knowledge, but could you please clarify what you mean by:

Canu runs by submitting itself to the grid (https://canu.readthedocs.io/en/latest/faq.html#how-do-i-run-canu-on-my-slurm-sge-pbs-lsf-torque-system), so the initial job exits as soon as the submission is done; the pipeline is still running on the grid in the background.

Are you saying that if Canu is still running, it should be visible among the jobs on the cluster? I tried to verify this with squeue -u "my user ID", but no running jobs are displayed; the list is just blank. So I assume the job is completed, right? Even when I use "scontrol show job 214602", the output appears as follows:

JobId=214602 JobName=canu_CIX4198
   UserId=cj.rodriguez12(10325) GroupId=ciencias_biologicas(10101) MCS_label=N/A
   Priority=12381 Nice=0 Account=local QOS=normal WCKey=*default
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:06 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2024-05-04T17:16:30 EligibleTime=2024-05-04T17:17:49
   AccrueTime=Unknown
   StartTime=2024-05-04T17:17:49 EndTime=2024-05-04T17:17:55 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-04T17:17:49
   Partition=short AllocNode:Sid=nodei-3:2191089
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nodea-6
   BatchHost=nodea-6
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=8G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/01_PacbioGenomes/CIX4198-pacbio/canu-scripts/canu.01.sh
   WorkDir=/hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/01_PacbioGenomes/CIX4198-pacbio
   StdErr=/hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/01_PacbioGenomes/CIX4198-pacbio/canu-scripts/canu.01.out
   StdIn=/dev/null
   StdOut=/hpcfs/home/ciencias_biologicas/cj.rodriguez12/04_Polish/01_PacbioGenomes/CIX4198-pacbio/canu-scripts/canu.01.out
   Power=
   NtasksPerTRES:0

I did what you suggested: I made sure that any Canu jobs on the cluster were killed, removed the -d folder completely, and ran it again. I still have the same problem.

Looking forward to your response.

skoren commented 6 months ago

The logs from the last submitted job should be in canu.out; please post that along with the report file.
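
Both should be in the assembly (-d) directory; for the tutorial run above that would be something like this (assuming the -p prefix "ecoli", so the report is written as ecoli.report):

      cat ecoli_pacbio/canu.out       # combined log from the grid-submitted jobs
      cat ecoli_pacbio/ecoli.report   # summary report written by Canu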

skoren commented 5 months ago

Idle, no reply.