marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

AS_UTL_symlink()-- Failed to make link: File exists #483

Closed steinjosh closed 7 years ago

steinjosh commented 7 years ago

Hi, I am stuck at the assembly stage of a ~450 Mb genome using canu version 1.4. Here is the original command:

canu -p cargold -d cargold.out genomeSize=450m -pacbio-raw ../20170*/*/Analysis_Results/*.fastq

Here is the stage I am stuck at:

-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 7' (from 'gnuplot') and image format 'svg'.
-- Detected 40 CPUs and 126 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  57 hosts with  40 cores and  125 GB memory under Slurm control.
-- Found   3 hosts with 120 cores and 1513 GB memory under Slurm control.
-- Found   2 hosts with 120 cores and 1513 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and    6 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  10 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 450000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- Corrected reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.correctedReads.fasta.gz'.
--
--
-- BEGIN TRIMMING
--
--
-- Trimmed reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.trimmedReads.fasta.gz'
--
--
-- BEGIN ASSEMBLY
--
----------------------------------------
-- Starting command on Mon May  8 13:40:31 2017 with 1271464.103 GB free disk space

    /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition \
      -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
      -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1 \
      -b 20000 \
      -p 64 \
    > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err 2>&1

-- Finished on Mon May  8 13:40:32 2017 (1 second) with 1271464.103 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 1.  (rc=256)
ERROR:
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

Disk space available:  1271464.103 GB

Last 50 lines of the relevant log file (/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err):

AS_UTL_symlink()-- Failed to make link '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.gkpStore/info' pointing to file '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore/info': File exists

canu failed with 'failed to partition the reads'.

I've tried restarting with the original command and get the same error. Should I unlink the problem file (/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.gkpStore/info) and then restart?

Thanks for your help.

Josh

brianwalenz commented 7 years ago

Delete the whole partitionedReads.gkpStore directory, then restart. If it fails again, try running that 'gatekeeperPartition' command by hand (or you can just run it by hand first).
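In shell terms, that recovery might look like the sketch below, using the paths from the log above; the WRK variable is just shorthand introduced here, and the canu install path is the one from this cluster:

```shell
# Sketch of the suggested recovery; adjust WRK and the canu install
# path for your system.
WRK=/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging
GKP=/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition

# Remove the partially built partitioned store so its 'info' symlink
# can be recreated from scratch on the next attempt.
rm -rf "$WRK/cargold.ctgStore/partitionedReads.gkpStore"

# Optionally rerun the failing command by hand to see its errors directly.
if [ -x "$GKP" ]; then
  "$GKP" -G "$WRK/cargold.gkpStore" \
         -T "$WRK/cargold.ctgStore" 1 \
         -b 20000 -p 64
fi
```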

When this assembly finishes, upgrade to v1.5, too. v1.5 probably won't work with the current assembly, and you're nearly finished.

steinjosh commented 7 years ago

Hi, thanks for your help. I deleted the partitionedReads.gkpStore directory, restarted, and got this:

-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 7' (from 'gnuplot') and image format 'svg'.
-- Detected 40 CPUs and 126 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  57 hosts with  40 cores and  125 GB memory under Slurm control.
-- Found   2 hosts with 120 cores and 1513 GB memory under Slurm control.
-- Found   3 hosts with 120 cores and 1513 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and    6 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  10 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 450000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- Corrected reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.correctedReads.fasta.gz'.
--
--
-- BEGIN TRIMMING
--
--
-- Trimmed reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.trimmedReads.fasta.gz'
--
--
-- BEGIN ASSEMBLY
--
----------------------------------------
-- Starting command on Mon May  8 15:55:42 2017 with 1271167.665 GB free disk space

    /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition \
      -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
      -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1 \
      -b 20000 \
      -p 64 \
    > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err 2>&1
sh: line 5: 35249 Segmentation fault      /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1 -b 20000 -p 64 > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err 2>&1

-- Finished on Mon May  8 15:55:43 2017 (1 second) with 1271167.665 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 139.  (rc=35584)
ERROR:
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

Disk space available:  1271167.665 GB

Last 50 lines of the relevant log file (/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err):

Partition 23 has 0 tigs and 19030 reads.
Partition 24 has 0 tigs and 19457 reads.
Partition 25 has 0 tigs and 12989 reads.
Partition 26 has 0 tigs and 26461 reads.
Partition 27 has 0 tigs and 17 reads.
Partition 28 has 0 tigs and 26049 reads.
Partition 29 has 0 tigs and 989 reads.
Partition 30 has 0 tigs and 29426 reads.
Partition 31 has 0 tigs and 4758 reads.
Partition 32 has 0 tigs and 29692 reads.
Partition 33 has 0 tigs and 1824 reads.
Partition 34 has 0 tigs and 27529 reads.
Partition 35 has 0 tigs and 6419 reads.
Partition 36 has 0 tigs and 17804 reads.
Partition 37 has 0 tigs and 17813 reads.
Partition 38 has 0 tigs and 20044 reads.
Partition 39 has 0 tigs and 19608 reads.
Partition 40 has 0 tigs and 16969 reads.
Partition 41 has 0 tigs and 19706 reads.
Partition 42 has 0 tigs and 3938 reads.
Partition 43 has 0 tigs and 19299 reads.
Partition 44 has 0 tigs and 16301 reads.
Partition 45 has 0 tigs and 13970 reads.
Partition 46 has 0 tigs and 17547 reads.
Partition 47 has 0 tigs and 16897 reads.
Partition 48 has 0 tigs and 17617 reads.

Failed with 'Segmentation fault'

Backtrace (mangled):

/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition(_Z17AS_UTL_catchCrashiP7siginfoPv+0x2d)[0x40341d]
/lib64/libpthread.so.0(+0xf790)[0x2acca22e8790]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x40244d]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x402ab3]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2acca2514d5d]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x402019]

Backtrace (demangled):

[0] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition::AS_UTL_catchCrash(int, siginfo*, void*) + 0x2d  [0x40341d]
[1] /lib64/libpthread.so.0::(null) + 0xf790  [0x2acca22e8790]
[2] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x40244d]
[3] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x402ab3]
[4] /lib64/libc.so.6::(null) + 0xfd  [0x2acca2514d5d]
[5] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x402019]

GDB:

canu failed with 'failed to partition the reads'.

Next I ran the below command and restarted, and got an error similar to the original posting.

$ /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1 -b 20000 -p 64 > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/partitionedReads.err 2>&1 &

I have also tried deleting the directory, running the above command, and then restarting canu, but I still get a similar error about failing to make a symlink.

Any additional suggestions?

Thanks, Josh


brianwalenz commented 7 years ago

Are you able to share this data? I'd need unitigging/cargold.gkpStore (the reads) and unitigging/cargold.ctgStore (the contig layouts).

The gatekeeperPartition command is failing, but I can't tell where based on the logging. Let's try varying parameters and see if we can get around it. Remove unitigging/cargold.ctgStore/partitionedReads.gkpStore (the output of this program) and try varying the -b setting (make it bigger; it's the number of reads per partition) and the -p setting (make it smaller; it's the number of partitions):

/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition \
 -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
 -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1 \
  -b 20000 -p 64

If values are found that work, set canu options cnsPartitionMin to the -b value and cnsPartitions to the -p value, remove that pesky partitionedReads.gkpStore directory again and restart canu.
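A parameter sweep along those lines might look like the sketch below; the -b/-p combinations are illustrative guesses, not known-good values, and the paths are the ones from this assembly's logs:

```shell
# Sketch: try successively larger -b / smaller -p combinations until
# one succeeds. The value pairs here are illustrative guesses.
WRK=/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging
GKP=/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition

for bp in "40000 32" "80000 16" "160000 8"; do
  set -- $bp   # $1 = -b value, $2 = -p value
  # Clear the partial output before each attempt.
  rm -rf "$WRK/cargold.ctgStore/partitionedReads.gkpStore"
  echo "trying -b $1 -p $2"
  if [ -x "$GKP" ]; then
    "$GKP" -G "$WRK/cargold.gkpStore" \
           -T "$WRK/cargold.ctgStore" 1 \
           -b "$1" -p "$2" && break
  fi
done
```

If one of the combinations works, its -b and -p values are what would go into the cnsPartitionMin and cnsPartitions canu options described above.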

steinjosh commented 7 years ago

Thanks again for your help. I have posted the data at this URL: http://de.cyverse.org/dl/d/CF9D14AE-15CA-453B-8FED-036296795526/stein_unitigging_for_troubleshooting.tar.gz.

I tried varying -b and -p (-b 25000 -p 50, and -b 30000 -p 32) but really had no idea what was reasonable. Here is an example error (below).

Thanks, Josh

$ /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition  -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore  -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 1   -b 25000 -p 50
For 966318 reads, will make 39 partitions with up to 24778 reads in each.
Partition 1 has 0 tigs and 35254 reads.
Partition 2 has 0 tigs and 37 reads.
Partition 3 has 0 tigs and 29217 reads.
Partition 4 has 0 tigs and 425 reads.
Partition 5 has 0 tigs and 25780 reads.
Partition 6 has 0 tigs and 4170 reads.
Partition 7 has 0 tigs and 24123 reads.
Partition 8 has 0 tigs and 23853 reads.
Partition 9 has 0 tigs and 19300 reads.
Partition 10 has 0 tigs and 15219 reads.
Partition 11 has 0 tigs and 14595 reads.
Partition 12 has 0 tigs and 13500 reads.
Partition 13 has 0 tigs and 14859 reads.
Partition 14 has 0 tigs and 23291 reads.
Partition 15 has 0 tigs and 19356 reads.
Partition 16 has 0 tigs and 17033 reads.
Partition 17 has 0 tigs and 17434 reads.
Partition 18 has 0 tigs and 24416 reads.
Partition 19 has 0 tigs and 21908 reads.
Partition 20 has 0 tigs and 24399 reads.
Partition 21 has 0 tigs and 24391 reads.
Partition 22 has 0 tigs and 11238 reads.
Partition 23 has 0 tigs and 26461 reads.
Partition 24 has 0 tigs and 17 reads.
Partition 25 has 0 tigs and 26049 reads.
Partition 26 has 0 tigs and 989 reads.
Partition 27 has 0 tigs and 29426 reads.
Partition 28 has 0 tigs and 4758 reads.
Partition 29 has 0 tigs and 29692 reads.
Partition 30 has 0 tigs and 1824 reads.
Partition 31 has 0 tigs and 27529 reads.
Partition 32 has 0 tigs and 24223 reads.
Partition 33 has 0 tigs and 17813 reads.
Partition 34 has 0 tigs and 20044 reads.
Partition 35 has 0 tigs and 19608 reads.
Partition 36 has 0 tigs and 20540 reads.
Partition 37 has 0 tigs and 20073 reads.
Partition 38 has 0 tigs and 24558 reads.
Partition 39 has 0 tigs and 14597 reads.
Partition 40 has 0 tigs and 23301 reads.
Partition 41 has 0 tigs and 21558 reads.
Partition 42 has 0 tigs and 21147 reads.

Failed with 'Segmentation fault'

Backtrace (mangled):

/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition(_Z17AS_UTL_catchCrashiP7siginfoPv+0x2d)[0x40341d]
/lib64/libpthread.so.0(+0xf790)[0x7f0e019f3790]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x40244d]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x402ab3]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0e0166ed5d]
/software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition[0x402019]

Backtrace (demangled):

[0] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition::AS_UTL_catchCrash(int, siginfo*, void*) + 0x2d  [0x40341d]
[1] /lib64/libpthread.so.0::(null) + 0xf790  [0x7f0e019f3790]
[2] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x40244d]
[3] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x402ab3]
[4] /lib64/libc.so.6::(null) + 0xfd  [0x7f0e0166ed5d]
[5] /software/apps/canu/gcc/64/1.4/bin/gatekeeperPartition() [0x402019]

GDB:

Segmentation fault
brianwalenz commented 7 years ago

Thanks, got the data; you can take it down. It wouldn't download with curl (it stopped after 1 GB), but worked with the browser. Weird.

brianwalenz commented 7 years ago

There's an invalid contig in there - it claims to be 510kbp with 986 reads, but the data for the reads is missing. I could probably patch around it without too much trouble, but you'd lose whatever sequence this is. I don't really want to do the patch, and you don't want to lose the sequence.

Do you have time and compute to restart with v1.5? We've improved read correction, the gfa output and (if you run with the absolute latest code) consensus.

Incidentally, I see 4.30x of unassembled stuff (pretty high; d.mel has only 0.9x) and size statistics of:

lenContig ng10    14670728 bp   lg10       3   sum   47283037 bp
lenContig ng20    13393313 bp   lg20       6   sum   89437368 bp
lenContig ng30    11762495 bp   lg30       9   sum  127007647 bp
lenContig ng40     9395914 bp   lg40      13   sum  166124999 bp
lenContig ng50     7273226 bp   lg50      18   sum  206087363 bp
lenContig ng60     6345452 bp   lg60      24   sum  246221061 bp
lenContig ng70     4262020 bp   lg70      32   sum  283447190 bp
lenContig ng80     2608582 bp   lg80      43   sum  320877471 bp
lenContig ng90     1041226 bp   lg90      67   sum  360471359 bp
lenContig sum  383206180 (genomeSize 400000000)
lenContig num        236
lenContig ave    1623755

This hints that maybe the allowed error rate for assembly should be increased a bit. Not to sound like a cheerleader for the newer version, but....easier to do with v1.5. ;-)

brianwalenz commented 7 years ago

Further investigation - it looks like some bytes in the ctgStore data got set to zero. I've seen this occasionally, with no good resolution. Network problems? Storage problems? Who knows.

Remove 4-unitigger, 5-consensus, the ctgStore and utgStore, and any other miscellaneous crud created after the files in 4-unitigger ('ls -ltr' in unitigging/ will show these) and restart.
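As a sketch, assuming the same output directory as the logs above, that cleanup could be:

```shell
# Sketch of the cleanup described above; OUT is this assembly's
# output directory.
OUT=/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out

# Show files in creation order, so anything made after the 4-unitigger
# outputs is easy to spot (and remove by hand).
ls -ltr "$OUT/unitigging" 2>/dev/null || true

# Remove the unitigger and consensus outputs so canu recomputes them.
rm -rf "$OUT/unitigging/4-unitigger" \
       "$OUT/unitigging/5-consensus" \
       "$OUT/unitigging/cargold.ctgStore" \
       "$OUT/unitigging/cargold.utgStore"
```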

steinjosh commented 7 years ago

Hi Brian,

I am game to try v1.5. Given your observation suggesting the allowed error rate should be increased, can you suggest a parameter setting? Also, I had previously slogged through the correction phase, repeatedly restarting until all the reads were processed. Here is the history:

$ grep 'correction jobs' canu.*.out
canu.11.out:-- 321 read correction jobs failed:
canu.12.out:-- 212 read correction jobs failed:
canu.14.out:-- 182 read correction jobs failed:
canu.15.out:-- 71 read correction jobs failed:
canu.17.out:-- 31 read correction jobs failed:
canu.18.out:-- 13 read correction jobs failed:
canu.20.out:-- 7 read correction jobs failed:
canu.21.out:-- 5 read correction jobs failed:
canu.23.out:-- 3 read correction jobs failed:
canu.24.out:-- 3 read correction jobs failed:
canu.26.out:-- 3 read correction jobs failed:
canu.27.out:-- 3 read correction jobs failed:
canu.29.out:-- 3 read correction jobs failed:
canu.30.out:-- 2 read correction jobs failed:
canu.32.out:-- 1 read correction jobs failed:
canu.33.out:-- 1 read correction jobs failed:

I would rather not repeat this "war of attrition" strategy. Is there a parameter I could set to avoid this?

I pasted an example below, and attached another example. The only other change I would make is genome size. I have seen a couple of other PacBio assemblies of this species just under 400 Mb, so I would change genome size to 400 instead of 450 Mb. I don’t know how much that would matter.

If you can, please make recommendations for this command:

$ canu -p cargold -d cargold.out genomeSize=400m -pacbio-raw ../20170*/*/Analysis_Results/*.fastq

Thanks, Josh

$ less canu.24.out -- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available. -- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java'). -- Detected gnuplot version '4.6 patchlevel 7' (from 'gnuplot') and image format 'svg'. -- Detected 40 CPUs and 126 gigabytes of memory. -- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo. -- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.

-- Found 57 hosts with 40 cores and 125 GB memory under Slurm control. -- Found 2 hosts with 120 cores and 1513 GB memory under Slurm control. -- Found 3 hosts with 120 cores and 1513 GB memory under Slurm control.

-- Allowed to run under grid control, and use up to 8 compute threads and 25 GB memory for stage 'bogart (unitigger)'. -- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'mhap (overlapper)'. -- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'mhap (overlapper)'. -- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'mhap (overlapper)'. -- Allowed to run under grid control, and use up to 5 compute threads and 6 GB memory for stage 'read error detection (overlap error adjustment)'. -- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'. -- Allowed to run under grid control, and use up to 8 compute threads and 25 GB memory for stage 'utgcns (consensus)'. -- Allowed to run under grid control, and use up to 1 compute thread and 3 GB memory for stage 'overlap store parallel bucketizer'. -- Allowed to run under grid control, and use up to 1 compute thread and 16 GB memory for stage 'overlap store parallel sorting'. -- Allowed to run under grid control, and use up to 1 compute thread and 3 GB memory for stage 'overlapper'. -- Allowed to run under grid control, and use up to 8 compute threads and 8 GB memory for stage 'overlapper'. -- Allowed to run under grid control, and use up to 8 compute threads and 8 GB memory for stage 'overlapper'. -- Allowed to run under grid control, and use up to 10 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'. -- Allowed to run under grid control, and use up to 4 compute threads and 12 GB memory for stage 'falcon_sense (read correction)'. -- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'minimap (overlapper)'. -- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'minimap (overlapper)'. 
-- Allowed to run under grid control, and use up to 10 compute threads and 13 GB memory for stage 'minimap (overlapper)'.

-- This is canu parallel iteration #3, out of a maximum of 2 attempts.

-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 450000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045

--
-- BEGIN CORRECTION
--

--
-- 3 read correction jobs failed:
--   job /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/correction/2-correction/correction_outputs/0018.fasta FAILED.
--   job /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/correction/2-correction/correction_outputs/0194.fasta FAILED.
--   job /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/correction/2-correction/correction_outputs/0291.fasta FAILED.

Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'failed to generate corrected reads. Made 2 attempts, jobs still failed'.

brianwalenz commented 7 years ago

In 1.5, it's correctedErrorRate. I don't have good suggestions for what to set it to, but testing is relatively quick: it's the same restart process I just described (remove 4-unitigger, etc., and restart). This is something I've wanted to automate; maybe now is a good time. It amounts to running 4-unitigger/unitigger.sh (the 'bogart' command in there) with different settings, then picking the 'best' result.

That is a lot of iterations. Do you know why the correction jobs were failing? Errors would be listed in the correction/2-correction/ outputs. It might just be a matter of increasing the allowed wall clock time for the jobs; gridOptions="-t 2-0" would set it to two days and zero hours.

Do the nodes have local scratch? This stage is VERY I/O intensive, and Canu can copy the data to local scratch disk for much better performance. That jobs eventually manage to run hints the problem is load dependent: with a lightly loaded file system the jobs run faster; heavily loaded, they slow to a crawl and eventually get killed. http://canu.readthedocs.io/en/latest/parameter-reference.html#file-staging
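For example, a staged invocation might look roughly like this. This is a sketch only: `stageDirectory` is the parameter described on the file-staging page linked above, and the `/local/scratch/$USER` path is a placeholder for whatever local disk your cluster's nodes actually provide.

```shell
canu -p cargold -d cargold.out genomeSize=450m \
  gridOptions="-t 2-0" \
  stageDirectory=/local/scratch/$USER \
  -pacbio-raw ../20170*/*/Analysis_Results/*.fastq
```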

steinjosh commented 7 years ago

Okay, so I’ll give this a try first before switching to v1.5.

Josh

steinjosh commented 7 years ago

Hi Brian, I am still plugging away at this assembly using canu v1.4. Following your earlier advice, I deleted "4-unitigger, 5-consensus, the ctgStore and utgStore, and any other miscellaneous crud" and restarted with an increased time allowance (-gridOptions="-t 2-0"), and the assembly ran without apparent issues for the last week, until yesterday.

I'm looking for advice on how to troubleshoot and restart. I did not see anything revealing in the canu-scripts/canu.*.out files. Naively restarting led to immediate failure (the tail end of stderr):

----------------------------------------
-- Starting command on Tue May 16 08:55:39 2017 with 1266914.763 GB free disk space

    sbatch \
      --mem-per-cpu=4g \
      --cpus-per-task=1 \
      -t 2-0  \
      -D `pwd` \
      -J "canu_cargold" \
      -o /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/canu-scripts/canu.64.out /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/canu-scripts/canu.64.sh
sbatch: error: Batch script is empty!

-- Finished on Tue May 16 08:55:39 2017 (lickety-split) with 1266914.763 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 1.  (rc=256)
ERROR:
================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

 at /software/apps/canu/gcc/64/1.4/bin/lib/canu/Execution.pm line 1305
    canu::Execution::caFailure('Failed to submit script', undef) called at /software/apps/canu/gcc/64/1.4/bin/lib/canu/Execution.pm line 664
    canu::Execution::submitScript('/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out', 'cargold', undef) called at /software/apps/canu/gcc/64/1.4/bin/canu line 374

canu failed with 'Failed to submit script'.

Indeed, canu.64.sh was an empty file, so I deleted it. After further trials I also deleted canu.62.out and canu.63.out, which were also empty files.

-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52589 May 11 11:17 canu.61.sh
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52607 May 11 11:17 canu.62.sh
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002  4788 May 11 11:17 canu.61.out
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002     0 May 13 11:17 canu.62.out
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52625 May 13 11:17 canu.63.sh
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002     0 May 15 11:17 canu.63.out
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002     0 May 16 08:52 canu.64.sh

After deleting:

-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52589 May 11 11:17 canu.61.sh
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52607 May 11 11:17 canu.62.sh
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002  4788 May 11 11:17 canu.61.out
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002 52625 May 13 11:17 canu.63.sh
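As an aside, zero-size leftovers like these can be listed mechanically with `find -size 0` rather than eyeballed from `ls -l`. A self-contained sketch (the directory and file names below are fabricated for the demo, not Josh's actual run):

```shell
# Demo: locate zero-size files, like the empty canu.64.sh above.
# A throwaway directory stands in for cargold.out/canu-scripts/.
demo=$(mktemp -d)
printf '#!/bin/sh\necho ok\n' > "$demo/canu.61.sh"  # a normal, non-empty script
: > "$demo/canu.64.sh"                              # an empty script, as in the failure
empty_files=$(find "$demo" -type f -size 0)
echo "$empty_files"                                 # lists only the empty canu.64.sh
```

The same one-liner run against the real canu-scripts and canu-logs directories would show every truncated script and log in one pass.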

Also here is what canu.61.out looks like:

$ less canu.61.out
-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 7' (from 'gnuplot') and image format 'svg'.
-- Detected 40 CPUs and 126 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  59 hosts with  40 cores and  125 GB memory under Slurm control.
-- Found   3 hosts with 120 cores and 1513 GB memory under Slurm control.
-- Found   2 hosts with 120 cores and 1513 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and    6 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  10 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 450000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- Corrected reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.correctedReads.fasta.gz'.
--
--
-- BEGIN TRIMMING
--
--
-- Trimmed reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.trimmedReads.fasta.gz'
--
--
-- BEGIN ASSEMBLY
--
-- Unitigger attempt 1 begins.
----------------------------------------
-- Starting command on Thu May 11 11:17:04 2017 with 1267430.193 GB free disk space

      sbatch \
        -t 2-0  --mem-per-cpu=3200m --cpus-per-task=8 \
        -D `pwd` -J "bat_cargold" \
        -a 1-1 \
        -o /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/4-unitigger/unitigger.%A_%a.out \
        /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/4-unitigger/unitigger.sh 0

Submitted batch job 95871

-- Finished on Thu May 11 11:17:04 2017 (lickety-split) with 1267430.193 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Thu May 11 11:17:04 2017 with 1267430.193 GB free disk space

    sbatch \
      --mem-per-cpu=4g \
      --cpus-per-task=1 \
      -t 2-0  \
      --depend=afterany:95871 \
      -D `pwd` \
      -J "canu_cargold" \
      -o /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/canu-scripts/canu.62.out /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/canu-scripts/canu.62.sh
Submitted batch job 95872

-- Finished on Thu May 11 11:17:04 2017 (lickety-split) with 1267430.193 GB free disk space

Anyhow, I tried to restart after deleting the empty *.out files, and this time the job submitted, then quickly failed with the following output in a new canu.62.out:

$ less cargold.out/canu-scripts/canu.62.out
/var/spool/slurmd/job99096/slurm_script: line 27: unexpected EOF while looking for matching `"'
/var/spool/slurmd/job99096/slurm_script: line 28: syntax error: unexpected end of file

The most recent log files from canu-logs directory looks like this:

-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002 701 May 13 11:17 1494692252_sn-cn-15-3.scinet.ars.usda.gov_31658_bogart
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002 292 May 13 11:17 1494692252_sn-cn-15-3.scinet.ars.usda.gov_31431_canu
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002   0 May 15 11:17 1494865058_sn-cn-15-3.scinet.ars.usda.gov_28191_canu
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002   0 May 16 08:48 1494942527_sn-cn-8-1.scinet.ars.usda.gov_38188_canu
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002   0 May 16 08:52 1494942776_sn-cn-8-1.scinet.ars.usda.gov_2185_canu
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002   0 May 16 08:55 1494942939_sn-cn-8-1.scinet.ars.usda.gov_2701_canu
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002   0 May 17 07:46 1495025169_sn-cn-8-1.scinet.ars.usda.gov_17492_canu

The final non-empty log file looks like this:

$ less cargold.out/canu-logs/1494692252_sn-cn-15-3.scinet.ars.usda.gov_31431_canu

###
###  Reading options from '/software/apps/canu/gcc/64/1.4/bin/canu.defaults'
###

# Add site specific options (for setting up Grid or limiting memory/threads) here.

###
###  Reading options from the command line.
###

genomeSize=450m
-gridOptions=-t 2-0
canuIteration=1
canuIteration=2

Here is what the /unitigging/ directory looks like:

-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002      301 Apr 20 13:05 cargold.gkpStore.gkp
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002      565 Apr 20 13:11 cargold.gkpStore.err
drwxrwsr-x 2 joshua.stein proj-gbru_fy17_002     4096 Apr 20 13:11 cargold.gkpStore
drwxrwsr-x 2 joshua.stein proj-gbru_fy17_002     4096 Apr 20 13:41 0-mercounts
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 Apr 20 14:38 cargold.ovlStore.summary.err
drwxrwsr-x 2 joshua.stein proj-gbru_fy17_002    12288 Apr 20 14:38 1-overlapper
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002     1738 Apr 20 14:38 cargold.ovlStore.summary
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002 21738495 Apr 20 14:38 cargold.ovlStore.per-read.log
drwxrwsr-x 4 joshua.stein proj-gbru_fy17_002    12288 Apr 20 15:16 cargold.ovlStore
drwxrwsr-x 2 joshua.stein proj-gbru_fy17_002    32768 Apr 20 15:16 3-overlapErrorAdjustment
drwxrwsr-x 2 joshua.stein proj-gbru_fy17_002     4096 May 13 11:17 4-unitigger

Here is what the 4-unitigger/ directory looks like:

-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002       95 May 10 18:23 unitigger.95391_1.out
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002       94 May 11 10:24 unitigger.95444_1.out
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002       94 May 11 11:17 unitigger.95871_1.out
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002      326 May 13 11:17 unitigger.jobSubmit.sh
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:17 cargold.001.filterOverlaps.thr000.num000.log
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002    72746 May 13 11:18 cargold.best.spurs
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002     4731 May 13 11:18 unitigger.err
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:18 cargold.best.singletons
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:18 cargold.best.edges.suspicious
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:18 cargold.best.edges.histogram
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:18 cargold.best.edges.gfa
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002 12582912 May 13 11:18 cargold.best.edges
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002        0 May 13 11:18 cargold.best.contains.histogram
-rw-rw-r-- 1 joshua.stein proj-gbru_fy17_002       95 May 15 11:17 unitigger.97050_1.out
-rwxrwxr-x 1 joshua.stein proj-gbru_fy17_002        0 May 15 11:17 unitigger.sh

Here is output from unitigger.97050_1.out:

$ less unitigger.97050_1.out
slurmstepd: *** JOB 97050 CANCELLED AT 2017-05-15T11:17:36 DUE TO TIME LIMIT on sn-cn-15-3 ***

Anyway, I am guessing I need to clean up a few more files and retry? Any advice you have would be most welcome.

Thanks again, Josh

brianwalenz commented 7 years ago

The size-zero files suggest you're hitting a disk quota limit, which is also consistent with exceeding the time limit: the unitigger.sh command (the 'bogart' binary) tried to write files, hit the quota, and stalled for two days. This command should take far less time than that; drosophila needs ten minutes or so.

There's a stray dash in the command (I'm shocked canu didn't complain): "-gridOptions=..." should just be "gridOptions=...". Canu seems to have accepted it; I see "-t 2-0" in the sbatch commands.

steinjosh commented 7 years ago

Yes, you were correct: I had exceeded the disk quota grace period, and now have an extension. Anyhow, the assembly progressed to the 5-consensus stage, where I encountered the following in canu.65.out:

-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 7' (from 'gnuplot') and image format 'svg'.
-- Detected 40 CPUs and 126 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  58 hosts with  40 cores and  125 GB memory under Slurm control.
-- Found   2 hosts with 120 cores and 1513 GB memory under Slurm control.
-- Found   3 hosts with 120 cores and 1513 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and    6 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   25 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  10 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to  10 compute threads and   13 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #3, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 450000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- Corrected reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.correctedReads.fasta.gz'.
--
--
-- BEGIN TRIMMING
--
--
-- Trimmed reads saved in '/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/cargold.trimmedReads.fasta.gz'
--
--
-- BEGIN ASSEMBLY
--
-- Configured 58 contig and 44 unitig consensus jobs.
-- All 102 consensus jobs finished successfully.
----------------------------------------
-- Starting command on Mon May 22 15:27:22 2017 with 1265512.574 GB free disk space

    /software/apps/canu/gcc/64/1.4/bin/tgStoreLoad \
      -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
      -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 2 \
      -L /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/5-consensus/ctgcns.files \
    > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/5-consensus/ctgcns.files.ctgStoreLoad.err 2>&1

-- Finished on Mon May 22 15:28:22 2017 (60 seconds) with 1265508.467 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Mon May 22 15:28:22 2017 with 1265508.467 GB free disk space

    /software/apps/canu/gcc/64/1.4/bin/tgStoreLoad \
      -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
      -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.utgStore 2 \
      -L /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/5-consensus/utgcns.files \
    > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/5-consensus/utgcns.files.utgStoreLoad.err 2>&1

-- Finished on Mon May 22 15:28:35 2017 (13 seconds) with 1265507.588 GB free disk space
----------------------------------------
-- Purging consensus output after loading to ctgStore and/or utgStore.
-- Purged 102 .cns outputs.
----------------------------------------
-- Starting command on Mon May 22 15:28:35 2017 with 1265507.588 GB free disk space

    /software/apps/canu/gcc/64/1.4/bin/tgStoreDump \
      -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore \
      -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 2 \
      -sizes -s 450000000 \
    > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/seqDB.v002.sizes.txt
ERROR: gappedLen = 45541 >= gappedMax = 0
tgStoreDump: stores/tgTig.C:288: void tgTig::buildUngapped(): Assertion `_gappedLen < _gappedMax' failed.

Failed with 'Aborted'

Backtrace (mangled):

/software/apps/canu/gcc/64/1.4/bin/tgStoreDump(_Z17AS_UTL_catchCrashiP7siginfoPv+0x2d)[0x41fc2d]
/lib64/libpthread.so.0(+0xf790)[0x2adf55204790]
/lib64/libc.so.6(gsignal+0x35)[0x2adf55444625]
/lib64/libc.so.6(abort+0x175)[0x2adf55445e05]
/lib64/libc.so.6(+0x2b74e)[0x2adf5543d74e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x2adf5543d810]
/software/apps/canu/gcc/64/1.4/bin/tgStoreDump[0x42a076]
/software/apps/canu/gcc/64/1.4/bin/tgStoreDump[0x40838a]
/software/apps/canu/gcc/64/1.4/bin/tgStoreDump[0x40d018]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2adf55430d5d]
/software/apps/canu/gcc/64/1.4/bin/tgStoreDump[0x402509]

Backtrace (demangled):

[0] /software/apps/canu/gcc/64/1.4/bin/tgStoreDump::AS_UTL_catchCrash(int, siginfo*, void*) + 0x2d  [0x41fc2d]
[1] /lib64/libpthread.so.0::(null) + 0xf790  [0x2adf55204790]
[2] /lib64/libc.so.6::(null) + 0x35  [0x2adf55444625]
[3] /lib64/libc.so.6::(null) + 0x175  [0x2adf55445e05]
[4] /lib64/libc.so.6::(null) + 0x2b74e  [0x2adf5543d74e]
[5] /lib64/libc.so.6::(null) + 0  [0x2adf5543d810]
[6] /software/apps/canu/gcc/64/1.4/bin/tgStoreDump() [0x42a076]
[7] /software/apps/canu/gcc/64/1.4/bin/tgStoreDump() [0x40838a]
[8] /software/apps/canu/gcc/64/1.4/bin/tgStoreDump() [0x40d018]
[9] /lib64/libc.so.6::(null) + 0xfd  [0x2adf55430d5d]
[10] /software/apps/canu/gcc/64/1.4/bin/tgStoreDump() [0x402509]

GDB:

sh: line 4:  8273 Aborted                 /software/apps/canu/gcc/64/1.4/bin/tgStoreDump -G /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.gkpStore -T /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore 2 -sizes -s 450000000 > /project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/cargold.ctgStore/seqDB.v002.sizes.txt

-- Finished on Mon May 22 15:28:36 2017 (1 second) with 1265507.588 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 134.  (rc=34304)
ERROR:
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'failed to generate unitig sizes'.

slurmstepd: Exceeded step memory limit at some point.

Can you recommend any tweaks to the parameters before I restart? As a reminder, here is the command I am using:

canu -p cargold -d cargold.out genomeSize=450m -pacbio-raw ../20170*/*/Analysis_Results/*.fastq gridOptions="-t 2-0"

Grateful for the help so far. Thanks. Josh

brianwalenz commented 7 years ago

Sorry, you got caught by a holiday.

I think you're hitting the same problem you had initially: a block of corrupt on-disk data that v1.4 fails to detect when it loads the data.

Are there still logs in the 5-consensus directory? The results were purged, but I don't remember if the logs are removed too. One explanation would be that a consensus job had a short write but otherwise finished successfully.

The 'slurmstepd: Exceeded step memory limit at some point' warning troubles me. I don't see any impact from this here though.

steinjosh commented 7 years ago

No worries, thanks for getting back. I had previously deleted 4-unitigger, 5-consensus, and the ctgStore and utgStore before restarting about 3 weeks ago. Currently the 5-consensus folder has:

- 102 consensus.*.out files
- ctgcns.files.ctgStoreLoad.err and utgcns.files.utgStoreLoad.err (are these the log files you referred to?)
- ctgcns.files and utgcns.files, which list paths such as "/project/gbru_fy17_002/pacbio_fastq/canu_assembly/cargold.out/unitigging/5-consensus/ctgcns/0055.cns"
- directories ctgcns and utgcns, which are empty!

Thus the files listed in ctgcns.files and utgcns.files don't exist.

Josh

brianwalenz commented 7 years ago

Yup, the .out files are what I'm after. Scan (the end of) those for any errors; `tail -1 *.out` should be enough to show abnormal outputs. Everything should end with "Bye".
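That check can be scripted so nothing is missed across 102 files. A self-contained sketch (the two sample logs here are fabricated for the demo; in practice the loop would run over the real consensus.*.out files):

```shell
# Sketch: flag consensus logs whose final line is NOT the normal "Bye" sign-off.
logs=$(mktemp -d)
printf 'Consensus finished successfully.  Bye.\n' > "$logs/consensus.0001.out"
printf 'Consensus finished successfully.  Bye.\nWARNING: template position not defined.\n' \
  > "$logs/consensus.0002.out"
abnormal=""
for f in "$logs"/consensus.*.out; do
  # keep any log whose last line lacks "Bye" -- those jobs need a closer look
  tail -n 1 "$f" | grep -q 'Bye' || abnormal="$abnormal$f "
done
echo "abnormal: $abnormal"   # only the second (truncated/overrun) log is flagged
```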

The flow of the program here is to compute, in parallel, consensus sequences. These write *.cns files, which are then copied into ctgStore or utgStore. The outputs of the assembler are then generated from those stores. So: what I'm thinking is that either some of the consensus jobs wrote bad outputs, or the copy failed. Both of those errors should have been caught, so we're still kind of fishing around for something that broke.

Instead of that, I'd just rerun the 5-consensus jobs (using the consensus.jobSubmit.sh script in there), then grab an interactive login, rebuild the two stores manually (the two tgStoreLoad commands), and then run the tgStoreDump command. Once that works, you can restart canu and let it finish up. The benefit of running manually is that canu won't delete those *.cns files, which will let us figure out what is breaking if it is still broken.
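Pieced together from the commands shown earlier in this thread's logs, the manual sequence would look roughly like the following. Treat it as a sketch: the flags are copied from the logged tgStoreLoad/tgStoreDump invocations above, with the long absolute paths rewritten relative to the 5-consensus directory.

```shell
cd cargold.out/unitigging/5-consensus
sh consensus.jobSubmit.sh     # resubmit the consensus array jobs
# ...wait for the jobs to finish, then on an interactive node:
tgStoreLoad -G ../cargold.gkpStore -T ../cargold.ctgStore 2 -L ctgcns.files
tgStoreLoad -G ../cargold.gkpStore -T ../cargold.utgStore 2 -L utgcns.files
tgStoreDump -G ../cargold.gkpStore -T ../cargold.ctgStore 2 -sizes -s 450000000 \
  > ../cargold.ctgStore/seqDB.v002.sizes.txt
```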

steinjosh commented 7 years ago

Yeah, so 8 of the 102 *.out files do not end with the "Bye" statement. The funny thing is that all of them actually contain the line 'Consensus finished successfully. Bye.' somewhere before the end, so writing continued after the job declared it was finished. For example:

$ tail -n 2 consensus.101203_89.out
Consensus finished successfully.  Bye.
generatePBDAG()-- WARNING: template position 13064 not defined.

In the most extreme case, the "Bye" statement occurs on line 82429274, but the job continued to write 56823 additional lines before the end of the file.

brianwalenz commented 7 years ago

By now, I hope you either gave up, cursing Canu and the idiots that wrote it, or completed an assembly and are well along to way to ending famine.