marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Can't open 'unitigging/5-consensus/consensus.sh' for writing #1552

Closed quanc1989 closed 4 years ago

quanc1989 commented 5 years ago

Hi, when I run "slurm.scripts" on a grid as follows:

#!/bin/bash
#SBATCH -N1
#SBATCH --qos=debug
#SBATCH --job-name=assemble
#SBATCH --array=10-20

chrom='chr'$SLURM_ARRAY_TASK_ID

for fname in $(ls /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/fasta_split/$chrom/);do

        prefix=${fname%'.fasta'}

        canu -fast \
             -p $prefix \
             -d /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/canu/$chrom/$prefix \
             useGrid=false \
             genomeSize=60k \
             corMhapSensitivity=high \
             corMinCoverage=2 \
             correctedErrorRate=0.105 \
             -nanopore-raw /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/fasta_split/$chrom/$fname
done

and then submit it with

sbatch slurm.scripts

sometimes (roughly half of the time) I get this 'ABORT':


-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'bat' concurrent execution on Thu Nov  7 13:59:20 2019 with 311000.283 GB free disk space (1 processes; 1 concurrently)

    cd unitigging/4-unitigger
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1

-- Finished on Thu Nov  7 13:59:23 2019 (3 seconds) with 311000.284 GB free disk space
----------------------------------------
-- Unitigger finished successfully.
-- Found, in version 1, after unitig construction:
--   contigs:      1 sequences, total length 151885 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  100 sequences, total length 1226730 bp.
--
-- Contig sizes based on genome size 60kbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      151885             1      151885
--     20      151885             1      151885
--     30      151885             1      151885
--     40      151885             1      151885
--     50      151885             1      151885
--     60      151885             1      151885
--     70      151885             1      151885
--     80      151885             1      151885
--     90      151885             1      151885
--    100      151885             1      151885
--    110      151885             1      151885
--    120      151885             1      151885
--    130      151885             1      151885
--    140      151885             1      151885
--    150      151885             1      151885
--    160      151885             1      151885
--    170      151885             1      151885
--    180      151885             1      151885
--    190      151885             1      151885
--    200      151885             1      151885
--    210      151885             1      151885
--    220      151885             1      151885
--    230      151885             1      151885
--    240      151885             1      151885
--    250      151885             1      151885
--
-- Report changed.
-- Finished stage 'unitigCheck', reset canuIteration.
----------------------------------------
-- Starting command on Thu Nov  7 13:59:24 2019 with 311000.284 GB free disk space

    cd unitigging
    /lustre/user/snp/quanc/software/canu-1.9/Linux-amd64/bin/sqStoreCreatePartition \
      -S ../chr10_100020000-100080000.seqStore \
      -T  ./chr10_100020000-100080000.ctgStore 1 \
      -b 15000 \
      -p 8 \
    > ./chr10_100020000-100080000.ctgStore/partitionedReads.log 2>&1

-- Finished on Thu Nov  7 13:59:24 2019 (furiously fast) with 311000.284 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Thu Nov  7 13:59:24 2019 with 311000.284 GB free disk space

    cd unitigging
    /lustre/user/snp/quanc/software/canu-1.9/Linux-amd64/bin/sqStoreCreatePartition \
      -S ../chr10_100020000-100080000.seqStore \
      -T  ./chr10_100020000-100080000.utgStore 1 \
      -b 15000 \
      -p 8 \
> ./chr10_100020000-100080000.utgStore/partitionedReads.log 2>&1

-- Finished on Thu Nov  7 13:59:24 2019 (furiously fast) with 311000.284 GB free disk space
----------------------------------------
-- Using slow alignment for consensus (iteration '0').
-- Configured 1 contig and 1 unitig consensus jobs.
-- No change in report.
-- Finished stage 'consensusConfigure', reset canuIteration.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: cns        1 GB    4 CPUs x   8 jobs     8 GB   32 CPUs  (consensus)
--
-- No change in report.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'cns' concurrent execution on Thu Nov  7 13:59:25 2019 with 311000.28 GB free disk space (2 processes; 8 concurrently)

    cd unitigging/5-consensus
    ./consensus.sh 1 > ./consensus.000001.out 2>&1
    ./consensus.sh 2 > ./consensus.000002.out 2>&1

-- Finished on Thu Nov  7 14:00:38 2019 (73 seconds) with 311000.259 GB free disk space
----------------------------------------

ABORT:
ABORT: Canu 1.9
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   can't open 'unitigging/5-consensus/consensus.sh' for writing: text file busy.

Here is the header of log (one node in grid):

-- Canu 1.9
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_131' (from '/lustre/user/snp/soft/jre1.8.0_131/bin/java') with -d64 support.
-- Detected gnuplot version '5.0 patchlevel 6   ' (from 'gnuplot') and image format 'svg'.
-- Detected 32 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/local/bin/sinfo.
-- Grid engine disabled per useGrid=false option.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl      7 GB    4 CPUs x   8 jobs    56 GB   32 CPUs  (k-mer counting)
-- Local: hap        7 GB    4 CPUs x   8 jobs    56 GB   32 CPUs  (read-to-haplotype assignment)
-- Local: cormhap    6 GB   16 CPUs x   2 jobs    12 GB   32 CPUs  (overlap detection with mhap)
-- Local: obtmhap    6 GB   16 CPUs x   2 jobs    12 GB   32 CPUs  (overlap detection with mhap)
-- Local: utgmhap    6 GB   16 CPUs x   2 jobs    12 GB   32 CPUs  (overlap detection with mhap)
-- Local: cor        8 GB    4 CPUs x   7 jobs    56 GB   28 CPUs  (read correction)
-- Local: ovb        4 GB    1 CPU  x  15 jobs    60 GB   15 CPUs  (overlap store bucketizer)
-- Local: ovs        8 GB    1 CPU  x   7 jobs    56 GB    7 CPUs  (overlap store sorting)
-- Local: red        9 GB    4 CPUs x   7 jobs    63 GB   28 CPUs  (read error detection)
-- Local: oea        8 GB    1 CPU  x   7 jobs    56 GB    7 CPUs  (overlap error adjustment)
-- Local: bat       16 GB    4 CPUs x   1 job     16 GB    4 CPUs  (contig construction with bogart)
-- Local: cns      --- GB    4 CPUs x --- jobs   --- GB  --- CPUs  (consensus)
-- Local: gfa       16 GB    4 CPUs x   1 job     16 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'chr10_100000000-100060000' in '/lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/canu/chr10/chr10_100000000-100060000'
--
-- Parameters:
--
--  genomeSize        60000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1050 ( 10.50%)
--    utgOvlErrorRate 0.1050 ( 10.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1050 ( 10.50%)
--    utgErrorRate    0.1050 ( 10.50%)
--    cnsErrorRate    0.1050 ( 10.50%)

I checked "5-consensus" and didn't find anything abnormal.


Here is the content of "consensus.000002.out":

Found perl:
   /usr/bin/perl

Found java:
   /lustre/user/snp/soft/jre1.8.0_131/bin/java
   java version "1.8.0_131"

Found canu:
   /lustre/user/snp/quanc/software/canu-1.9/Linux-amd64/bin/canu
   Canu 1.9

Running job 2 based on command line options.
-- Opening seqStore '../chr10_100020000-100080000.utgStore/partitionedReads.seqStore' partition 1.
-- Opening tigStore '../chr10_100020000-100080000.utgStore' version 1.
-- Opening output results file './utgcns/0001.cns.WORKING'.
--
-- Computing consensus for b=0 to e=1 with errorRate 0.1050 (max 0.4000) and minimum overlap 40
--
Consensus finished successfully.

Bye.
                           ----------CONTAINED READS----------  -DOVETAIL  READS-
  tigID    length   reads      used coverage  ignored coverage      used coverage
------- --------- -------  -------- -------- -------- --------  -------- --------
      1    151360     150       137   17.01x        0    0.00x        13    5.40x

Processed 1 tig and 0 singletons.

Looking forward to your reply. Thank you!

skoren commented 5 years ago

That seems like an issue with the FS not releasing the running script. What happens if you just re-start the same canu command?
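
Restarting here just means re-running the exact same command with the same -p and -d; canu should detect the existing assembly directory and resume from the last completed stage instead of starting over. A rough sketch for the assembly that failed in the log above (the input FASTA name is inferred from the prefix, so adjust it if it differs):

    # Restart sketch: same -p/-d as the failing run; canu resumes from the
    # existing 'unitigging' directory rather than redoing earlier stages.
    canu -fast \
         -p chr10_100020000-100080000 \
         -d /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/canu/chr10/chr10_100020000-100080000 \
         useGrid=false \
         genomeSize=60k \
         corMhapSensitivity=high \
         corMinCoverage=2 \
         correctedErrorRate=0.105 \
         -nanopore-raw /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/fasta_split/chr10/chr10_100020000-100080000.fasta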

quanc1989 commented 5 years ago

If I don't delete the files in "5-consensus" and just re-start the same canu command, the same "ABORT" appears.

If I delete the files in "5-consensus" and then re-start the same canu command through sbatch on a single grid node, the same "ABORT" appears again.

But if I delete the files in "5-consensus" and re-start the same canu command on the local node (without sbatch), it works.
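
Concretely, the sequence that worked was roughly the following (the canu invocation itself is unchanged from the submission script above, just run directly on the node rather than through sbatch):

    # remove the partially written consensus scripts/outputs for this assembly
    rm -rf /lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/canu/chr10/chr10_100020000-100080000/unitigging/5-consensus
    # then re-run the same canu command as in the submission script, on the local node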

So how can I avoid this problem when I run canu on multiple nodes?

skoren commented 5 years ago

"If I delete the files in ''5-consensus" and then re-start the same canu command through sbatch with a single node in grid, same "ABORT" emerged again."

This is not running on multiple nodes and should be equivalent to a single node without sbatch. At the very least, canu can't tell the difference between those two runs. If it is running differently, I'd expect there is some difference in how the file system is mounted/accessed between your sbatch run and the local run. If you have scratch on the compute nodes, try running on that instead of the shared FS to see if it works. You'd have to work with your cluster IT to figure out what's different in the filesystem between the local node and the sbatch instance; I don't see anything that would fix it in canu.
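
For example, a variant of the submission script that runs each assembly on node-local scratch and copies the results back to Lustre afterwards might look like the sketch below. This is only a sketch: it assumes the compute nodes provide local scratch via $TMPDIR, so substitute whatever local disk your cluster actually offers.

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH --qos=debug
    #SBATCH --job-name=assemble
    #SBATCH --array=10-20

    # Sketch only: assumes node-local scratch is available at $TMPDIR.
    chrom=chr$SLURM_ARRAY_TASK_ID
    indir=/lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/fasta_split/$chrom
    outdir=/lustre/user/snp/quanc/projects/20190426_nanopore_sv/assembly/canu/$chrom

    mkdir -p "$outdir"

    for fname in "$indir"/*.fasta; do
        prefix=$(basename "$fname" .fasta)
        workdir=$TMPDIR/$prefix            # local disk, not the shared Lustre FS

        canu -fast \
             -p "$prefix" \
             -d "$workdir" \
             useGrid=false \
             genomeSize=60k \
             corMhapSensitivity=high \
             corMinCoverage=2 \
             correctedErrorRate=0.105 \
             -nanopore-raw "$fname"

        # copy the finished assembly back to the shared filesystem
        cp -r "$workdir" "$outdir/"
    done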

quanc1989 commented 5 years ago

Thanks a lot! I will try what you suggested.