marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu assembly killed at ovs step #1843

Closed: brettChapman closed this issue 3 years ago

brettChapman commented 3 years ago

Hi

I'm using Canu version 2.1, running on a Slurm-managed cluster of 20 nodes. Each worker node runs Ubuntu 20.04 with 16 cores and 64GB RAM.

I am trialing Canu to assemble a barley genome of approximately 4.5 Gbp (~80% repetitive) using PacBio reads. We're expecting PacBio raw sequences early next year; in the meantime I'm trialing Canu with PacBio reads from this publication: https://www.nature.com/articles/s41597-020-0480-0

Interestingly, Canu only reports 62GB RAM available per node. I've found the assembly has been failing each time at the OVS step:

Looking in the log files here:

tibetanBarley/correction/tibetan_barley.ovlStore.BUILDING/logs/*

I find this error every time.

/var/spool/slurmd/job08677/slurm_script: line 107: 3535481 Killed                  $bin/ovStoreSorter -deletelate -O ./tibetan_barley.ovlStore.BUILDING -S ../tibetan_barley.seqStore -C ./tibetan_barley.ovlStore.config -f -s $jobid -M 62

I have tried different parameters based on other users' experiences, but with no success.

With my first command, the OVS step used 32GB RAM and was then killed:

canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g batMemory=62 batThreads=16 shell='/bin/bash' -pacbio *.fastq

I then tried increasing the memory allocated to OVS:

canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g batMemory=62 batThreads=16 gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" 'ovsMemory=62g' shell='/bin/bash' -pacbio *.fastq

but the job was still killed at the OVS step. I then tried other parameters I found that might limit memory usage:

canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g batMemory=62 batThreads=16 minOverlapLength=1000 corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 256 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" corMaxEvidenceErate=0.15 gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" 'ovsMemory=62g' shell='/bin/bash' -pacbio *.fastq

I'm guessing I've just hit a hard memory limit. Unfortunately, I currently don't have access to any clusters with larger-memory nodes. Is there a way to squeeze the assembly through with further parameter tweaks, or is there no way forward unless I acquire a cluster with more memory?

Thanks for any help you can provide.

skoren commented 3 years ago

Are you sure the job is being killed for memory usage and not something else (like time)? You said the first OVS step used 32GB of RAM, but there's only one OVS step during store building, so I'm not sure what you mean. The sorting can work within an arbitrary memory limit; it's just a matter of how many jobs are created. If it is failing on memory, you can try removing the tibetan_barley/correction/tibetan_barley.ovlStore.BUILDING folder and decreasing ovsMemory, but leave the grid request the same:

canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g batMemory=62 batThreads=16 gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" 'ovsMemory=32g' shell='/bin/bash' -pacbio *.fastq

That would create more, smaller jobs.

P.S. You can't change other parameters like minOverlapLength or the corMhap options at this point of the assembly, since the overlaps are already computed and are just being sorted/stored; you'd have to restart from scratch for those to take effect.

P.P.S. I'd strongly advise against raw PacBio reads and instead recommend a good HiFi library (20kb or so insert size) if you can generate one. We've seen much better plant assemblies from HiFi data than from raw PacBio data recently.

brettChapman commented 3 years ago

Thanks @skoren

I was assuming the limiting factor was memory. Yes, 32GB for the OVS step when running with default parameters, an expected genome size of 4.5 Gbp, and batMemory set to 62GB.

I'll try the command as you suggest and let you know how it goes. Regarding the HiFi library, we're actually going to be receiving HiFi data early next year, so I expect that when I run the assembly proper on our own data we'll get a much better assembly.

brettChapman commented 3 years ago

Hi @skoren

I deleted the BUILDING folder and restarted the Canu job with OVS set to 32GB RAM, by amending the Canu command with 'ovsMemory=32g' and gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g".

Found  20 hosts with  16 cores and   62 GB memory under Slurm control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     31.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   48.000 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       48.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       62.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- In 'tibetan_barley.seqStore', found PacBio CLR reads:
--   PacBio CLR:               64
--
--   Raw:                      1
--
-- Generating assembly 'tibetan_barley' in '/data/Tibetan_hulless_barley_assembly/tibetanBarley':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        4500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/tibetan_barley.ovlStore using:
--     86 buckets
--     86 slices
--        using at most 62 GB memory each
-- Overlap store bucketizer finished.
-- Finished stage 'cor-overlapStoreBucketizerCheck', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'scripts/2-sort.jobSubmit-01.sh' -> job 8864 tasks 1-86.
--
----------------------------------------
-- Starting command on Tue Nov 17 11:50:51 2020 with 77810.218 GB free disk space

    cd /data/Tibetan_hulless_barley_assembly/tibetanBarley
    sbatch \
      --depend=afterany:8864 \
      --cpus-per-task=1 \
      --mem-per-cpu=4g   \
      -D `pwd` \
      -J 'canu_tibetan_barley' \
      -o canu-scripts/canu.08.out  canu-scripts/canu.08.sh
Submitted batch job 8865

However, I'm still getting the killed error in my BUILDING folder logs:

Found perl:
   /usr/bin/perl
   This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/bin/java
   openjdk version "11.0.9.1" 2020-11-04

Found canu:
   /data/canu-2.1/bin/canu
   canu 2.1

Running job 10 based on SLURM_ARRAY_TASK_ID=10 and offset=0.

Attempting to increase maximum allowed processes and open files.
  Changed max processes per user from 128333 to 257158 (max 257158).
  Changed max open files from 1024 to 131072 (max 131072).

Finding overlaps.
  found   19628507 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0001/sliceSizes'.
  found   80994401 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0003/sliceSizes'.
  found  177629411 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0004/sliceSizes'.
  found    9789145 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0005/sliceSizes'.
  found  172960737 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0007/sliceSizes'.
  found   83693679 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0008/sliceSizes'.
  found  293734477 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0009/sliceSizes'.
  found   11883569 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0013/sliceSizes'.
  found    6332754 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0017/sliceSizes'.
  found   89354581 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0021/sliceSizes'.
  found    9929750 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0023/sliceSizes'.
  found    9256311 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0025/sliceSizes'.
  found  108608719 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0027/sliceSizes'.
  found   12890761 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0029/sliceSizes'.
  found   13298594 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0033/sliceSizes'.
  found   12591500 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0037/sliceSizes'.
  found   11052262 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0041/sliceSizes'.
  found   13440876 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0045/sliceSizes'.
  found   13947501 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0049/sliceSizes'.
  found   13978420 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0053/sliceSizes'.
  found  259439799 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0057/sliceSizes'.
  found    9701381 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0060/sliceSizes'.
  found    9187269 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0061/sliceSizes'.
  found  171235816 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0065/sliceSizes'.
  found  201286637 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0067/sliceSizes'.
  found  123275995 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0069/sliceSizes'.
  found  196991874 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0071/sliceSizes'.
  found   55263323 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0073/sliceSizes'.
  found   93134137 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0075/sliceSizes'.
  found   10886416 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0076/sliceSizes'.
  found   10314316 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0077/sliceSizes'.
  found  227801588 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0079/sliceSizes'.
  found    9720528 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0080/sliceSizes'.
  found   10288184 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0081/sliceSizes'.
  found  157996335 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0083/sliceSizes'.
  found   10051997 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0085/sliceSizes'.

Loading 2721571550 overlaps using 60.83 GB of allowed (-M) 62 GB memory.
/var/spool/slurmd/job08875/slurm_script: line 107: 3252756 Killed                  $bin/ovStoreSorter -deletelate -O ./tibetan_barley.ovlStore.BUILDING -S ../tibetan_barley.seqStore -C ./tibetan_barley.ovlStore.config -f -s $jobid -M 62

If it is a memory issue, is there a way to limit the number of overlaps loaded in per job, and simply send out more jobs on the cluster instead of adding more overlaps to a single job? Could increasing the minimum overlap length also help reduce the number of overlaps needing to be loaded in per job? Thanks.

skoren commented 3 years ago

My mistake, I forgot to tell you to delete the configuration; it's still trying to use 62GB of memory, not 32. The giveaway is this line:

Loading 2721571550 overlaps using 60.83 GB of allowed (-M) 62 GB memory.

Delete both the BUILDING folder and the tibetan_barley.ovlStore.config* files and relaunch. You can confirm it is properly using 32GB after that by checking the line I listed above and checking that the 2-sort.sh shell script has the -M 32 parameter, not -M 62.
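A minimal sketch of that cleanup and the post-relaunch check, assuming the assembly directory layout shown in the logs above:

    # remove the partially built store and its sizing configuration so Canu
    # reconfigures the sort jobs with the new ovsMemory value
    cd tibetanBarley/correction
    rm -rf tibetan_barley.ovlStore.BUILDING
    rm -f  tibetan_barley.ovlStore.config*

    # after relaunching Canu, confirm the regenerated sorter script is capped at 32 GB
    grep -- '-M 32' tibetan_barley.ovlStore.BUILDING/scripts/2-sort.sh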

brettChapman commented 3 years ago

Hi @skoren

I've now rerun the job after deleting the BUILDING folder and config files.

I'm still getting a killed error in the logs:

Found perl:
   /usr/bin/perl
   This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/bin/java
   openjdk version "11.0.9.1" 2020-11-04

Found canu:
   /data/canu-2.1/bin/canu
   canu 2.1

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.

Attempting to increase maximum allowed processes and open files.
  Changed max processes per user from 128333 to 257158 (max 257158).
  Changed max open files from 1024 to 131072 (max 131072).

Finding overlaps.
  found  189917770 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0001/sliceSizes'.
  found  243079066 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0002/sliceSizes'.
  found  208919825 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0003/sliceSizes'.
  found  200322163 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0004/sliceSizes'.
  found  128253108 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0005/sliceSizes'.
  found  163999330 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0006/sliceSizes'.
  found  142047448 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0007/sliceSizes'.
  found  134018380 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0008/sliceSizes'.

Loading 1410557090 overlaps using 31.53 GB of allowed (-M) 33 GB memory.
/var/spool/slurmd/job09164/slurm_script: line 107: 993753 Killed                  $bin/ovStoreSorter -deletelate -O ./tibetan_barley.ovlStore.BUILDING -S ../tibetan_barley.seqStore -C ./tibetan_barley.ovlStore.config -f -s $jobid -M 33

And in the canu.out log:

CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '11.0.9.1' (from 'java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 8   ' (from 'gnuplot') and image format 'png'.
-- Detected 16 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
-- 
-- Found  20 hosts with  16 cores and   62 GB memory under Slurm control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     31.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   48.000 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       48.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       62.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- In 'tibetan_barley.seqStore', found PacBio CLR reads:
--   PacBio CLR:               64
--
--   Raw:                      1
--
-- Generating assembly 'tibetan_barley' in '/data/Tibetan_hulless_barley_assembly/tibetanBarley':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        4500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/tibetan_barley.ovlStore using:
--    165 buckets
--    165 slices
--        using at most 33 GB memory each
--

-- Overlap store sorting jobs failed, retry.
--   job correction/tibetan_barley.ovlStore.BUILDING/0001 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0002 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0003 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0004 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0005 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0006 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0007 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0008 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0009 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0010 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0011 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0012 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0013 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0014 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0015 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0016 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0017 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0018 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0019 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0020 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0021 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0022 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0023 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0024 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0025 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0026 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0027 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0028 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0029 FAILED.
--   job correction/tibetan_barley.ovlStore.BUILDING/0030 FAILED.
-- Running jobs.  Second attempt out of 2.
--
-- 'scripts/2-sort.jobSubmit-01.sh' -> job 9328 tasks 1-143.
-- 'scripts/2-sort.jobSubmit-02.sh' -> job 9329 task 146.
-- 'scripts/2-sort.jobSubmit-03.sh' -> job 9330 task 148.
-- 'scripts/2-sort.jobSubmit-04.sh' -> job 9331 task 150.
-- 'scripts/2-sort.jobSubmit-05.sh' -> job 9403 task 155.
--
----------------------------------------
-- Starting command on Thu Nov 19 22:59:02 2020 with 77691.018 GB free disk space

    cd /data/Tibetan_hulless_barley_assembly/tibetanBarley
    sbatch \
      --depend=afterany:9328:9329:9330:9331:9403 \
      --cpus-per-task=1 \
      --mem-per-cpu=5g   \
      -D `pwd` \
      -J 'canu_tibetan_barley' \
      -o canu-scripts/canu.11.out  canu-scripts/canu.11.sh
Submitted batch job 9404

-- Finished on Thu Nov 19 22:59:02 2020 (lickety-split) with 77691.018 GB free disk space

There are roughly twice the number of buckets and slices as before, since the memory has dropped from 62GB to 33GB. Are these buckets and slices all being run on the same node? Is it possible to split them up across more nodes in the cluster? Or do you think it may be a problem other than memory? Thanks.

skoren commented 3 years ago

OK, so now it is properly trying to use only 32GB.

How much memory are these jobs requesting/using on the grid before being killed? Can you run an interactive session reserving 62GB and watch the job's memory as it runs? They shouldn't be running more than one per node, since gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" should request a whole node. Unless memory is not being properly tracked somehow, I would be surprised if a job targeting 32GB is using over 60GB of memory.

We've seen some rare cases of the disk corrupting during bucketizing when the file system can't handle the concurrent I/O. If you find the memory is accurately requested from the grid and the job is indeed trying to use more than 60GB, that is the next likeliest cause, which may require limiting the maximum number of jobs allowed to run concurrently on the grid (via gridEngineArrayOption=" -a ARRAY_JOBS%10", which limits it to 10 jobs).
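A rough sketch of the interactive check, assuming standard Slurm and Linux tools are available on the cluster (srun, watch, ps; sstat only works if job accounting is enabled):

    # reserve a whole node interactively so nothing else competes for its memory
    srun --nodes=1 --cpus-per-task=1 --mem=62g --pty bash

    # on that node, poll the sorter's resident and virtual memory every 10 seconds
    watch -n 10 "ps -o pid,rss,vsz,cmd -C ovStoreSorter"

    # if accounting is enabled, peak usage of a running job can also be queried
    # (the job ID below is a placeholder)
    sstat --allsteps -j <jobid> --format=JobID,MaxRSS,MaxVMSize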

brettChapman commented 3 years ago

Hi @skoren

I just ran ovStoreSorter and monitored the memory. It never went above 33.8GB of memory usage.

I'll try limiting the number of jobs to 10 as you suggest and see how it goes; I'm fairly confident the concurrent I/O is what is killing the assembly. Earlier in the year I was using the Cactus genome alignment tool, which creates thousands of concurrent jobs, and I also needed to limit the number of concurrent jobs there because the file system couldn't handle it. I'll let you know how it goes.

brettChapman commented 3 years ago

Hi @skoren

I ran Canu over the weekend, and it appears even though I've limited the number of concurrent jobs to 10, I still get the same error message:

  found  111216032 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0156/sliceSizes'.
  found    7434837 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0157/sliceSizes'.
  found    8351930 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0158/sliceSizes'.
  found    7385255 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0160/sliceSizes'.
  found    7391431 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0161/sliceSizes'.
  found   14807461 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0162/sliceSizes'.
  found    4396813 overlaps in './tibetan_barley.ovlStore.BUILDING/bucket0163/sliceSizes'.

Loading 1410551091 overlaps using 31.53 GB of allowed (-M) 33 GB memory.
  loading    8038504 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0002/slice0103'.
  loading   11554645 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0003/slice0103'.
  loading    7136083 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0004/slice0103'.
  loading    9265798 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0005/slice0103'.
  loading   11362065 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0007/slice0103'.
  loading    7166748 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0010/slice0103'.
  loading    5710664 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0011/slice0103'.
  loading    9396051 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0012/slice0103'.
  loading    3637609 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0013/slice0103'.
  loading   13441502 overlaps from './tibetan_barley.ovlStore.BUILDING/bucket0015/slice0103'.
/var/spool/slurmd/job09759/slurm_script: line 107: 156206 Killed                  $bin/ovStoreSorter -deletelate -O ./tibetan_barley.ovlStore.BUILDING -S ../tibetan_barley.seqStore -C ./tibetan_barley.ovlStore.config -f -s $jobid -M 33

This error looks a little different: the jobs now make it varying distances through loading slices before failing, whereas previously the job was killed before any overlaps were loaded from a slice.

What if I limited the number of concurrent jobs further, and also increased the RAM to 62GB for those concurrent jobs?

skoren commented 3 years ago

The error doesn't look any different to me, so I don't think changing concurrency will matter. Aren't these jobs already requesting 62GB of memory from your cluster? It doesn't matter to Canu whether you specify 32 or 62 once the number of buckets/sorting jobs is set. The "Killed" message is coming from the cluster. If the jobs are requesting 62GB of memory and you're not seeing them go above 33GB of usage when they get killed, something is going wrong with your grid scheduler.

brettChapman commented 3 years ago

I never saw the process get killed while running it interactively; I could have sat there for hours waiting. For as long as I did watch it run, it never appeared to go over 33GB.

I tried running this command on the node to get more information about the killed processes:

dmesg -T| grep -E -i -B100 'killed process'

Which outputs this:

[Thu Nov 19 13:08:37 2020] [1854862]     0 1854862    58681       74    90112      155             0 switcheroo-cont
[Thu Nov 19 13:08:37 2020] [1855644]   119 1855644     2116       28    53248       78             0 avahi-daemon
[Thu Nov 19 13:08:37 2020] [1855645]   119 1855645     2082        0    53248       81             0 avahi-daemon
[Thu Nov 19 13:08:37 2020] [1855807]     0 1855807     3418       13    69632      146             0 wpa_supplicant
[Thu Nov 19 13:08:37 2020] [1856121]     0 1856121    65204      622   147456      477             0 NetworkManager
[Thu Nov 19 13:08:37 2020] [1856373]     0 1856373    78434      184   114688      459             0 ModemManager
[Thu Nov 19 13:08:37 2020] [1859379]   116 1859379    38231        0    65536       83             0 rtkit-daemon
[Thu Nov 19 13:08:37 2020] [2026562]     0 2026562    62808      103   106496      351             0 upowerd
[Thu Nov 19 13:08:37 2020] [2264154]     0 2264154   510634     1978   495616     6789          -500 dockerd
[Thu Nov 19 13:08:37 2020] [2321082]     0 2321082   861265     2229   557056     2909          -900 snapd
[Thu Nov 19 13:08:37 2020] [2955688]     0 2955688    59327       12    98304      227             0 accounts-daemon
[Thu Nov 19 13:08:37 2020] [3016607]     0 3016607   137587      183   131072      851             0 slurmd
[Thu Nov 19 13:08:37 2020] [3039299]   112 3039299    50619      125    65536      111             0 munged
[Thu Nov 19 13:08:37 2020] [3300027]     0 3300027    20022        7    86016      307         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300032]  1000 3300032     2154       30    57344       74             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300068]  1000 3300068  8725209  3181895 59240448  4201843             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300100]     0 3300100    20022        7    77824      307         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300105]  1000 3300105     2154       30    61440       74             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300140]  1000 3300140  8724874  3017164 61714432  4675359             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300149]     0 3300149    20022        7    81920      307         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300154]  1000 3300154     2154       25    61440       74             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300211]  1000 3300211  8725181  2052524 42803200  3279909             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300232]     0 3300232    20022        7    77824      306         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300237]  1000 3300237     2154       30    61440       75             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300283]  1000 3300283  8725064  2546851 42864640  2793537             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300304]     0 3300304    20022      251    73728       64         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300309]  1000 3300309     2154       30    53248       75             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300351]  1000 3300351  8725114  2371789 31617024  1565316             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300385]     0 3300385    20022      314    81920        0         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300390]  1000 3300390     2154       59    57344       45             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300425]  1000 3300425  8725166  1864692 19329024   538629             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300447]     0 3300447    20022      314    77824        0         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300452]  1000 3300452     2154       63    53248       42             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300493]  1000 3300493  8724457   880001  8876032   219723             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] [3300517]     0 3300517    20022      590    77824        0         -1000 slurmstepd
[Thu Nov 19 13:08:37 2020] [3300522]  1000 3300522     2154      548    53248        0             0 slurm_script
[Thu Nov 19 13:08:37 2020] [3300557]  1000 3300557   293257   233088  1916928        0             0 ovStoreSorter
[Thu Nov 19 13:08:37 2020] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=step_batch,mems_allowed=0,global_oom,task_memcg=/slurm_node-3/uid_1000/job_9264/step_batch,task=ovStoreSorter,pid=3300140,uid=1000
[Thu Nov 19 13:08:37 2020] Out of memory: Killed process 3300140 (ovStoreSorter) total-vm:34899496kB, anon-rss:12068596kB, file-rss:60kB, shmem-rss:0kB, UID:1000 pgtables:60268kB oom_score_adj:0

Based on this, it definitely looks like it runs out of memory at around 33GB. Each node has 64GB RAM, and I also have a swap file of around the same size. The total virtual memory used by ovStoreSorter here is 34.8GB. I've run other jobs with different tools (genome graph construction) which peaked at 144GB virtual memory before being killed, so if it were the scheduler, I'd expect to see similar peak memory usage elsewhere as well. In those cases I was running a single job on a single node, so perhaps it's how Canu is interacting with my Slurm scheduler. I installed Slurm myself (slurmdbd, slurmd, slurmctld, MariaDB, MySQL, etc.) with no prior experience, which I documented in a GitHub repo (https://brettchapman.github.io/Nimbus_Cluster/). So far I've had no issues running my cluster with Slurm, but perhaps I left out a vital component that Canu needs?

skoren commented 3 years ago

Do you still have the gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" parameter on the command line? What is the submit command for the jobs (./tibetan_barley.ovlStore.BUILDING/scripts/2-sort.jobSubmit-01.sh)? If those are there, it should be requesting 62GB of memory, in which case it should definitely not be killed by the scheduler at 34GB.

brettChapman commented 3 years ago

Yes, I still have the gridOptionsovs parameter with --cpus-per-task=1 --mem-per-cpu=62g. The ovsMemory is set to 32GB.

The submit command in that script you mention is:

#!/bin/sh

sbatch \
  --cpus-per-task=1 --mem-per-cpu=35g --cpus-per-task=1 --mem-per-cpu=62g -o logs/2-sort.%A_%a.out \
  -D `pwd` -J "ovS_tibetan_barley" \
  -a 1-165%10 \
  `pwd`/scripts/2-sort.sh 0 \
> ./scripts/2-sort.jobSubmit-01.out 2>&1

Which is weird, because it specifies --mem-per-cpu twice. I'm not sure what impact specifying this twice with two different values would have.

I had a look at the script from when I had specified ovsMemory as 62GB instead of 32GB, and it looks like this:

#!/bin/sh

sbatch \
  --cpus-per-task=1 --mem-per-cpu=64g --cpus-per-task=1 --mem-per-cpu=62g -o logs/2-sort.%A_%a.out \
  -D `pwd` -J "ovS_tibetan_barley" \
  -a 1-86%1 \
  `pwd`/scripts/2-sort.sh 0 \
> ./scripts/2-sort.jobSubmit-01.out 2>&1

It's specifying --mem-per-cpu=64g as well as 62g. The 64g may push it over the limit: even though the nodes have 64GB RAM, htop only shows 62.8GB of memory available.

Could the extra --mem-per-cpu be coming from the initial submission script? How do you usually submit a Canu job to a Slurm-managed cluster?

When I first submit the script to run, it looks like this (submit_canu.sh):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
# allow use of all the memory on the node
#SBATCH --mem=0
# request all CPU cores on the node
#SBATCH --exclusive
#SBATCH --partition=debug

set -o nounset -o errexit -o xtrace

CANUPATH="/data/canu-2.1/bin/"

srun -n 1 ${CANUPATH}canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g batMemory=62 batThreads=16 gridEngineArrayOption=" -a ARRAY_JOBS%10" minOverlapLength=1000 corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 256 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" corMaxEvidenceErate=0.15 gridOptionsovs="--cpus-per-task=1 --mem-per-cpu=62g" 'ovsMemory=32g' shell='/bin/bash' -pacbio *.fastq

I submit like this:

sbatch submit_canu.sh

Could the SBATCH parameters at the top of my submission script be carrying over to the other scripts and causing problems? Does my submission script look OK, or should I add/remove something? Thanks.

skoren commented 3 years ago

It does look like it's requesting 62GB of memory, so if it is being killed while only using 32, that seems like a Slurm issue; I'm not sure there is anything you can do in Canu to fix it. The reason for the two mem-per-cpu parameters is that Canu wants to request 35GB, since it expects the jobs to use about 32, but your gridOptionsovs setting then adds the second request of 62GB. The second parameter overrides the first on Slurm, so that is fine.
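If you want to double-check which request Slurm actually recorded for a running sort task, scontrol (part of any standard Slurm install) will show the effective values; the job and array task IDs below are placeholders:

    scontrol show job <jobid>_<taskid> | grep -Ei 'mem|tres'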

I wonder if your per-CPU resource tracking isn't working correctly. I see you're requesting --mem=0 in the initial command. Try editing the submit script to remove all the CPU and memory options and request resources as you do in the initial submission (--nodes=1 --mem=0 --exclusive), then run the submission shell script by hand from the tibetan_barley.ovlStore.BUILDING folder and see whether the jobs still get killed for memory.
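For example, the edited 2-sort.jobSubmit-01.sh could look roughly like this (a sketch based on the script posted above, not the exact file Canu generates):

#!/bin/sh

sbatch \
  --nodes=1 --mem=0 --exclusive -o logs/2-sort.%A_%a.out \
  -D `pwd` -J "ovS_tibetan_barley" \
  -a 1-165%10 \
  `pwd`/scripts/2-sort.sh 0 \
> ./scripts/2-sort.jobSubmit-01.out 2>&1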

brettChapman commented 3 years ago

Hi @skoren

I tried removing the SBATCH parameters you mentioned, and I also reran without the gridOptionsovs settings for CPU and memory. I've also updated my Slurm configuration: I've increased swap space to 200GB per node and added a line to cgroup.conf to constrain swap space. In addition, I've managed to acquire larger nodes with 128GB RAM; however, rerunning the assembly seems to have used the extra RAM only for the contig construction near the end.
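A cgroup.conf fragment of the kind described might look like this (hypothetical values, not the exact settings used on this cluster; it assumes cgroup-based task containment, TaskPlugin=task/cgroup, is enabled in slurm.conf):

    # constrain each job step's memory and swap usage via cgroups (hypothetical values)
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=50    # percent of allocated memory additionally allowed as swap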

Found   8 hosts with  32 cores and  125 GB memory under Slurm control.
-- Found   4 hosts with  16 cores and   62 GB memory under Slurm control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     31.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   48.000 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    24.000 GB   16 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       41.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      125.000 GB   32 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- In 'tibetan_barley.seqStore', found PacBio CLR reads:
--   PacBio CLR:               64
--
--   Raw:                      1
--
-- Generating assembly 'tibetan_barley' in '/data/Tibetan_hulless_barley_assembly/tibetanBarley':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        4500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)

You'll notice in the attached screenshot that the swap space is being used quite a lot now, which I hadn't noticed before.

[Screenshot: "Screen Shot 2020-12-03 at 9 23 30 am", showing memory and swap usage on the nodes]

I'm happy to report that the assembly has gotten through the OVS step now, and is on to the read error detection.

This is my canu.out log file now:

-- Creating overlap store correction/tibetan_barley.ovlStore using:
--    189 buckets
--    189 slices
--        using at most 29 GB memory each
-- Overlap store sorter finished.
-- Finished stage 'cor-overlapStoreSorterCheck', reset canuIteration.
----------------------------------------
-- Starting command on Thu Dec  3 21:16:30 2020 with 76688.9 GB free disk space

    cd correction
    /data/canu-2.1/bin/ovStoreIndexer \
      -O  ./tibetan_barley.ovlStore.BUILDING \
      -S ../tibetan_barley.seqStore \
      -C  ./tibetan_barley.ovlStore.config \
      -delete \
    > ./tibetan_barley.ovlStore.BUILDING.index.err 2>&1

-- Finished on Thu Dec  3 21:58:04 2020 (2494 seconds) with 76808.6 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Thu Dec  3 21:58:04 2020 with 76808.6 GB free disk space

    cd correction
    /data/canu-2.1/bin/ovStoreDump \
     -S ../tibetan_barley.seqStore \
     -O  ./tibetan_barley.ovlStore \
     -counts \
     > ./tibetan_barley.ovlStore/counts.dat 2> ./tibetan_barley.ovlStore/counts.err

-- Finished on Thu Dec  3 21:58:27 2020 (23 seconds) with 76808.26 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/tibetan_barley.ovlStore' successfully constructed.
-- Found 232432302842 overlaps for 25402950 reads; 3079774 reads have no overlaps.
--
--
-- Purged 2738.375 GB in 1314 overlap output files.
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Set corMinCoverage=4 based on read coverage of 66.
-- Global filter scores will be estimated.
-- Computing correction layouts.
----------------------------------------
-- Starting command on Thu Dec  3 22:04:08 2020 with 79546.984 GB free disk space

    cd correction
    /data/canu-2.1/bin/generateCorrectionLayouts \
      -S ../tibetan_barley.seqStore \
      -O  ./tibetan_barley.ovlStore \
      -C  ./tibetan_barley.corStore.WORKING \
      -eE 0.15 \
      -eC 80 \
    > ./tibetan_barley.corStore.err 2>&1

I'll update again if the assembly finally completes. Thanks.

skoren commented 3 years ago

Any update?

brettChapman commented 3 years ago

Hi @skoren

The assembly completed, but not without a few issues.

I received this error before I updated the parameters and finished the assembly.

BEGIN ASSEMBLY
--
--
-- ERROR:  Read coverage (2.72) lower than allowed.
-- ERROR:    minInputCoverage  = 3
-- ERROR:    stopOnLowCoverage = 3
-- ERROR:
-- ERROR:  This could be caused by an incorrect genomeSize or poor quality
-- ERROR:  reads that could not be sufficiently corrected.
-- ERROR:
-- ERROR:  You can force Canu to continue by decreasing parameters
-- ERROR:  minInputCoverage and/or stopOnLowCoverage.  Be warned that the
-- ERROR:  quality of corrected reads and/or contiguity of contigs will be poor.
-- 

So I reduced minInputCoverage and stopOnLowCoverage to 2. Their defaults are 10; I had hit a similar error previously, which is why I had already lowered them to 3. I now have 3.8 Gbp of unassembled contigs and 439 Mbp of assembled contigs.
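For reference, a minimal sketch of that change, assuming the run is resumed in the same assembly directory (the overlap and correction options from the earlier commands are omitted here):

    canu -p tibetan_barley -d tibetanBarley genomeSize=4.5g \
         minInputCoverage=2 stopOnLowCoverage=2 \
         shell='/bin/bash' -pacbio *.fastq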

I've decided to try running from scratch again with mostly default parameters, because I think the parameters I had chosen just to get it through the assembly, in an attempt to reduce memory consumption, may have impacted the final assembly. Setting the overlap lengths higher, coupled with poor read quality, likely caused the final assembly to be so small.

If I still find the assembly to be on the small side, it’s likely due to the reads being of poor quality. But since we'll be using HiFi data early next year, this step of the assembly may not be troublesome.

Huangtianyu-caas commented 3 years ago

Hi,

I am now facing a somewhat memory-related issue, which is not quite the same as yours.

I ran canu with the following command:

canu -correct genomeSize=580m minReadLength=2000 minOverlapLength=500 corOutCoverage=120 corMinCoverage=0 correctedErrorRate=0.08 useGrid=false

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '10.0.2' (from '/public/home/huangtianyu/miniconda3/bin/java') without -d64 support.
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 96 CPUs and 2020 gigabytes of memory.
-- Detected PBS/Torque '4.2.7' with 'pbsnodes' binary in /opt/gridview/pbs/dispatcher/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     24.000 GB    8 CPUs x  12 jobs   288.000 GB  96 CPUs  (k-mer counting)
-- Local: hap       12.000 GB   24 CPUs x   4 jobs    48.000 GB  96 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   32.000 GB   16 CPUs x   6 jobs   192.000 GB  96 CPUs  (overlap detection with mhap)
-- Local: obtovl    16.000 GB   16 CPUs x   6 jobs    96.000 GB  96 CPUs  (overlap detection)
-- Local: utgovl    16.000 GB   16 CPUs x   6 jobs    96.000 GB  96 CPUs  (overlap detection)
-- Local: cor       16.000 GB    4 CPUs x  24 jobs   384.000 GB  96 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x  96 jobs   384.000 GB  96 CPUs  (overlap store bucketizer)
-- Local: ovs       16.000 GB    1 CPU  x  96 jobs  1536.000 GB  96 CPUs  (overlap store sorting)
-- Local: red       32.000 GB    8 CPUs x  12 jobs   384.000 GB  96 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x  96 jobs   768.000 GB  96 CPUs  (overlap error adjustment)
-- Local: bat      256.000 GB   16 CPUs x   1 job    256.000 GB  16 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)
-- Local: gfa       32.000 GB   16 CPUs x   1 job     32.000 GB  16 CPUs  (GFA alignment and processing)
--
-- In 'Ttri.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'Ttri' in '/vol3/agis/wangguirong_group/huangtianyu/Ttri_assembly/data/clean':
--    - only correct raw reads.
--
-- Parameters:
--
--  genomeSize        580000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0800 (  8.00%)
--    utgOvlErrorRate 0.0800 (  8.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0800 (  8.00%)
--    utgErrorRate    0.0800 (  8.00%)
--    cnsErrorRate    0.0800 (  8.00%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- Creating overlap store correction/Ttri.ovlStore using:
--    227 buckets
--    449 slices
--        using at most 15 GB memory each
-- No change in report.
-- Finished stage 'cor-overlapStoreConfigure', reset canuIteration.
-- No change in report.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Tue Dec 22 16:10:34 2020 with 1084399.949 GB free disk space (227 processes; 96 concurrently)

    cd correction/Ttri.ovlStore.BUILDING
    ./scripts/1-bucketize.sh 1 > ./logs/1-bucketize.000001.out 2>&1
    ./scripts/1-bucketize.sh 2 > ./logs/1-bucketize.000002.out 2>&1
    ./scripts/1-bucketize.sh 3 > ./logs/1-bucketize.000003.out 2>&1
    ./scripts/1-bucketize.sh 4 > ./logs/1-bucketize.000004.out 2>&1
    ./scripts/1-bucketize.sh 5 > ./logs/1-bucketize.000005.out 2>&1
    ./scripts/1-bucketize.sh 6 > ./logs/1-bucketize.000006.out 2>&1
    ./scripts/1-bucketize.sh 7 > ./logs/1-bucketize.000007.out 2>&1
    ./scripts/1-bucketize.sh 8 > ./logs/1-bucketize.000008.out 2>&1
    ./scripts/1-bucketize.sh 9 > ./logs/1-bucketize.000009.out 2>&1
    ./scripts/1-bucketize.sh 10 > ./logs/1-bucketize.000010.out 2>&1
    ./scripts/1-bucketize.sh 11 > ./logs/1-bucketize.000011.out 2>&1
    ./scripts/1-bucketize.sh 12 > ./logs/1-bucketize.000012.out 2>&1
    ./scripts/1-bucketize.sh 13 > ./logs/1-bucketize.000013.out 2>&1
    ./scripts/1-bucketize.sh 14 > ./logs/1-bucketize.000014.out 2>&1
    ./scripts/1-bucketize.sh 15 > ./logs/1-bucketize.000015.out 2>&1
    ./scripts/1-bucketize.sh 16 > ./logs/1-bucketize.000016.out 2>&1
    ./scripts/1-bucketize.sh 17 > ./logs/1-bucketize.000017.out 2>&1
    ./scripts/1-bucketize.sh 18 > ./logs/1-bucketize.000018.out 2>&1
    ./scripts/1-bucketize.sh 19 > ./logs/1-bucketize.000019.out 2>&1
    ./scripts/1-bucketize.sh 20 > ./logs/1-bucketize.000020.out 2>&1
    ./scripts/1-bucketize.sh 21 > ./logs/1-bucketize.000021.out 2>&1
    ./scripts/1-bucketize.sh 22 > ./logs/1-bucketize.000022.out 2>&1
    ./scripts/1-bucketize.sh 23 > ./logs/1-bucketize.000023.out 2>&1
    ./scripts/1-bucketize.sh 24 > ./logs/1-bucketize.000024.out 2>&1
    ./scripts/1-bucketize.sh 25 > ./logs/1-bucketize.000025.out 2>&1
    ./scripts/1-bucketize.sh 26 > ./logs/1-bucketize.000026.out 2>&1
    ./scripts/1-bucketize.sh 27 > ./logs/1-bucketize.000027.out 2>&1
    ./scripts/1-bucketize.sh 28 > ./logs/1-bucketize.000028.out 2>&1
    ./scripts/1-bucketize.sh 29 > ./logs/1-bucketize.000029.out 2>&1
    ./scripts/1-bucketize.sh 30 > ./logs/1-bucketize.000030.out 2>&1
    ./scripts/1-bucketize.sh 31 > ./logs/1-bucketize.000031.out 2>&1
    ./scripts/1-bucketize.sh 32 > ./logs/1-bucketize.000032.out 2>&1
    ./scripts/1-bucketize.sh 33 > ./logs/1-bucketize.000033.out 2>&1
    ./scripts/1-bucketize.sh 34 > ./logs/1-bucketize.000034.out 2>&1
    ./scripts/1-bucketize.sh 35 > ./logs/1-bucketize.000035.out 2>&1
    ./scripts/1-bucketize.sh 36 > ./logs/1-bucketize.000036.out 2>&1
    ./scripts/1-bucketize.sh 37 > ./logs/1-bucketize.000037.out 2>&1
    ./scripts/1-bucketize.sh 38 > ./logs/1-bucketize.000038.out 2>&1
    ./scripts/1-bucketize.sh 39 > ./logs/1-bucketize.000039.out 2>&1
    ./scripts/1-bucketize.sh 40 > ./logs/1-bucketize.000040.out 2>&1
    ./scripts/1-bucketize.sh 41 > ./logs/1-bucketize.000041.out 2>&1
    ./scripts/1-bucketize.sh 42 > ./logs/1-bucketize.000042.out 2>&1
    ./scripts/1-bucketize.sh 43 > ./logs/1-bucketize.000043.out 2>&1
    ./scripts/1-bucketize.sh 44 > ./logs/1-bucketize.000044.out 2>&1
    ./scripts/1-bucketize.sh 45 > ./logs/1-bucketize.000045.out 2>&1
    ./scripts/1-bucketize.sh 46 > ./logs/1-bucketize.000046.out 2>&1
    ./scripts/1-bucketize.sh 47 > ./logs/1-bucketize.000047.out 2>&1
    ./scripts/1-bucketize.sh 48 > ./logs/1-bucketize.000048.out 2>&1
    ./scripts/1-bucketize.sh 49 > ./logs/1-bucketize.000049.out 2>&1
    ./scripts/1-bucketize.sh 50 > ./logs/1-bucketize.000050.out 2>&1
    ./scripts/1-bucketize.sh 51 > ./logs/1-bucketize.000051.out 2>&1
    ./scripts/1-bucketize.sh 52 > ./logs/1-bucketize.000052.out 2>&1
    ./scripts/1-bucketize.sh 53 > ./logs/1-bucketize.000053.out 2>&1
    ./scripts/1-bucketize.sh 54 > ./logs/1-bucketize.000054.out 2>&1
    ./scripts/1-bucketize.sh 55 > ./logs/1-bucketize.000055.out 2>&1
    ./scripts/1-bucketize.sh 56 > ./logs/1-bucketize.000056.out 2>&1
    ./scripts/1-bucketize.sh 57 > ./logs/1-bucketize.000057.out 2>&1
    ./scripts/1-bucketize.sh 58 > ./logs/1-bucketize.000058.out 2>&1
    ./scripts/1-bucketize.sh 59 > ./logs/1-bucketize.000059.out 2>&1
    ./scripts/1-bucketize.sh 60 > ./logs/1-bucketize.000060.out 2>&1
    ./scripts/1-bucketize.sh 61 > ./logs/1-bucketize.000061.out 2>&1
    ./scripts/1-bucketize.sh 62 > ./logs/1-bucketize.000062.out 2>&1
    ./scripts/1-bucketize.sh 63 > ./logs/1-bucketize.000063.out 2>&1
    ./scripts/1-bucketize.sh 64 > ./logs/1-bucketize.000064.out 2>&1
    ./scripts/1-bucketize.sh 65 > ./logs/1-bucketize.000065.out 2>&1
    ./scripts/1-bucketize.sh 66 > ./logs/1-bucketize.000066.out 2>&1
    ./scripts/1-bucketize.sh 67 > ./logs/1-bucketize.000067.out 2>&1
    ./scripts/1-bucketize.sh 68 > ./logs/1-bucketize.000068.out 2>&1
    ./scripts/1-bucketize.sh 69 > ./logs/1-bucketize.000069.out 2>&1
    ./scripts/1-bucketize.sh 70 > ./logs/1-bucketize.000070.out 2>&1
    ./scripts/1-bucketize.sh 71 > ./logs/1-bucketize.000071.out 2>&1
    ./scripts/1-bucketize.sh 72 > ./logs/1-bucketize.000072.out 2>&1
    ./scripts/1-bucketize.sh 73 > ./logs/1-bucketize.000073.out 2>&1
    ./scripts/1-bucketize.sh 74 > ./logs/1-bucketize.000074.out 2>&1
    ./scripts/1-bucketize.sh 75 > ./logs/1-bucketize.000075.out 2>&1
    ./scripts/1-bucketize.sh 76 > ./logs/1-bucketize.000076.out 2>&1
    ./scripts/1-bucketize.sh 77 > ./logs/1-bucketize.000077.out 2>&1
    ./scripts/1-bucketize.sh 78 > ./logs/1-bucketize.000078.out 2>&1
    ./scripts/1-bucketize.sh 79 > ./logs/1-bucketize.000079.out 2>&1
    ./scripts/1-bucketize.sh 80 > ./logs/1-bucketize.000080.out 2>&1
    ./scripts/1-bucketize.sh 81 > ./logs/1-bucketize.000081.out 2>&1
    ./scripts/1-bucketize.sh 82 > ./logs/1-bucketize.000082.out 2>&1
    ./scripts/1-bucketize.sh 83 > ./logs/1-bucketize.000083.out 2>&1
    ./scripts/1-bucketize.sh 84 > ./logs/1-bucketize.000084.out 2>&1
    ./scripts/1-bucketize.sh 85 > ./logs/1-bucketize.000085.out 2>&1
    ./scripts/1-bucketize.sh 86 > ./logs/1-bucketize.000086.out 2>&1
    ./scripts/1-bucketize.sh 87 > ./logs/1-bucketize.000087.out 2>&1
    ./scripts/1-bucketize.sh 88 > ./logs/1-bucketize.000088.out 2>&1
    ./scripts/1-bucketize.sh 89 > ./logs/1-bucketize.000089.out 2>&1
    ./scripts/1-bucketize.sh 90 > ./logs/1-bucketize.000090.out 2>&1
    ./scripts/1-bucketize.sh 91 > ./logs/1-bucketize.000091.out 2>&1
    ./scripts/1-bucketize.sh 92 > ./logs/1-bucketize.000092.out 2>&1
    ./scripts/1-bucketize.sh 93 > ./logs/1-bucketize.000093.out 2>&1
    ./scripts/1-bucketize.sh 94 > ./logs/1-bucketize.000094.out 2>&1
    ./scripts/1-bucketize.sh 95 > ./logs/1-bucketize.000095.out 2>&1
    ./scripts/1-bucketize.sh 96 > ./logs/1-bucketize.000096.out 2>&1
    ./scripts/1-bucketize.sh 97 > ./logs/1-bucketize.000097.out 2>&1
    ./scripts/1-bucketize.sh 98 > ./logs/1-bucketize.000098.out 2>&1
    ./scripts/1-bucketize.sh 99 > ./logs/1-bucketize.000099.out 2>&1
    ./scripts/1-bucketize.sh 100 > ./logs/1-bucketize.000100.out 2>&1
    ./scripts/1-bucketize.sh 101 > ./logs/1-bucketize.000101.out 2>&1
    ./scripts/1-bucketize.sh 102 > ./logs/1-bucketize.000102.out 2>&1
    ./scripts/1-bucketize.sh 103 > ./logs/1-bucketize.000103.out 2>&1
    ./scripts/1-bucketize.sh 104 > ./logs/1-bucketize.000104.out 2>&1
    ./scripts/1-bucketize.sh 105 > ./logs/1-bucketize.000105.out 2>&1
    ./scripts/1-bucketize.sh 106 > ./logs/1-bucketize.000106.out 2>&1
    ./scripts/1-bucketize.sh 107 > ./logs/1-bucketize.000107.out 2>&1
    ./scripts/1-bucketize.sh 108 > ./logs/1-bucketize.000108.out 2>&1
    ./scripts/1-bucketize.sh 109 > ./logs/1-bucketize.000109.out 2>&1
    ./scripts/1-bucketize.sh 110 > ./logs/1-bucketize.000110.out 2>&1
    ./scripts/1-bucketize.sh 111 > ./logs/1-bucketize.000111.out 2>&1
    ./scripts/1-bucketize.sh 112 > ./logs/1-bucketize.000112.out 2>&1
    ./scripts/1-bucketize.sh 113 > ./logs/1-bucketize.000113.out 2>&1
    ./scripts/1-bucketize.sh 114 > ./logs/1-bucketize.000114.out 2>&1
    ./scripts/1-bucketize.sh 115 > ./logs/1-bucketize.000115.out 2>&1
    ./scripts/1-bucketize.sh 116 > ./logs/1-bucketize.000116.out 2>&1
    ./scripts/1-bucketize.sh 117 > ./logs/1-bucketize.000117.out 2>&1
    ./scripts/1-bucketize.sh 118 > ./logs/1-bucketize.000118.out 2>&1
    ./scripts/1-bucketize.sh 119 > ./logs/1-bucketize.000119.out 2>&1
    ./scripts/1-bucketize.sh 120 > ./logs/1-bucketize.000120.out 2>&1
    ./scripts/1-bucketize.sh 121 > ./logs/1-bucketize.000121.out 2>&1
    ./scripts/1-bucketize.sh 122 > ./logs/1-bucketize.000122.out 2>&1
    ./scripts/1-bucketize.sh 123 > ./logs/1-bucketize.000123.out 2>&1
    ./scripts/1-bucketize.sh 124 > ./logs/1-bucketize.000124.out 2>&1
    ./scripts/1-bucketize.sh 125 > ./logs/1-bucketize.000125.out 2>&1
    ./scripts/1-bucketize.sh 126 > ./logs/1-bucketize.000126.out 2>&1
    ./scripts/1-bucketize.sh 127 > ./logs/1-bucketize.000127.out 2>&1
    ./scripts/1-bucketize.sh 128 > ./logs/1-bucketize.000128.out 2>&1
    ./scripts/1-bucketize.sh 129 > ./logs/1-bucketize.000129.out 2>&1
    ./scripts/1-bucketize.sh 130 > ./logs/1-bucketize.000130.out 2>&1
    ./scripts/1-bucketize.sh 131 > ./logs/1-bucketize.000131.out 2>&1
    ./scripts/1-bucketize.sh 132 > ./logs/1-bucketize.000132.out 2>&1
    ./scripts/1-bucketize.sh 133 > ./logs/1-bucketize.000133.out 2>&1
    ./scripts/1-bucketize.sh 134 > ./logs/1-bucketize.000134.out 2>&1
    ./scripts/1-bucketize.sh 135 > ./logs/1-bucketize.000135.out 2>&1
    ./scripts/1-bucketize.sh 136 > ./logs/1-bucketize.000136.out 2>&1
    ./scripts/1-bucketize.sh 137 > ./logs/1-bucketize.000137.out 2>&1
    ./scripts/1-bucketize.sh 138 > ./logs/1-bucketize.000138.out 2>&1
    ./scripts/1-bucketize.sh 139 > ./logs/1-bucketize.000139.out 2>&1
    ./scripts/1-bucketize.sh 140 > ./logs/1-bucketize.000140.out 2>&1
    ./scripts/1-bucketize.sh 141 > ./logs/1-bucketize.000141.out 2>&1
    ./scripts/1-bucketize.sh 142 > ./logs/1-bucketize.000142.out 2>&1
    ./scripts/1-bucketize.sh 143 > ./logs/1-bucketize.000143.out 2>&1
    ./scripts/1-bucketize.sh 144 > ./logs/1-bucketize.000144.out 2>&1
    ./scripts/1-bucketize.sh 145 > ./logs/1-bucketize.000145.out 2>&1
    ./scripts/1-bucketize.sh 146 > ./logs/1-bucketize.000146.out 2>&1
    ./scripts/1-bucketize.sh 147 > ./logs/1-bucketize.000147.out 2>&1
    ./scripts/1-bucketize.sh 148 > ./logs/1-bucketize.000148.out 2>&1
    ./scripts/1-bucketize.sh 149 > ./logs/1-bucketize.000149.out 2>&1
    ./scripts/1-bucketize.sh 150 > ./logs/1-bucketize.000150.out 2>&1
    ./scripts/1-bucketize.sh 151 > ./logs/1-bucketize.000151.out 2>&1
    ./scripts/1-bucketize.sh 152 > ./logs/1-bucketize.000152.out 2>&1
    ./scripts/1-bucketize.sh 153 > ./logs/1-bucketize.000153.out 2>&1
    ./scripts/1-bucketize.sh 154 > ./logs/1-bucketize.000154.out 2>&1
    ./scripts/1-bucketize.sh 155 > ./logs/1-bucketize.000155.out 2>&1
    ./scripts/1-bucketize.sh 156 > ./logs/1-bucketize.000156.out 2>&1
    ./scripts/1-bucketize.sh 157 > ./logs/1-bucketize.000157.out 2>&1
    ./scripts/1-bucketize.sh 158 > ./logs/1-bucketize.000158.out 2>&1
    ./scripts/1-bucketize.sh 159 > ./logs/1-bucketize.000159.out 2>&1
    ./scripts/1-bucketize.sh 160 > ./logs/1-bucketize.000160.out 2>&1
    ./scripts/1-bucketize.sh 161 > ./logs/1-bucketize.000161.out 2>&1
    ./scripts/1-bucketize.sh 162 > ./logs/1-bucketize.000162.out 2>&1
    ./scripts/1-bucketize.sh 163 > ./logs/1-bucketize.000163.out 2>&1
    ./scripts/1-bucketize.sh 164 > ./logs/1-bucketize.000164.out 2>&1
    ./scripts/1-bucketize.sh 165 > ./logs/1-bucketize.000165.out 2>&1
    ./scripts/1-bucketize.sh 166 > ./logs/1-bucketize.000166.out 2>&1
    ./scripts/1-bucketize.sh 167 > ./logs/1-bucketize.000167.out 2>&1
    ./scripts/1-bucketize.sh 168 > ./logs/1-bucketize.000168.out 2>&1
    ./scripts/1-bucketize.sh 169 > ./logs/1-bucketize.000169.out 2>&1
    ./scripts/1-bucketize.sh 170 > ./logs/1-bucketize.000170.out 2>&1
    ./scripts/1-bucketize.sh 171 > ./logs/1-bucketize.000171.out 2>&1
    ./scripts/1-bucketize.sh 172 > ./logs/1-bucketize.000172.out 2>&1
    ./scripts/1-bucketize.sh 173 > ./logs/1-bucketize.000173.out 2>&1
    ./scripts/1-bucketize.sh 174 > ./logs/1-bucketize.000174.out 2>&1
    ./scripts/1-bucketize.sh 175 > ./logs/1-bucketize.000175.out 2>&1
    ./scripts/1-bucketize.sh 176 > ./logs/1-bucketize.000176.out 2>&1
    ./scripts/1-bucketize.sh 177 > ./logs/1-bucketize.000177.out 2>&1
    ./scripts/1-bucketize.sh 178 > ./logs/1-bucketize.000178.out 2>&1
    ./scripts/1-bucketize.sh 179 > ./logs/1-bucketize.000179.out 2>&1
    ./scripts/1-bucketize.sh 180 > ./logs/1-bucketize.000180.out 2>&1
    ./scripts/1-bucketize.sh 181 > ./logs/1-bucketize.000181.out 2>&1
    ./scripts/1-bucketize.sh 182 > ./logs/1-bucketize.000182.out 2>&1
    ./scripts/1-bucketize.sh 183 > ./logs/1-bucketize.000183.out 2>&1
    ./scripts/1-bucketize.sh 184 > ./logs/1-bucketize.000184.out 2>&1
    ./scripts/1-bucketize.sh 185 > ./logs/1-bucketize.000185.out 2>&1
    ./scripts/1-bucketize.sh 186 > ./logs/1-bucketize.000186.out 2>&1
    ./scripts/1-bucketize.sh 187 > ./logs/1-bucketize.000187.out 2>&1
    ./scripts/1-bucketize.sh 188 > ./logs/1-bucketize.000188.out 2>&1
    ./scripts/1-bucketize.sh 189 > ./logs/1-bucketize.000189.out 2>&1
    ./scripts/1-bucketize.sh 190 > ./logs/1-bucketize.000190.out 2>&1
    ./scripts/1-bucketize.sh 191 > ./logs/1-bucketize.000191.out 2>&1
    ./scripts/1-bucketize.sh 192 > ./logs/1-bucketize.000192.out 2>&1
    ./scripts/1-bucketize.sh 193 > ./logs/1-bucketize.000193.out 2>&1
    ./scripts/1-bucketize.sh 194 > ./logs/1-bucketize.000194.out 2>&1
    ./scripts/1-bucketize.sh 195 > ./logs/1-bucketize.000195.out 2>&1
    ./scripts/1-bucketize.sh 196 > ./logs/1-bucketize.000196.out 2>&1
    ./scripts/1-bucketize.sh 197 > ./logs/1-bucketize.000197.out 2>&1
    ./scripts/1-bucketize.sh 198 > ./logs/1-bucketize.000198.out 2>&1
    ./scripts/1-bucketize.sh 199 > ./logs/1-bucketize.000199.out 2>&1
    ./scripts/1-bucketize.sh 200 > ./logs/1-bucketize.000200.out 2>&1
    ./scripts/1-bucketize.sh 201 > ./logs/1-bucketize.000201.out 2>&1
    ./scripts/1-bucketize.sh 202 > ./logs/1-bucketize.000202.out 2>&1
    ./scripts/1-bucketize.sh 203 > ./logs/1-bucketize.000203.out 2>&1
    ./scripts/1-bucketize.sh 204 > ./logs/1-bucketize.000204.out 2>&1
    ./scripts/1-bucketize.sh 205 > ./logs/1-bucketize.000205.out 2>&1
    ./scripts/1-bucketize.sh 206 > ./logs/1-bucketize.000206.out 2>&1
    ./scripts/1-bucketize.sh 207 > ./logs/1-bucketize.000207.out 2>&1
    ./scripts/1-bucketize.sh 208 > ./logs/1-bucketize.000208.out 2>&1
    ./scripts/1-bucketize.sh 209 > ./logs/1-bucketize.000209.out 2>&1
    ./scripts/1-bucketize.sh 210 > ./logs/1-bucketize.000210.out 2>&1
    ./scripts/1-bucketize.sh 211 > ./logs/1-bucketize.000211.out 2>&1
    ./scripts/1-bucketize.sh 212 > ./logs/1-bucketize.000212.out 2>&1
    ./scripts/1-bucketize.sh 213 > ./logs/1-bucketize.000213.out 2>&1
    ./scripts/1-bucketize.sh 214 > ./logs/1-bucketize.000214.out 2>&1
    ./scripts/1-bucketize.sh 215 > ./logs/1-bucketize.000215.out 2>&1
    ./scripts/1-bucketize.sh 216 > ./logs/1-bucketize.000216.out 2>&1
    ./scripts/1-bucketize.sh 217 > ./logs/1-bucketize.000217.out 2>&1
    ./scripts/1-bucketize.sh 218 > ./logs/1-bucketize.000218.out 2>&1
    ./scripts/1-bucketize.sh 219 > ./logs/1-bucketize.000219.out 2>&1
    ./scripts/1-bucketize.sh 220 > ./logs/1-bucketize.000220.out 2>&1
    ./scripts/1-bucketize.sh 221 > ./logs/1-bucketize.000221.out 2>&1
    ./scripts/1-bucketize.sh 222 > ./logs/1-bucketize.000222.out 2>&1
    ./scripts/1-bucketize.sh 223 > ./logs/1-bucketize.000223.out 2>&1
    ./scripts/1-bucketize.sh 224 > ./logs/1-bucketize.000224.out 2>&1
    ./scripts/1-bucketize.sh 225 > ./logs/1-bucketize.000225.out 2>&1
    ./scripts/1-bucketize.sh 226 > ./logs/1-bucketize.000226.out 2>&1
    ./scripts/1-bucketize.sh 227 > ./logs/1-bucketize.000227.out 2>&1

-- Finished on Tue Dec 22 20:29:50 2020 (15556 seconds, fashionably late) with 1084710.268 GB free disk space
----------------------------------------
--
-- Overlap store bucketizer jobs failed, retry.
--   job correction/Ttri.ovlStore.BUILDING/bucket0022 FAILED.
--
-- No change in report.
--
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Tue Dec 22 20:29:51 2020 with 1084710.268 GB free disk space (1 processes; 96 concurrently)

    cd correction/Ttri.ovlStore.BUILDING
    ./scripts/1-bucketize.sh 22 > ./logs/1-bucketize.000022.out 2>&1

-- Finished on Tue Dec 22 20:30:49 2020 (58 seconds) with 1084651.043 GB free disk space
----------------------------------------
--
-- Overlap store bucketizer jobs failed, tried 2 times, giving up.
--   job correction/Ttri.ovlStore.BUILDING/bucket0022 FAILED.
--

ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

And the report in correction/Ttri.ovlStore.BUILDING/logs/1-bucketize.000022.out is:

Found perl:
   /public/home/huangtianyu/miniconda3/bin/perl
   This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-linux-thread-multi

Found java:
   /public/home/huangtianyu/miniconda3/bin/java
   openjdk version "10.0.2" 2018-07-17

Found canu:
   /public/home/huangtianyu/miniconda3/envs/canu/bin/canu
   Canu 2.0

Running job 22 based on command line options.

Attempting to increase maximum allowed processes and open files.
  Changed max processes per user from 1024 to 16544171 (max 16544171).
  Changed max open files from 1024 to 4096 (max 4096).

Overwriting incomplete result from presumed crashed job in directory './Ttri.ovlStore.BUILDING/create0022'.
sqStore_loadMetadata()-- Using 'raw' 0x01 reads.

Opened '../Ttri.seqStore' with 7931390 reads.

Constructing slice 22 for store './Ttri.ovlStore.BUILDING'.
 - Filtering overlaps over 1.0000 fraction error.

Bucketizing input    1 out of    1 - '1-overlapper/results/000022.ovb'
Failed to open './Ttri.ovlStore.BUILDING/create0022/slice0156' for writing: Cannot allocate memory

So, it seems that the problem is associated with memory. Could you please help me with this issue? Many thanks!

My best, Tianyu

skoren commented 3 years ago

@brettChapman OK, glad you finished the assembly. I wouldn't waste much time trying to optimize this assembly given that you'll be using completely different data in the future. I'd see if you can track down why the memory request wasn't being handled properly on the system, as that can cause errors in the future.

@Huangtianyu-caas This is unlikely to be related, as this issue was solely due to grid configuration. I see you opened a new issue, so I will close this one since the original question is resolved.