marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Overlapper step failed twice #2306

Closed: lilypeck closed this issue 2 months ago

lilypeck commented 2 months ago

Hello

I am using canu to assemble nanopore reads for a genome which is 830 Mb.

My canu command is

INFILE=barcode02
../canu/canu-2.2/bin/canu \
  useGrid=true gridEngineResourceOption="-pe shared 5 -l h_rt=24:00:00,h_data=8G" \
  -p ${INFILE} genomeSize=830m \
  -nanopore /u/home/l/ldpeck/project-vlsork/longreads/fastq/${INFILE}_ALLpass.fastq.gz

The version is canu 2.2

I am running canu on a Linux compute cluster with more than 1,300 64-bit nodes and over 21,000 cores, with an aggregate of over 50 TB of memory.

I have now run the same command twice, and both times it failed on the overlapper step. I have attached the .report file, the canu.12.out script file, and the last log file. The bottom rows of the .out script file for both failed runs are

--   job correction/1-overlapper/results/000119.ovb FAILED.
--   job correction/1-overlapper/results/000120.ovb FAILED.
--   job correction/1-overlapper/results/000121.ovb FAILED.
--

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.

Do you know why the overlapper step is failing? Am I requesting enough resources?

Thanks

Lily

canu.12.out.txt 1712859421_n7022_13766_canu.txt barcode02.report.txt

skoren commented 2 months ago

The fact that all your jobs are failing most likely implies a problem with the JVM (see #1863, #2112, or #2111) or interaction between the grid and the JVM. Post the log from one of the failing jobs (correction/1-overlapper/*71*out) which has more details.

lilypeck commented 2 months ago

Thank you for the quick response.

Attached is an .out file.

It looks like it has been cut off; could this be due to my 24:00:00 time limit?

tail correction/1-overlapper/mhap.118.out 
Time (s) to score, hash to-file, and output: 6355.320719013001
Opened fasta file /u/project/vlsork/ldpeck/longreads/barcode02/correction/1-overlapper/blocks/000051.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...

mhap.118.out.txt

skoren commented 2 months ago

It looks like this run has been restarted many more times than the two attempts canu itself would have made. Have you launched this assembly multiple times? Are there any canu jobs currently running on the grid? The end of the file isn't consistent between the restarts, so I wouldn't think it's the 24-hr limit but something else (like launching the job again while it's still running).

lilypeck commented 2 months ago

I have only launched it twice, but it has been running for a while (I restarted it on April 5th and it finished today). I was also originally running it on a different sample at the same time, so maybe that confused things. Shall I delete all the intermediate files and try relaunching it?

skoren commented 2 months ago

As long as you're providing different -d paths to multiple runs, they won't interfere with each other. However, I don't see a -d option in your command, so if you ran multiple commands with different -p values in the same directory, they likely did interfere. Specify both a -d and a -p (you can use the same value for both) and try again from scratch.
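A sketch of how the original command from this thread could be adjusted to follow that advice, adding an explicit -d assembly directory alongside -p (paths and resource options are copied from the earlier post; adjust for your setup):

```shell
# Illustrative rework of the command above: -d and -p are both set,
# using the same value, so each sample gets its own assembly directory.
INFILE=barcode02
../canu/canu-2.2/bin/canu \
  useGrid=true gridEngineResourceOption="-pe shared 5 -l h_rt=24:00:00,h_data=8G" \
  -d ${INFILE} -p ${INFILE} genomeSize=830m \
  -nanopore /u/home/l/ldpeck/project-vlsork/longreads/fastq/${INFILE}_ALLpass.fastq.gz
```

With a distinct -d per sample, a second sample run in parallel writes into its own directory instead of sharing intermediate files.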

lilypeck commented 2 months ago

Okay, thank you. I will let you know how it goes.

JWDebler commented 2 months ago

I am having the same problem, but I'm running on a local computer. It's running as part of a Nextflow pipeline using a canu 2.2 image. I've only seen it fail before when the input reads have much higher coverage than the genome (last time ~1000x); this run has ~400x coverage.

Here's the output of one of the failed overlaps:

cat mhap.000154.out

Found perl:
   /usr/local/bin/perl
   This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Found java:
   /usr/local/bin/java
   openjdk version "11.0.9.1-internal" 2020-11-04

Found canu:
   /usr/local/bin/canu
   canu 2.2

Running job 154 based on command line options.
Fetch blocks/000061.dat
Fetch blocks/000062.dat

Running block 000060 in query 000154

mkfifo: cannot create fifo '000154-pipe': Operation not supported
ERROR:  invalid arg '000154-pipe'
usage: /usr/local/bin/mhapConvert -S seqStore -o output.ovb input.mhap[.gz]
  Converts mhap native output to ovb
    -minlength X    discards overlaps below X bp long.
ERROR:  no overlap files supplied
Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 256
--num-min-matches = 3
--num-threads = 16
--ordered-kmer-size = 14
--ordered-sketch-size = 1000
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.8
--version = false
-f =
-h = false
-k = 16
-p =
-q = queries/000154
-s = ./blocks/000060.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Current # sequences stored: 30000...
Current # sequences stored: 35000...
Current # sequences stored: 40000...
Current # sequences stored: 45000...
Stored 48814 sequences in the index.
Processed 48814 unique sequences (fwd and rev).
Time (s) to read and hash from file: 2.9753876790000002
Time (s) to score and output to self: 8.771171971000001
Opened fasta file /mnt/i/2024-04-09_Huyen/work/19/44d6fbe068cd42070c24c18153e810/N3/correction/1-overlapper/blocks/000061.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Processed 24568 to sequences.
Time (s) to score, hash to-file, and output: 14.190479572000001
Opened fasta file /mnt/i/2024-04-09_Huyen/work/19/44d6fbe068cd42070c24c18153e810/N3/correction/1-overlapper/blocks/000062.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Processed 22503 to sequences.
Time (s) to score, hash to-file, and output: 10.631794417
Total scoring time (s): 33.616169825
Total time (s): 36.592207558000005
MinHash search time (s): 47.586744411000005
Total matches found: 6803961
Average number of matches per lookup: 95.18958280869639
Average number of table elements processed per lookup: 5353.361369932007
Average number of table elements processed per match: 56.23894140486696
Average % of hashed sequences hit per lookup: 8.130418856063843
Average % of hashed sequences hit that are matches: 2.3984579392192376
Average % of hashed sequences fully compared that are matches: 97.39085763132739

and here for one that worked:

cat mhap.000156.out

Found perl:
   /usr/local/bin/perl
   This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Found java:
   /usr/local/bin/java
   openjdk version "11.0.9.1-internal" 2020-11-04

Found canu:
   /usr/local/bin/canu
   canu 2.2

Running job 156 based on command line options.

Running block 000062 in query 000156

mkfifo: cannot create fifo '000156-pipe': Operation not supported
Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 256
--num-min-matches = 3
--num-threads = 16
--ordered-kmer-size = 14
--ordered-sketch-size = 1000
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.8
--version = false
-f =
-h = false
-k = 16
-p =
-q = queries/000156
-s = ./blocks/000062.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Current # sequences stored: 30000...
Current # sequences stored: 35000...
Current # sequences stored: 40000...
Current # sequences stored: 45000...
Stored 45006 sequences in the index.
Processed 45006 unique sequences (fwd and rev).
Time (s) to read and hash from file: 2.6082312940000003
Time (s) to score and output to self: 6.021960087
Total scoring time (s): 6.025997296000001
Total time (s): 8.634841948
MinHash search time (s): 12.860106118000001
Total matches found: 937273
Average number of matches per lookup: 41.65102430787006
Average number of table elements processed per lookup: 4338.265564591387
Average number of table elements processed per match: 104.1574759968547
Average % of hashed sequences hit per lookup: 6.849936882708943
Average % of hashed sequences hit that are matches: 1.351041534848153
Average % of hashed sequences fully compared that are matches: 97.42020781774735
lilypeck commented 2 months ago

Hi @skoren, I think wiping the files and restarting with the -d flag has worked: the mhap.XX.out files now only go up to mhap.10.out, and the most recent canu.out file says

-- BEGIN CORRECTION
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
-- All 53 mhap precompute jobs finished successfully.
-- Finished stage 'cor-mhapPrecomputeCheck', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'mhap.jobSubmit-01.sh' -> job 2656968 tasks 1-131.
--
----------------------------------------
-- Starting command on Thu Apr 11 22:05:04 2024 with 4877.973 GB free disk space

    cd /u/project/vlsork/ldpeck/longreads/canu/barcode02
    qsub \
      -hold_jid 2656968 \
      -pe shared 5 \
      -l h_rt=24:00:00,h_data=8G   \
      -cwd \
      -N 'canu_barcode02' \
      -j y \
      -o canu-scripts/canu.05.out  canu-scripts/canu.05.sh
Your job 2656969 ("canu_barcode02") has been submitted

Thank you!!

skoren commented 2 months ago

Thanks for the update @lilypeck

@JWDebler this doesn't look related; it seems like your system doesn't support named pipes, see #2167. I'd guess you are on a VM or on Windows. You can add mhapPipe=False to your command, remove all files from the assembly, and re-run.
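A quick way to confirm this diagnosis before re-running: try creating a FIFO in the working directory, which is what mhap's wrapper does with mkfifo. This probe is not part of canu, just a minimal sketch; on WSL2, DrvFs mounts under /mnt typically reject mkfifo while native ext4 paths accept it.

```shell
# Probe whether the current filesystem supports named pipes (FIFOs).
pipe="fifo-probe-$$"
if mkfifo "$pipe" 2>/dev/null; then
  echo "FIFOs supported here"
  rm -f "$pipe"
else
  echo "FIFOs NOT supported here; use mhapPipe=False"
fi
```

Running this inside the assembly directory tells you whether the mkfifo failure is a filesystem limitation (use mhapPipe=False) or something else.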

JWDebler commented 2 months ago

@skoren correct, it's running on WSL2. Adding mhapPipe=False fixed it. Thanks :-)