marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Get polyG/A/T reads after trimming #308

Closed WeichenZhou closed 7 years ago

WeichenZhou commented 7 years ago

Dear Sir, I just started to use Canu and want to analyze human genome PacBio data, so I ran the steps one by one. Correcting:

./canu useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=3:ppn=4" gridEngineMemoryOption="-l pmem=4gb,qos=flux,walltime=100:00:00" gridOptions="-m abe -V -d . " -correct -d 16.12.05.canu_consensus -p wash genomeSize=3g -pacbio-raw SRR1.fastq

Then trimming:

./canu useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test_trim1" gridEngineThreadsOption="-l nodes=3:ppn=4" gridEngineMemoryOption="-l pmem=4gb,qos=flux,walltime=100:00:00" gridOptions="-m abe -V -d ." -trim -d 16.12.05.canu_consensus -p wash genomeSize=3g -pacbio-corrected wash.correctedReads.fasta.1.gz

The corrected fasta.gz looks pretty good and contains about 20k reads for this SRR1.fastq file. The trimmed fasta.gz, however, contains only polyT/A/G reads, and the count drops to 9k. Note that SRR1.fastq is just one of 200 human genome PacBio sequencing files and has 50k raw reads. Have you ever seen this problem before? I would appreciate any advice on how to handle it. Thank you so much!

Arthur

brianwalenz commented 7 years ago

Correction works by building a consensus sequence for each read - think of it as a mini assembly for each read. By using 1/200 of the data, you're only correcting the repeats (or, more likely, just sequencing noise). You need to correct all the data at the same time.
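
For example, all the inputs can go in one command. This is only a sketch reusing the options from your correction command; it assumes all 200 SRR*.fastq files sit in the working directory so the shell glob expands to all of them (canu accepts multiple input files after -pacbio-raw):

    ./canu useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" \
        gridEngineThreadsOption="-l nodes=3:ppn=4" \
        gridEngineMemoryOption="-l pmem=4gb,qos=flux,walltime=100:00:00" \
        gridOptions="-m abe -V -d ." \
        -correct -d 16.12.05.canu_consensus -p wash genomeSize=3g \
        -pacbio-raw SRR*.fastq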

WeichenZhou commented 7 years ago

Thank you so much for your kind reply. That must be the reason. So I have one more question: if I run the analysis in one command (not step by step), will it still leave the intermediate result files, e.g. correctedReads.fasta.gz and trimmedReads.fasta.gz? Thank you so much! -Arthur

WeichenZhou commented 7 years ago

So I ran all 200 files at once, but it produced no results, and the canu.01.out file said this:


-- Canu v1.3 (+367 commits) r7983 fd38ffccd8674d78312d343b0717e0d6b55d823d.
-- Detected Java(TM) Runtime Environment '1.8.0_111' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 2' (from 'gnuplot') and image format 'png'.
-- Detected 16 CPUs and 63 gigabytes of memory.
-- Detecting PBS/Torque resources.
pbsnodes: /nfs/turbo/remillsdata/local/apps/images/anaconda/2/lib/liblzma.so.5: no version information available (required by /lib64/libselinux.so.1)
-- 
-- Found   4 hosts with  16 cores and  125 GB memory under PBS/Torque control.
-- Found 467 hosts with  16 cores and   62 GB memory under PBS/Torque control.
-- Found 206 hosts with  24 cores and  125 GB memory under PBS/Torque control.
-- Found   5 hosts with   8 cores and   47 GB memory under PBS/Torque control.
-- Found   1 host  with  20 cores and   78 GB memory under PBS/Torque control.
-- Found   4 hosts with  40 cores and 1007 GB memory under PBS/Torque control.
-- Found 150 hosts with  20 cores and   94 GB memory under PBS/Torque control.
-- Found   3 hosts with  20 cores and  251 GB memory under PBS/Torque control.
-- Found   4 hosts with  32 cores and  503 GB memory under PBS/Torque control.
-- Found   1 host  with  12 cores and   43 GB memory under PBS/Torque control.
-- Found   5 hosts with  32 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1464 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1495 GB memory under PBS/Torque control.
-- Found   6 hosts with  20 cores and   62 GB memory under PBS/Torque control.
-- Found   5 hosts with  56 cores and 1511 GB memory under PBS/Torque control.
-- Found 395 hosts with  12 cores and   47 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  251 GB memory under PBS/Torque control.
--
-- Allowed to run under grid control, and use up to  16 compute threads and  251 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   78 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  20 compute threads and   78 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #3, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 3000000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
ls: /nfs/turbo/remillsdata/local/apps/images/anaconda/2/lib/liblzma.so.5: no version information available (required by /lib64/libselinux.so.1)
--
-- meryl failed.
--
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'failed to generate mer counts.  Made 2 attempts, jobs still failed'.

Do you have any idea why this happened? Thank you so much for your time.

-Arthur

brianwalenz commented 7 years ago

To answer the earlier question: yes, the full run will have the same outputs as -correct and -trim.
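
For reference, with the -d and -p values used above, those outputs should appear as 16.12.05.canu_consensus/wash.correctedReads.fasta.gz and 16.12.05.canu_consensus/wash.trimmedReads.fasta.gz (assuming the default output naming).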

The failure looks like canu is not able to run pbsnodes or ls. I'd guess that PBS is failing to set up the environment correctly - LD_LIBRARY_PATH, maybe? You can use the canu option gridOptions to pass options to the PBS submit command, but I don't know PBS well enough to suggest specific options to try.

http://canu.readthedocs.io/en/stable/parameter-reference.html#grid-options
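
If it is LD_LIBRARY_PATH, one thing to try (a sketch only; the path here is a placeholder for wherever a working liblzma lives on your cluster) is exporting the variable explicitly through qsub's -v switch via gridOptions:

    gridOptions="-m abe -V -d . -v LD_LIBRARY_PATH=/usr/lib64"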

WeichenZhou commented 7 years ago

Thank you so much for your kind reply. Actually, I don't think the line

ls: /nfs/turbo/remillsdata/local/apps/images/anaconda/2/lib/liblzma.so.5: no version information available (required by /lib64/libselinux.so.1)

is the real problem, because every command I run prints this line - there is a configuration problem on our platform. Other than that, could it be a memory issue? I am using the same command I mentioned before:

./canu useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=3:ppn=4" gridEngineMemoryOption="-l pmem=4gb,qos=flux,walltime=100:00:00" gridOptions="-m abe -V -d . " -correct -d 16.12.05.canu_consensus -p wash genomeSize=3g -pacbio-raw SRR1.fastq

The first try used just one SRR.fastq file; now I am using all 200 SRR*.fastq files. I will experiment more with gridOptions. Thank you so much for your time!! -Arthur

WeichenZhou commented 7 years ago

And I reran the job; still the same error:

...
-- meryl failed.
--
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'failed to generate mer counts.  Made 2 attempts, jobs still failed'.

brianwalenz commented 7 years ago

What is in correction/0-mercounts/*.err (or possibly *.out)?

WeichenZhou commented 7 years ago

There is no .err file. The correction/0-mercounts/meryl.1.out-1 looks like this:

Computing 60 segments using 20 threads and 63880MB memory (67953MB if in one batch).
  numMersActual      = 183210577514
  mersPerBatch       = 4485493555
  basesPerBatch      = 3055841489
  numBuckets         = 134217728 (27 bits)
  bucketPointerWidth = 32
  merDataWidth       = 5
Computing segment 7 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 49 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 22 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 19 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 10 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 4 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 1 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 13 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
Computing segment 16 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
 Allocating 512MB for counting the size of each bucket.
Computing segment 25 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
Computing segment 28 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 31 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
Computing segment 34 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 37 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 40 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 43 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
Computing segment 46 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
 Allocating 512MB for counting the size of each bucket.
Computing segment 52 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 55 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 58 of 60.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
 Allocating 512MB for counting the size of each bucket.
 Allocating 512MB for counting the size of each bucket.
 Counting mers in buckets:    2.10 Mmers --  2.63 Mmers/second
 Counting mers in buckets:    4.19 Mmers --  4.00 Mmers/second
 Counting mers in buckets:    6.29 Mmers --  3.48 Mmers/second
 ...
 Counting mers in buckets:    8.39 Mmers --  3.44 Mmers/second
 Counting mers in buck

It looks a little weird, and the file ends mid-line with the pattern Counting mers in buckets: 16.78 Mmers -- 1.89 Mmers/second^M at the end. Thank you so much. -Arthur

brianwalenz commented 7 years ago

I don't see any errors in there. Was the job killed by PBS for exceeding memory or time limits?

If you edit meryl.sh to remove the '-v' switch, it will stop reporting progress, which might reveal the error. You can also try resubmitting meryl.sh by hand (see scripts/canu.*.out for the submission command) or even running it manually.

Finally, what is in the 0-mercounts directory? This should indicate (to me at least) what stage the meryl program made it to. It has three stages: an initial parallel scan of the reads, a parallel compute which again scans all the reads but now uses lots of memory, and lastly a sequential merge of the results. This equates to: (1) lots of I/O, (2) lots of I/O and lots of memory, (3) lots of time.
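
To run the first meryl job by hand, something like this should work (a sketch; it assumes meryl.sh takes the job index as its first argument, the same way the other canu job scripts in this run do):

    cd correction/0-mercounts
    # after removing the -v progress flag from meryl.sh:
    sh meryl.sh 1 > meryl.1.retry.out 2>&1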

WeichenZhou commented 7 years ago

Thank you so much for your reply. In the 0-mercounts directory, there are only 3 files: meryl.1.out-1, meryl.jobSubmit.sh, and meryl.sh. From my PBS reports, it looks like Canu ran and produced meryl.$PBS_ARRAYID.out-1, then tried to run again and failed, writing canu.01.out. Canu then tried to submit the jobs one more time, failed, and produced meryl.$PBS_ARRAYID.out-1 and canu.01.out again. It took a couple of hours to get meryl.$PBS_ARRAYID.out-1, but canu.01.out appeared only seconds later. I am not sure at which stage it failed, but I would like to try a larger memory allocation. -Arthur

WeichenZhou commented 7 years ago

Dear authors, I assigned larger memory and it kept running. After 12 days, Canu seems to have stopped. The most recently updated files are in ~/correction/1-overlapper/results, with n.ovb and n.counts files. There is also a ~/correction/wash.ovlStore.BUILDING directory, which contains bucket0n folders and results. Do you know why this happened? And is there a way to resume Canu, since I don't want to waste these 12 days of results? Thank you so much, and merry Christmas! -Arthur

skoren commented 7 years ago

What is the output in the last file in canu-scripts/canu.*.out? That should give a reason for the failure. If wash.ovlStore.BUILDING has only bucket0n folders but no ovs files, it is safe to re-run the same Canu command you used originally and it will resume the job. If you do have ovs files, you can remove wash.ovlStore.BUILDING and Canu will re-build the store from the results files.
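
A quick way to check for ovs files (just a sketch; it assumes the sorted slices appear as numbered *.ovs files at the top of the BUILDING directory, like the 0001.ovs file mentioned later in this thread):

    ls correction/wash.ovlStore.BUILDING/*.ovs 2>/dev/null | wc -l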

WeichenZhou commented 7 years ago

Thanks for your kind reply!! Actually, the job resumed running at midnight 4 days ago, after my last message (I was a little surprised and do not know why). So Canu is still running, and it is writing logs such as correction/wash.ovlStore.BUILDING/logs/2-sort.$PBS_ARRAYID.out-1426. Hope it goes well. Do you have any idea how long it will take to complete? I appreciate your time. -Arthur

skoren commented 7 years ago

This step is completely I/O bound so it depends on the speed of your filesystem. Typically, on our system for a human genome, this step completes in a few hours on either GPFS or NFS. You can check how many jobs were submitted for the 2-sort step and how many log files in 2-sort.*.out end with "Success" to estimate what percent is done.
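
Something like this gives a rough progress estimate (a sketch; it assumes the log names follow the 2-sort.N.out-NNNN pattern you quoted, and that a finished job ends its log with a Success line):

    cd correction/wash.ovlStore.BUILDING
    total=$(ls logs/2-sort.*.out* | wc -l)
    finished=$(tail -qn 1 logs/2-sort.*.out* | grep -c Success)
    echo "$finished of $total sort jobs finished"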

WeichenZhou commented 7 years ago

Thank you for the quick reply. I realized that Canu is rerunning 2-sort now, because I saw all the logs one day ago. I checked canu-scripts and found there are 6 files:

8722 Dec 17 00:56 canu.01.out
22631 Dec 12 22:22 canu.01.sh
8640 Dec 25 00:02 canu.02.out
22667 Dec 24 09:56 canu.02.sh
211550 Dec 27 11:06 canu.03.out
22703 Dec 27 11:06 canu.03.sh

In canu.03.out:

-- Canu v1.3 (+367 commits) r7983 fd38ffccd8674d78312d343b0717e0d6b55d823d.
-- Detected Java(TM) Runtime Environment '1.8.0_111' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 2' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 1008 gigabytes of memory.
-- Detecting PBS/Torque resources.
-- 
-- Found   4 hosts with  16 cores and  125 GB memory under PBS/Torque control.
-- Found 468 hosts with  16 cores and   62 GB memory under PBS/Torque control.
-- Found 208 hosts with  24 cores and  125 GB memory under PBS/Torque control.
-- Found   6 hosts with   8 cores and   47 GB memory under PBS/Torque control.
-- Found   1 host  with  20 cores and   78 GB memory under PBS/Torque control.
-- Found   4 hosts with  40 cores and 1007 GB memory under PBS/Torque control.
-- Found 151 hosts with  20 cores and   94 GB memory under PBS/Torque control.
-- Found   3 hosts with  20 cores and  251 GB memory under PBS/Torque control.
-- Found   4 hosts with  32 cores and  503 GB memory under PBS/Torque control.
-- Found   5 hosts with  32 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1464 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1495 GB memory under PBS/Torque control.
-- Found   6 hosts with  20 cores and   62 GB memory under PBS/Torque control.
-- Found   5 hosts with  56 cores and 1511 GB memory under PBS/Torque control.
-- Found 399 hosts with  12 cores and   47 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  251 GB memory under PBS/Torque control.
--
-- Allowed to run under grid control, and use up to  16 compute threads and  251 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   78 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  20 compute threads and   78 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--
--   genomeSize          -- 3000000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- 1697 overlap store sorter jobs failed:
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0001 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0002 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0003 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0004 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0005 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0006 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0007 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0008 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0009 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0010 FAILED.
......
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/1698 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/1699 FAILED.
--
-- overlap store sorter attempt 2 begins with 39 finished, and 1697 to compute.
----------------------------------------
-- Starting command on Tue Dec 27 11:06:11 2016 with 284954.898 GB free disk space

      qsub \
        -m abe -V -d . -q fluxod -A remills_fluxod  -l pmem=64gb,qos=flux,walltime=100:00:00 -l nodes=2:ppn=2:largemem \
        -d `pwd` -N "ovS_wash_canu_test1" \
        -t 1-409 \
        -j oe -o /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/logs/2-sort.\$PBS_ARRAYID.out \
        /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/scripts/2-sort.sh 
-- Finished on Tue Dec 27 11:06:11 2016 (1 second) with 284954.898 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Tue Dec 27 11:06:11 2016 with 284954.898 GB free disk space

      qsub \
        -m abe -V -d . -q fluxod -A remills_fluxod  -l pmem=64gb,qos=flux,walltime=100:00:00 -l nodes=2:ppn=2:largemem \
        -d `pwd` -N "ovS_wash_canu_test1" \
        -t 412-1699 \
        -j oe -o /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/logs/2-sort.\$PBS_ARRAYID.out \
        /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/scripts/2-sort.sh 

-- Finished on Tue Dec 27 11:06:11 2016 (lickety-split) with 284954.898 GB free disk space
----------------------------------------

Warning: multiple IDs for job ovS_wash_canu_test1 got 3 and should have been 1.

----------------------------------------
-- Starting command on Tue Dec 27 11:06:13 2016 with 284954.989 GB free disk space

    qsub \
      -l pmem=64gb,qos=flux,walltime=100:00:00 \
      -l nodes=2:ppn=2:largemem \
      -m abe \
      -V \
      -d . \
      -q fluxod \
      -A remills_fluxod  \
      -W depend=afteranyarray:21848145[].nyx.arc-ts.umich.edu:21853804[].nyx.arc-ts.umich.edu:21853805[].nyx.arc-ts.umich.edu \
      -d `pwd` \
      -N "canu_wash_canu_test1" \
      -j oe \
      -o /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/canu-scripts/canu.03.out /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/canu-scripts/canu.03.sh

21853806.nyx.arc-ts.umich.edu

-- Finished on Tue Dec 27 11:06:13 2016 (lickety-split) with 284954.989 GB free disk space
----------------------------------------

So I am now a little confused about whether my program is running on the right track. Thank you so much!

--Arthur

skoren commented 7 years ago

It is OK that it retried, but it looks like the majority of the jobs are failing. What do some of the failed job logs show in their output (in logs/2-sort.*.out)?
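
To pull out just the failed logs, a sketch under the same assumption as before (a successful job ends its log with a Success line):

    for f in logs/2-sort.*.out*; do
      tail -n 1 "$f" | grep -q Success || echo "$f"
    done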

WeichenZhou commented 7 years ago

Thanks for your reply. This morning I got another canu.03.out, and it says:

-- Canu v1.3 (+367 commits) r7983 fd38ffccd8674d78312d343b0717e0d6b55d823d.
-- Detected Java(TM) Runtime Environment '1.8.0_111' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 2' (from 'gnuplot') and image format 'png'.
-- Detected 40 CPUs and 1008 gigabytes of memory.
-- Detecting PBS/Torque resources.
-- 
-- Found   4 hosts with  16 cores and  125 GB memory under PBS/Torque control.
-- Found 468 hosts with  16 cores and   62 GB memory under PBS/Torque control.
-- Found 208 hosts with  24 cores and  125 GB memory under PBS/Torque control.
-- Found   6 hosts with   8 cores and   47 GB memory under PBS/Torque control.
-- Found   1 host  with  20 cores and   78 GB memory under PBS/Torque control.
-- Found   4 hosts with  40 cores and 1007 GB memory under PBS/Torque control.
-- Found 151 hosts with  20 cores and   94 GB memory under PBS/Torque control.
-- Found   3 hosts with  20 cores and  251 GB memory under PBS/Torque control.
-- Found   4 hosts with  32 cores and  503 GB memory under PBS/Torque control.
-- Found   5 hosts with  32 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1464 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1495 GB memory under PBS/Torque control.
-- Found   6 hosts with  20 cores and   62 GB memory under PBS/Torque control.
-- Found   5 hosts with  56 cores and 1511 GB memory under PBS/Torque control.
-- Found 399 hosts with  12 cores and   47 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  251 GB memory under PBS/Torque control.
--
-- Allowed to run under grid control, and use up to  16 compute threads and  251 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   78 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  20 compute threads and   78 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   19 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #3, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 3000000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
--
-- 1697 overlap store sorter jobs failed:
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0001 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0002 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0003 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0004 FAILED.
......
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/1698 FAILED.
--   job /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/1699 FAILED.
--
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'failed to overlapStoreSorter.  Made 2 attempts, jobs still failed'.

When I checked correction/wash.ovlStore.BUILDING, there are 7284 files/directories inside, with very recent N and N.index files (0 bytes) and the same number of older N, N.index and N.evalueLen files (several GB/MB/KB each). I also checked correction/wash.ovlStore.BUILDING/logs; one of the 2-sort.*.out files says:

Running job 1497 based on PBS_ARRAYID=1497 and offset=0.
Max processes per user limited to 4127233, no increase possible.

Max open files limited to 32768, no increase possible.

Found 590709 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0004/sliceSizes'.
Found 1950949 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0008/sliceSizes'.
Found 1997091 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0012/sliceSizes'.
......
Found 25177598 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0401/sliceSizes'.
Overlaps need 3.77 GB memory, allowed to use up to (via -M) 4 GB.
Loading 590709 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0004/slice1497'.
Loading 1950949 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0008/slice1497'.
Loading 1997091 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0012/slice1497'.
......
Loading 25177598 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0401/slice1497'.
Sorting.
Writing output.
Writing 101183081 overlaps.

Failed with 'Segmentation fault'

Backtrace (mangled):

/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x4058fd]
/lib64/libpthread.so.0(+0xf100)[0x2aed9fc30100]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40ff4a]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40f329]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40d4ab]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x402b08]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aed9fe5eb15]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x403189]

Backtrace (demangled):

[0] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x4058fd]
[1] /lib64/libpthread.so.0::(null) + 0xf100  [0x2aed9fc30100]
[2] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40ff4a]
[3] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40f329]
[4] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40d4ab]
[5] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x402b08]
[6] /lib64/libc.so.6::(null) + 0xf5  [0x2aed9fe5eb15]
[7] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x403189]

GDB:

/var/spool/torque/mom_priv/jobs/21853805-1497.nyx.arc-ts.umich.edu.SC: line 78:  1518 Segmentation fault      (core dumped) $bin/ovStoreSorter -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job $jobid 407
Failure.

So I see a 'Segmentation fault' here. Is it because there is not enough memory? I think the job was assigned at least 64GB (across 4 cores), but the log says 'Overlaps need 3.77 GB memory, allowed to use up to (via -M) 4 GB'.

Thank you so much! -Arthur

skoren commented 7 years ago

Memory isn't an issue; it is going to use 4GB since that's what it was configured for, and requesting more won't affect it. It completed everything and was generating output, which makes me suspect an issue with file access. Can you try running one of the failed jobs by hand within the asm.ovlStore.BUILDING folder to see if it succeeds (you can just run sh scripts/2-sort.sh 1)? If that still crashes, try recompiling Canu with debug symbols (make clean && make BUILDDEBUG=1) and re-running within GDB to see where exactly it is failing.
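
Spelled out, the steps look roughly like this (a sketch; the canu source path is a placeholder for wherever you keep the source tree):

    cd correction/wash.ovlStore.BUILDING
    sh scripts/2-sort.sh 1

    # if it still segfaults, rebuild with debug symbols and try again:
    cd /path/to/canu/src
    make clean && make BUILDDEBUG=1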

WeichenZhou commented 7 years ago

Following your instruction, it turns out:

[arthurz@flux-login3 wash.ovlStore.BUILDING]$ sh scripts/2-sort.sh 1
Running job 1 based on command line options.
Changed max processes per user from 4096 to 192431 (max 192431).

Changed max open files from 1024 to 4096 (max 4096).

Found 28889424 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0001/sliceSizes'.
Found 16480997 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0002/sliceSizes'.
Found 16584953 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0003/sliceSizes'.
Found 38866132 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0004/sliceSizes'.
Overlaps need 3.76 GB memory, allowed to use up to (via -M) 4 GB.
Loading 28889424 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0001/slice001'.
Loading 16480997 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0002/slice001'.
Loading 16584953 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0003/slice001'.
Loading 38866132 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/bucket0004/slice001'.
Sorting.
Writing output.
Writing 100821506 overlaps.

Failed with 'Segmentation fault'

Backtrace (mangled):

/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x4058fd]
/lib64/libpthread.so.0(+0xf100)[0x2b3b7d76b100]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40ff4a]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40f329]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x40d4ab]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x402b08]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3b7d999b15]
/home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter[0x403189]

Backtrace (demangled):

[0] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x4058fd]
[1] /lib64/libpthread.so.0::(null) + 0xf100  [0x2b3b7d76b100]
[2] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40ff4a]
[3] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40f329]
[4] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x40d4ab]
[5] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x402b08]
[6] /lib64/libc.so.6::(null) + 0xf5  [0x2b3b7d999b15]
[7] /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter() [0x403189]

GDB:

scripts/2-sort.sh: line 78: 25486 Segmentation fault      (core dumped) $bin/ovStoreSorter -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job $jobid 407
Failure.

So I ran 'make clean && make BUILDDEBUG=1', and re-ran:

[arthurz@flux-login3 wash.ovlStore.BUILDING]$ sh scripts/2-sort.sh 1
Running job 1 based on command line options.
Changed max processes per user from 4096 to 192431 (max 192431).

Changed max open files from 1024 to 4096 (max 4096).

Job 1 is running or finished (remove '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0001.ovs' or -force to try again).
Success.

Sorry, I did not understand what "re-running within GDB" meant, but it reports a success after the BUILDDEBUG=1 build. So I am wondering if there is a way to resume the jobs. Thank you so much for your time.

-Arthur

skoren commented 7 years ago

After the debug build, run

rm /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/0001.ovs
gdb /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter
set args -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore <fill in -F and -job options from your 2-sort.sh script replacing $jobid with 1>
r

This will run for a while in the debugger. When it crashes, run

bt

And post the output, that should give the trace of the line where the program is failing.

WeichenZhou commented 7 years ago

Thank you so much for your quick reply! Sorry, I forgot to run bt at first. So after r, it shows:

Starting program: /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job $jobid 1
warning: File "/sw/arcts/centos7/gcc/5.4.0/lib64/libstdc++.so.6.0.21-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
    add-auto-load-safe-path /sw/arcts/centos7/gcc/5.4.0/lib64/libstdc++.so.6.0.21-gdb.py
line to your configuration file "/home/arthurz/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/arthurz/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaab9db167 in ____strtoll_l_internal () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64

Then I created "/home/arthurz/.gdbinit", added the line, and re-ran GDB:

Reading symbols from /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter...(no debugging symbols found)...done.
(gdb) set args -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job $jobid 1
(gdb) r
Starting program: /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job $jobid 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaab9db167 in ____strtoll_l_internal () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64
(gdb) bt
#0  0x00002aaaab9db167 in ____strtoll_l_internal () from /lib64/libc.so.6
#1  0x0000000000402d08 in main ()

-Arthur

skoren commented 7 years ago

No, this error is because you passed $jobid, which is invalid; it should be 1. The 2-sort.sh script should have a line like -job $jobid <large number>, where <large number> is the number of partitions you had. You need to set -job 1 <large number>, not -job $jobid 1.

WeichenZhou commented 7 years ago

I see, thank you so much for your correction and patience!

[arthurz@flux-login3 ~]$ gdb /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter...(no debugging symbols found)...done.
(gdb) set args -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job 1 407
(gdb) r
Starting program: /home/arthurz/app/canu/Linux-amd64/bin/ovStoreSorter -deletelate -M 4 -O /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING/ -G /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore -F 1736 -job 1 407
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Found 28889424 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0001/sliceSizes'.
Found 16480997 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0002/sliceSizes'.
Found 16584953 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0003/sliceSizes'.
Found 38866132 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0004/sliceSizes'.
Overlaps need 3.76 GB memory, allowed to use up to (via -M) 4 GB.
Loading 28889424 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0001/slice001'.
Loading 16480997 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0002/slice001'.
Loading 16584953 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0003/slice001'.
Loading 38866132 overlaps from '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket0004/slice001'.
Sorting.
Writing output.
Writing 100821506 overlaps.

Program received signal SIGSEGV, Segmentation fault.
0x000000000040ff4a in ovStoreHistogram::addOverlap(ovOverlap*) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64
(gdb) bt
#0  0x000000000040ff4a in ovStoreHistogram::addOverlap(ovOverlap*) ()
#1  0x000000000040f329 in ovFile::writeOverlap(ovOverlap*) ()
#2  0x000000000040d4ab in ovStoreWriter::writeOverlaps(ovOverlap*, unsigned long) ()
#3  0x0000000000402b08 in main ()

-Arthur

skoren commented 7 years ago

OK, that gives some information; this code is new since Canu 1.3. Can you share your /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.gkpStore and all overlaps for one bucket (so all files named /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket000*/slice001 and /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/wash.ovlStore.BUILDING//bucket000*/sliceSizes)? You can put them on our ftp site: ftp://ftp.cbcb.umd.edu/incoming/sergek
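
Packaging and uploading could look like this (only a sketch - the tarball name is arbitrary, and it assumes curl is available; any ftp client will do):

    cd /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction
    tar czf CANU_308_upload.tar.gz \
        wash.gkpStore \
        wash.ovlStore.BUILDING/bucket000*/slice001 \
        wash.ovlStore.BUILDING/bucket000*/sliceSizes
    curl -T CANU_308_upload.tar.gz ftp://ftp.cbcb.umd.edu/incoming/sergek/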

WeichenZhou commented 7 years ago

I just uploaded the files you asked for to the ftp site. The file is named CANU_308_Arthur_upload.tar.gz. Thank you so much for debugging this. Happy new year :) -Arthur

brianwalenz commented 7 years ago

I finally had a chance to look at this.

There seems to be one overlap job (in 1-overlapper) with corrupt output data: job number 2. But I only have partial output from four jobs. Given that every one of the sorting jobs failed, I'm guessing there are many more overlap jobs with corrupt data. Someone else reported something like this, and we just restarted.

Before we do that, can you attach the shell scripts from correction/1-overlapper (you'll need to attach them to the web version, not an email reply)? I'd like to try to reproduce the error on our machines.

WeichenZhou commented 7 years ago

Thank you so much for your reply. I hope this problem can be handled soon. In correction/1-overlapper there are four .sh files; I changed their suffix to .txt for uploading.

mhap.txt mhap.jobSubmit.txt precompute.jobSubmit.txt precompute.txt

Thank you so much! --Arthur

skoren commented 7 years ago

One possibility is that your grid was killing some jobs and Canu did not detect the error. Can you run the command

grep "Read, hashed, and stored file" precompute*.out | wc -l

in your 1-overlapper folder? There should be 164 total, since that is how many precompute jobs you had. The submit command also doesn't look correct: it is asking for 64g and nodes=2:ppn=2, whereas the actual process is trying to use 19g and 5 cores. Are you still using:

gridEngineThreadsOption="-l nodes=3:ppn=4" gridEngineMemoryOption="-l pmem=64gb,qos=flux,walltime=100:00:00"

from an earlier comment? That is not correct: it will try to reserve three machines with 4 cores each, whereas all parallel jobs in Canu use shared-memory threads and therefore have to be on a single machine. It will also always request the same memory limit rather than letting Canu specify one. This has probably been at least part of the cause of your problems: your grid has machines with >64g of RAM, so some processes will be configured to use them and then get killed because only 64gb was requested. You should use:

gridEngineThreadsOption="-l nodes=1:ppn=THREADS" gridEngineMemoryOption="-l pmem=MEMORY" gridEngineOptions="qos=flux,walltime=100:00:00"

The last parameter is meant to specify common options required by all submitted jobs.

WeichenZhou commented 7 years ago

Thank you for the quick reply. For the first command, the result is 163 instead of 164. As for the second point, I have used those options from the beginning, when I first submitted the Canu job. So you mean I should change the command specifically for the 1-overlapper step - is my understanding right? Thank you! --Arthur

skoren commented 7 years ago

So I'm pretty sure that's the error. Can you see which file is missing that output and upload it? Unfortunately, this means you will need to go back and re-run the overlapping step. To do so, remove:

rm -rf correction/2-correction
rm -rf correction/1-overlapper/results/*
rm -rf correction/asm.ovlStore*

Also remove any output in correction/1-overlapper/blocks/ matching the missing file. So if job precompute.10.out is the one missing the final output, remove blocks/0010.*. Make a backup first to be safe, if you'd like.

Then you should be OK to re-run Canu and have it recompute the failed set. For the command, apply the modifications I suggested to everything: re-run Canu with my suggested changes and it should resume and re-compute the failed jobs. Your current command is not correctly allocating memory/processors.

WeichenZhou commented 7 years ago

Yes, precompute.70.out-70 seems to have been killed:

Running job 70 based on PBS_ARRAYID=70 and offset=0.
Dumping reads from 3933001 to 3990000 (inclusive).
Starting mhap precompute.
Running with these settings:
--filter-threshold = 5.0E-6
--help = false
--max-shift = 0.2
--min-olap-length = 116
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 256
--num-min-matches = 3
--num-threads = 5
--ordered-kmer-size = 14
--ordered-sketch-size = 1000
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = false
--supress-noise = 0
--threshold = 0.8
--version = false
-f = /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/0-mercounts/wash_U.ms16.frequentMers.ignore.gz
-h = false
-k = 16
-p = /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/1-overlapper/blocks/000070.fasta
-q = /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/1-overlapper/blocks
-s = 

Reading in filter file /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/0-mercounts/wash_U.ms16.frequentMers.ignore.gz.
Time (s) to read filter file: 0.145231995
Read in k-mer filter for sizes: [16]
Processing FASTA files for binary compression...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...

Also, I have no correction/2-correction directory, and I cannot see any blocks/ directory under correction/. After I delete the files, I re-run Canu with gridEngineThreadsOption="-l nodes=1:ppn=4" gridEngineMemoryOption="-l pmem=64gb" gridEngineOptions="qos=flux,walltime=100:00:00", right? Actually, I used the command like this:

useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=2:ppn=2:largemem" gridEngineMemoryOption="-l pmem=64gb,qos=flux,walltime=100:00:00" gridOptions="-m abe -V -d . -q fluxod -A remills_fluxod"

Following your suggestion, I think it should be changed as:

useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=1:ppn=4:largemem" gridEngineMemoryOption="-l pmem=64gb" gridOptions="qos=flux,walltime=100:00:00, -m abe -V -d . -q fluxod -A remills_fluxod"

Thank you so much for your patience. --Arthur

skoren commented 7 years ago

It is OK that you don't have the 2-correction folder; it is created later. Just as a sanity check, to make sure partition 70 was causing the errors, can you post the result of

ls /home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/1-overlapper/queries/000002/

No. Canu uses THREADS and MEMORY as placeholders to fill in at runtime, so you should leave THREADS/MEMORY in your command:

useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=1:ppn=THREADS:largemem" gridEngineMemoryOption="-l pmem=MEMORY" gridOptions="qos=flux,walltime=100:00:00, -m abe -V -d . -q fluxod -A remills_fluxod"

WeichenZhou commented 7 years ago

Thanks! Now I see the 1-overlapper/blocks/ directory; sorry for my mistake. The files in 1-overlapper/blocks/ look like this:

...
1032604400 Dec 14 15:04 000068.dat
1032606832 Dec 15 10:38 000069.dat
125815385 Dec 15 11:09 000070.dat
1249451934 Dec 15 10:39 000070.fasta
1032601584 Dec 15 14:15 000071.dat
1032600208 Dec 15 14:23 000072.dat 
...

--Arthur

skoren commented 7 years ago

Yes so

125815385 Dec 15 11:09 000070.dat
1249451934 Dec 15 10:39 000070.fasta

are what you want to remove. What is in the 1-overlapper/queries/000002 folder? I want to make sure the corrupted overlap file that @brianwalenz saw did actually try to access block 70; otherwise something else is going wrong.

WeichenZhou commented 7 years ago

Thanks!! In the 1-overlapper/queries/000002 folder, there are 41 entries, all soft links:

000043.dat  000047.dat  000051.dat  000055.dat  000059.dat  000063.dat  000067.dat  000071.dat  000075.dat  000079.dat  000083.dat
000044.dat  000048.dat  000052.dat  000056.dat  000060.dat  000064.dat  000068.dat  000072.dat  000076.dat  000080.dat
000045.dat  000049.dat  000053.dat  000057.dat  000061.dat  000065.dat  000069.dat  000073.dat  000077.dat  000081.dat
000046.dat  000050.dat  000054.dat  000058.dat  000062.dat  000066.dat  000070.dat  000074.dat  000078.dat  000082.dat

--Arthur

skoren commented 7 years ago

OK, so the bad dat file was used in that job, which explains why the overlaps were corrupt. I think removing the files and re-running Canu should fix your issue.

WeichenZhou commented 7 years ago

Thank you so much! I removed the files as instructed:

rm -rf correction/2-correction
rm -rf correction/1-overlapper/results/*
rm -rf correction/asm.ovlStore*
rm correction/1-overlapper/block/00070.*

After running the command, I got an error like this:

-- Detected Java(TM) Runtime Environment '1.8.0_111' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 2' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 1008 gigabytes of memory.
-- Detecting PBS/Torque resources.
-- 
-- Found   5 hosts with  32 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1464 GB memory under PBS/Torque control.
-- Found   1 host  with  56 cores and 1495 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  125 GB memory under PBS/Torque control.
-- Found   5 hosts with  20 cores and   62 GB memory under PBS/Torque control.
-- Found 468 hosts with  16 cores and   62 GB memory under PBS/Torque control.
-- Found 204 hosts with  24 cores and  125 GB memory under PBS/Torque control.
-- Found   5 hosts with  56 cores and 1511 GB memory under PBS/Torque control.
-- Found   5 hosts with   8 cores and   47 GB memory under PBS/Torque control.
-- Found 400 hosts with  12 cores and   47 GB memory under PBS/Torque control.
-- Found 149 hosts with  20 cores and   94 GB memory under PBS/Torque control.
-- Found   4 hosts with  40 cores and 1007 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  251 GB memory under PBS/Torque control.
-- Found   3 hosts with  20 cores and  251 GB memory under PBS/Torque control.
-- Found   4 hosts with  32 cores and  503 GB memory under PBS/Torque control.
--
-- Allowed to run under grid control, and use up to  16 compute threads and  251 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   94 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  20 compute threads and   94 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 3000000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
overlapper output '/home/arthurz/arthur_remflux_scratch/16.12.05.canu_consensus/correction/1-overlapper/results/000001.ovb' not found

--Arthur

brianwalenz commented 7 years ago

You also need to remove the ovljob.files file that should be in 1-overlapper/.
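
With the relative paths used in the cleanup above, that would be something like (a sketch; adjust to your run directory):

rm correction/1-overlapper/ovljob.files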

However, I'd suggest abandoning this run and starting over from scratch with updated code, for three reasons: (1) there has been enough oddness in this run that I fear some unknown problem exists; (2) I don't know if recomputing some overlaps in the middle of building an overlap store will succeed -- it's never been done before; (3) your code is slightly out of date and is not compatible with the latest code -- meaning that you can't update your code and continue the assembly.

The two commits referenced above explicitly check for the overlap error you encountered, and will catch it during the overlap phase.

Close the issue if you are restarting, as it's diverged significantly from when it was created.

WeichenZhou commented 7 years ago

Thank you so much for your reply and all your help. I will rerun using v1.4. --Arthur

WeichenZhou commented 7 years ago

Since this issue is not yet closed, I have a question about running the new version. I used the command:

useGrid=true gridEngine="pbs" gridOptionsJobName="canu_test1" gridEngineThreadsOption="-l nodes=1:ppn=THREADS:largemem" gridEngineMemoryOption="-l pmem=MEMORY" gridOptions="-l qos=flux,walltime=100:00:00 -m abe -V -d . -q fluxod -A remills_fluxod" 

It requested 20 cores with 96 GB each for the first step of the program. However, our computing platform only allows a 1 TB overall memory request. Therefore, I changed the default gridEngineThreadsOption="-l nodes=1:ppn=THREADS:largemem" to gridEngineThreadsOption="-l nodes=1:ppn=10:largemem". I am wondering whether this is right, or whether it will cause a crash later in the run. Thank you so much for your time! --Arthur

skoren commented 7 years ago

What do you mean it requires 20 cores with 96 GB each? When Canu reports a header like this:

-- Allowed to run under grid control, and use up to  16 compute threads and  251 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   94 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    3 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to  20 compute threads and   94 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   12 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   5 compute threads and   20 GB memory for stage 'minimap (overlapper)'.

That is the total memory for the job across all cores, so it will request only 96 GB total, not 96 GB * 20. The pmem parameter is per job, not per core, according to the PBSPro documentation, so it should be setting pmem=96g. Was it requesting more?
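
If your cluster were instead applying pmem per core, a per-job limit could be requested with mem; a sketch, assuming your site's PBS accepts that resource:

gridEngineMemoryOption="-l mem=MEMORY"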

Generally, you should not change gridEngineThreadsOption to control how many cores you want to use. That flag tells Canu how to request threads, so changing it breaks the requests for all steps. If you'd like, you can instead customize memory/threads for individual steps; to list those options, try canu -options | grep -P "(grid|Memory|Threads)".
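
For instance, to rein in the 'meryl (k-mer counting)' stage shown in the header above, you could append per-step options like these to your canu command line (a sketch; merylMemory/merylThreads appear in that option list, and the values here are illustrative):

canu -options | grep -P "(grid|Memory|Threads)"

merylMemory=64 merylThreads=16

Most stages shown in the resource header have a corresponding Memory/Threads option pair.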

WeichenZhou commented 7 years ago

Thank you for your suggestion. I would say it was requesting 96 GB per core because of pmem, so I will change it to mem and see how it works.

--Arthur

WeichenZhou commented 7 years ago

Should I open a new issue, or just continue this one? After 8 days of running, my Canu 1.4 got stuck in the cor_wash_U_canu_test2-134 process. I checked /correction/2-correction/correction_outputs: there are 130 .fasta files, each with a zero-byte .dump.success and a zero-byte .err, and four jobs look incomplete: 0131.fasta.WORKING with 0131.err, 0132.fasta.WORKING with 0132.err, 0133.fasta.WORKING with 0133.err, and 0134.fasta.WORKING with 0134.err. It looks like jobs 131 to 134 were killed, since I only received job-start emails but no job-completion emails from our computing cluster. I would appreciate it if you could help solve this or suggest a way to resume these jobs. --Arthur

brianwalenz commented 7 years ago

Yes, please open a new issue. First, look in the error files and check canu-scripts/*.out.
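
For example (a sketch, run from the assembly directory, with file names following the listing above):

cat correction/2-correction/correction_outputs/0131.err

tail canu-scripts/*.out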