noor-albader commented 4 years ago

I'm not sure how to overcome a unitigging failure. Are there intermediate files I can remove to restart canu? Is there a step I have to run manually beforehand to restart canu?

Using Canu 1.8 (as reported by canu -version), running on a grid. My canu command:

#SBATCH --time=240:10:00
module load canu/1.8/gnu6.4.0;
time canu -d palm_canu -p palm genomeSize=1000m -pacbio-raw /home/albadenm/c2042/data/palm/assembly_ready/palm.PB.fa.gz usegrid=1 gridOptions="--time=5-00:00:00 --partition=batch --mem-per-cpu=16g" gridOptionsJobName=palm-using-grid
echo "canu is done!"

My canu.out log with crash report:

Found perl:
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   openjdk version "1.8.0_212"

Found canu:
   Canu 1.8

-- Detected Java(TM) Runtime Environment '1.8.0_212' (from 'java') with -d64 support.
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- Detected 40 CPUs and 377 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /opt/slurm/cluster/ibex/install/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1048575 jobs.
-- Found  10 hosts with  20 cores and  246 GB memory under Slurm control.
-- Found   2 hosts with  64 cores and 2010 GB memory under Slurm control.
-- Found   2 hosts with  16 cores and  246 GB memory under Slurm control.
-- Found   4 hosts with  32 cores and 3007 GB memory under Slurm control.
-- Found   6 hosts with  64 cores and  990 GB memory under Slurm control.
-- Found   1 host  with  16 cores and  246 GB memory under Slurm control.
-- Found 154 hosts with  20 cores and  118 GB memory under Slurm control.
-- Found   3 hosts with  64 cores and 1506 GB memory under Slurm control.
-- Found  16 hosts with  36 cores and  246 GB memory under Slurm control.
-- Found   1 host  with  64 cores and 1375 GB memory under Slurm control.
-- Found  74 hosts with  16 cores and   54 GB memory under Slurm control.
-- Found   2 hosts with  16 cores and  120 GB memory under Slurm control.
-- Found  22 hosts with  64 cores and  246 GB memory under Slurm control.
-- Found  14 hosts with  48 cores and 3007 GB memory under Slurm control.
-- Found 108 hosts with  40 cores and  366 GB memory under Slurm control.
-- Found   1 host  with  64 cores and  498 GB memory under Slurm control.
-- Found   1 host  with  80 cores and 1503 GB memory under Slurm control.
-- Found   1 host  with  80 cores and 2010 GB memory under Slurm control.
-- Found  16 hosts with  16 cores and  118 GB memory under Slurm control.
-- Found   1 host  with  16 cores and   50 GB memory under Slurm control.
-- Found   3 hosts with  64 cores and  750 GB memory under Slurm control.
-- Found  16 hosts with  32 cores and  366 GB memory under Slurm control.
-- Found   2 hosts with  64 cores and  988 GB memory under Slurm control.
-- Found  30 hosts with  48 cores and  745 GB memory under Slurm control.
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     25 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   16 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB    4 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256 GB   16 CPUs  (contig construction with bogart)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)
-- In 'palm.seqStore', found PacBio reads:
--   Raw:        4967706
--   Corrected:  947614
--   Trimmed:    915230
-- Generating assembly 'palm' in '/ibex/scratch/projects/c2042/analysis/genome_assembly_palm/palm_canu/palm_canu'
-- Parameters:
--  genomeSize        1000000000
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
-- Creating overlap store unitigging/palm.ovlStore using:
--      2 buckets
--      2 slices
--        using at most 16 GB memory each
-- Overlap store bucketizer jobs failed, tried 2 times, giving up.
--   job unitigging/palm.ovlStore.BUILDING/bucket0002 FAILED.

ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.

My unitigging/ovlStore.BUILDING/logs report:

Found perl:
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   openjdk version "1.8.0_212"

Found canu:
   Canu 1.8

Running job 2 based on SLURM_ARRAY_TASK_ID=2 and offset=0.

Attempting to increase maximum allowed processes and open files.
  Max processes per user limited to 1542347, no increase possible.
  Max open files limited to 131072, no increase possible.

Overwriting incomplete result from presumed crashed job in directory '.

Opened '../palm.seqStore' with 4967706 reads.

Constructing slice 2 for store './palm.ovlStore.BUILDING'.
 - Filtering overlaps over 1.0000 fraction error.

Bucketizing input    1 out of   99 - '1-overlapper/001/000002.ovb'
Bucketizing input    2 out of   99 - '1-overlapper/001/000004.ovb'
Bucketizing input    3 out of   99 - '1-overlapper/001/000006.ovb'
ERROR: short read on file '1-overlapper/001/000006': read 0 bytes, expected 13715.
skoren commented 4 years ago

You have disk corruption in one of the files, so you'll have to go back and re-run the jobs that produced the corrupted files. Remove the palm.ovlStore.BUILDING/ folder, cd to the 1-overlapper folder, remove the 001/000006.* files, run sh 6, and then re-run the canu command you used before from the top level. It may fail again if more than one file is corrupt, in which case you'd have to follow the same steps for that file as well. Also make sure you're not out of space on your system.
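
Since more than one file may be affected, it can help to scan for other obviously damaged overlap outputs before re-running. The following is only a minimal sketch: the function name list_empty_ovb is made up for illustration and is not part of Canu, and an empty file is just one symptom of corruption (a truncated file with a non-zero size would not be caught).

```shell
#!/bin/sh
# Hypothetical helper (not part of Canu): list overlap output files that
# contain no data at all. An empty .ovb is only one possible sign of a
# damaged job output; truncated non-empty files are not detected.
list_empty_ovb() {
    dir="${1:-1-overlapper}"   # directory to scan; default assumes cwd is unitigging/
    # -size 0 matches files with a size of zero bytes
    find "$dir" -type f -name '*.ovb' -size 0
}
```

Run from the unitigging directory as list_empty_ovb 1-overlapper; no output means no empty .ovb files were found.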

noor-albader commented 4 years ago

Hi, I was able to remove the palm.ovlStore.BUILDING/ folder and the 1-overlapper/001/000006.* files.

I'm not sure what you mean by sh 6. I don't know how to run it, since I can't find it in the canu module. Also, where would I run the script?


skoren commented 4 years ago

As I said in the initial reply, you have to go into the overlapper folder to run it. The script is generated by Canu already:

cd unitigging/1-overlapper
sh 6
cd ../../../
<re-run initial canu command assuming above is successful>

You should do this on a node with at least 16 GB of memory reserved, as that is what your overlap jobs expect to have available.
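
One way to get a node with that much memory under Slurm is an interactive allocation. This is a sketch only: the srun options --partition, --mem, --cpus-per-task, --time and --pty are standard Slurm, but the partition name batch is taken from the gridOptions in the original post, the 4 CPUs come from the utgovl row of the resource table in the log, and the 2-hour limit is an arbitrary assumption. The function only prints the command so it can be reviewed before use.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an srun command that requests an
# interactive shell sized like the utgovl overlap jobs (16 GB, 4 CPUs).
print_interactive_request() {
    mem="${1:-16g}"           # memory to reserve, per the advice above
    cpus="${2:-4}"            # CPUs, from the utgovl row in the canu log
    partition="${3:-batch}"   # assumption: partition from the original gridOptions
    # The 02:00:00 time limit is an arbitrary placeholder.
    echo "srun --partition=$partition --mem=$mem --cpus-per-task=$cpus --time=02:00:00 --pty bash"
}
```

Once the shell opens on the compute node, cd into unitigging/1-overlapper and run the job script from there.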

noor-albader commented 4 years ago

Thank you for your reply! But I do not see that script in my unitigging/1-overlapper.

Here are the contents of my unitigging/1-overlapper:

Is there another way to run sh 6 or download that script from somewhere?

skoren commented 4 years ago

The script is listed in your output:

-rwxr-xr-x 1 albadenm ibex-c2042 27000 Jan 10 18:23

It's created at runtime by canu so there's nowhere to download it from but you should be able to run it.

noor-albader commented 4 years ago

Thank you for your quick responses! Super useful!

After removing the palm.ovlStore.BUILDING/ folder and 1-overlapper/001/000006.*

I was able to perform with no error the following:

cd unitigging/1-overlapper
sh 6

and three new files (1-overlapper/001/000006.*) were created, one of which was labeled stats:

head  000006.stats
 Kmer hits without olaps = 74114865
 Kmer hits with olaps = 303985
 Multiple overlaps/pair = 0
 Total overlaps produced = 303985
      Contained overlaps = 81441
       Dovetail overlaps = 222544
Rejected by short window = 0
Rejected by long window = 0

But when trying to rerun the original canu command:

cd ../../../
<re-run initial canu command assuming above is successful>

I get the same error in my unitigging/ovlStore.BUILDING/logs report and canu.out in my original post.

Should I try re-running Canu from scratch? It doesn't seem like I can overcome the following error: ERROR: short read on file '1-overlapper/001/000006': read 0 bytes, expected 13715

skoren commented 4 years ago

What are the sizes of the 1-overlapper/001/000006* files? Have you confirmed you're not running out of disk/quota space?
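
Both checks can be done from the unitigging directory with standard tools. The sketch below is an illustration, not something Canu provides: the function names are made up, and df only reports filesystem free space, so any per-user or per-project quota has to be checked with whatever tool the site offers.

```shell
#!/bin/sh
# Hypothetical helper: print the size in bytes of every file whose name
# starts with the given prefix, e.g. 1-overlapper/001/000006
report_sizes() {
    prefix="$1"
    for f in "$prefix"*; do
        [ -f "$f" ] || continue            # skip when the glob matched nothing
        size=$(wc -c < "$f" | tr -d ' ')   # wc -c is portable; strip padding
        printf '%s\t%s\n' "$size" "$f"
    done
}

# Hypothetical helper: free space on the filesystem holding a directory
# (defaults to the current directory). Quotas are NOT shown by df.
report_free_space() {
    df -h "${1:-.}"
}
```

For example, report_sizes 1-overlapper/001/000006 followed by report_free_space . covers the two questions above.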

noor-albader commented 4 years ago

Here are my 1-overlapper/001/000006* (also 000005* and 000004* for reference ):

-rw-r--r-- 1 albadenm ibex-c2042 19870840 Jan 10 20:25 000004.oc
-rw-r--r-- 1 albadenm ibex-c2042  4046954 Jan 10 20:25 000004.ovb
-rw-r--r-- 1 albadenm ibex-c2042      258 Jan 10 20:25 000004.stats
-rw-r--r-- 1 albadenm ibex-c2042 19870840 Jan 10 21:03 000005.oc
-rw-r--r-- 1 albadenm ibex-c2042  5612616 Jan 10 21:03 000005.ovb
-rw-r--r-- 1 albadenm ibex-c2042      258 Jan 10 21:03 000005.stats
-rw-r--r-- 1 albadenm ibex-c2042 19870840 Jan 21 06:17 000006.oc
-rw-r--r-- 1 albadenm ibex-c2042  6009271 Jan 21 06:17 000006.ovb
-rw-r--r-- 1 albadenm ibex-c2042      258 Jan 21 06:17 000006.stats

human readable file sizes:

du -h 
5.8M    000006.ovb
19M     000006.oc

I have over 30T left, so it's not running out of disk/quota space.

skoren commented 4 years ago

Very strange. It seems like the file is truncated, but the size looks OK and I assume you saw no errors when you re-ran it. Are you able to share this data? See the FAQ for instructions to send it to us. We'd need the offending overlap files unitigging/1-overlapper/001/000002/4/6* along with unitigging/1-overlapper/overlap.8265962_6.out and any sh files in the unitigging/ovlStore.BUILDING/scripts folder.

noor-albader commented 4 years ago

There is no unitigging/1-overlapper/001/000002/4/6*. The directory only goes as deep as unitigging/1-overlapper/001/

I can send the unitigging/1-overlapper/001/000006* files, along with unitigging/1-overlapper/overlap.8265962_6.out

skoren commented 4 years ago

I mean all the overlapping output files needed for that bucket (unitigging/1-overlapper/001/000002*; unitigging/1-overlapper/001/000004*; unitigging/1-overlapper/001/000006*). The shell scripts in the ovlStore.BUILDING folder are also important as they capture how the bucket is being created on your system.

noor-albader commented 4 years ago

Sorry, there is a time difference between us. When I sent my last comment (~24 hrs ago), I shared the following data using the FAQ instructions, and I hope you have received it: unitigging/1-overlapper/001/000006* and unitigging/1-overlapper/overlap.8265962_6.out

Would you like me to also send over all the files in unitigging/1-overlapper/001/* for this bucket (all even numbers)?

skoren commented 4 years ago

Please send only the unitigging/1-overlapper/001/000002* and unitigging/1-overlapper/001/000004* files; those are the only ones needed in addition to unitigging/1-overlapper/001/000006. Also send over all files in the folder unitigging/ovlStore.BUILDING/scripts.
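
If it is easier to upload a single file, the requested paths could be bundled first. This is just a convenience sketch and not something the FAQ instructions are known to require: bundle_files is a made-up name, and the archive name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: pack the given files and directories into one
# gzip-compressed tar archive.
bundle_files() {
    archive="$1"   # name of the archive to create
    shift          # remaining arguments are the paths to include
    tar -czf "$archive" "$@"
}

# Example call, run from the assembly directory (archive name is a placeholder):
#   bundle_files palm_bucket_files.tar.gz \
#       unitigging/1-overlapper/001/000002* \
#       unitigging/1-overlapper/001/000004* \
#       unitigging/ovlStore.BUILDING/scripts
```

Listing the result with tar -tzf on the archive shows what was actually included before it is sent.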

noor-albader commented 4 years ago

Sorry for the late reply!

Ok I have already sent (but will resend, just in case!): 1-overlapper/001/000006*

Additionally, I will now send the: unitigging/1-overlapper/001/000002* unitigging/1-overlapper/001/000004* unitigging/ovlStore.BUILDING/scripts

noor-albader commented 4 years ago

Just kidding, the FTP server would not let me place the 1-overlapper/001/000006* files in your folder again.

All of the other files I was able to transfer to you.

Thank you

skoren commented 4 years ago

Thanks, I got the files. I'm going to look at the overlaps in each, but in the meantime, the output of the job with the issue looks as if multiple instances were running concurrently:

Starting 1-242523 with 7579 per thread

Thread 00 processes reads 1-7579


Thread 00 writes    reads 1-7579 (10993 overlaps 10993/2530953/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 30317-37895


mv: cannot stat ‘./001/000006.ovb.WORKING’: No such file or directory

Concurrent jobs could definitely cause an issue (the handling for this condition was improved in 1.9). Do any of your other output files have a similar error message? Is it possible this job got run twice (by hand and by Canu) at the same time?
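
A quick way to check the other job logs for the same message is a recursive-free grep over the .out files. This is a rough sketch with made-up function names, assuming the job logs sit next to the job script in unitigging/1-overlapper and end in .out, as overlap.8265962_6.out does. A leftover .WORKING file may simply mean a job is still running, so it is a hint rather than proof of a collision.

```shell
#!/bin/sh
# Hypothetical helper: list job logs in a directory that contain the
# "cannot stat" message printed by a failed mv.
find_mv_failures() {
    dir="${1:-unitigging/1-overlapper}"
    # -l prints only the names of matching files. Errors (no *.out files)
    # and the non-zero exit status on "no match" are deliberately ignored.
    grep -l 'mv: cannot stat' "$dir"/*.out 2>/dev/null || true
}

# Hypothetical helper: list temporary outputs that were never renamed.
find_working_files() {
    find "${1:-unitigging/1-overlapper}" -type f -name '*.WORKING'
}
```

Both functions print nothing when there is nothing to report.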

noor-albader commented 4 years ago

Yes, I saw this error after I ran the job by hand, did not realize it had failed, and then ran canu: './001/000006.ovb.WORKING’: No such file or directory

As for "Is it possible this job got run twice (by hand and by Canu) at the same time?": yes, once. But this error only appeared after I got the original error from my first post: job unitigging/palm.ovlStore.BUILDING/bucket0002 FAILED

In the meantime, I have started a new instance of the assembly. Hopefully I won't get the same (original) error.

skoren commented 4 years ago

Any updates?

skoren commented 4 years ago

Closing, idle. Not able to reproduce locally (the user's files are corrupt, but we haven't seen this corruption locally), and it seems like it may be a collision of the same jobs running simultaneously. There was a fix after Canu 1.8 to resolve this race condition (which may be due to some Slurm versions not holding jobs properly).