Failed to find the number of jobs (Meryl)

c4derpillar commented 6 years ago

Hi there,

Canu runs fine on a single node of my HPC; however, it fails when I use it with the PBSPro grid. Jobs seem to start, and then Meryl errors out, reporting that it is unable to 'find the number of jobs'.

Any idea what is happening? There is a strange line in ecoli5.ms16.config.01.out:

"**Don't know what to do with '../../ecoli5.seqStore'.**"

  Example:  Find the highest count of each kmer present in both files, save the kmers to
            database 'maxCount'.

            meryl intersect-max input1 input2 output maxCount

  Example:  Find unique kmers common to both files.  Brackets are necessary
            on the first 'equal-to' command to prevent the second 'equal-to' from
            being used as an input to the first 'equal-to'.

            meryl intersect [equal-to 1 input1] equal-to 1 input2

**Don't know what to do with '../../ecoli5.seqStore'.**

My canu.out looks like this:

########################### Execution Started #############################
JobId:66073.flm1
UserName:taylorwass
GroupName:qris-uq
ExecutionHost:fl017
###############################################################################

Found perl:
   /bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /usr/java/latest/bin/java
   java version "1.8.0_101"

Found canu:
   /gpfs1/scratch/30days/taylorwass/nanopore/canu-1.8/Linux-amd64/bin/canu
   Canu 1.8

-- Canu 1.8
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from '/usr/java/latest/bin/java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 504 gigabytes of memory.
-- Detecting PBSPro resources.
--
-- Found  33 hosts with  24 cores and  503 GB memory under PBSPro control.
-- Found   1 host  with  24 cores and 4473 GB memory under PBSPro control.
-- Found   1 host  with  24 cores and 2481 GB memory under PBSPro control.
-- Found   1 host  with  24 cores and  956 GB memory under PBSPro control.
-- Found   1 host  with  48 cores and  503 GB memory under PBSPro control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     12 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6 GB   12 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     4 GB    8 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       16 GB    4 CPUs  (contig construction with bogart)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'ecoli5' in '/gpfs1/scratch/30days/taylorwass/nanopore/canu-1.8/Linux-amd64/ecoli5-oxford'
-- Parameters:
--
--  genomeSize        4800000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Sun Nov  4 22:24:49 2018 with 55232.884 GB free disk space

    cd .
    /gpfs1/scratch/30days/taylorwass/nanopore/canu-1.8/Linux-amd64/bin/sqStoreCreate \
      -o ./ecoli5.seqStore.BUILDING \
      -minlength 1000 \
      ./ecoli5.seqStore.ssi \
    > ./ecoli5.seqStore.BUILDING.err 2>&1

-- Finished on Sun Nov  4 22:24:51 2018 (2 seconds) with 55232.818 GB free disk space
----------------------------------------
--
-- In sequence store './ecoli5.seqStore':
--   Found 20365 reads.
--   Found 140042151 bases (29.17 times coverage).
--   Read length histogram (one '*' equals 41.48 reads):
--     1000   1999    706 *****************
--     2000   2999   1682 ****************************************
--     3000   3999   1624 ***************************************
--     4000   4999   1543 *************************************
--     5000   5999   1905 *********************************************
--     6000   6999   2691 ****************************************************************
--     7000   7999   2904 **********************************************************************
--     8000   8999   2609 **************************************************************
--     9000   9999   1946 **********************************************
--    10000  10999   1280 ******************************
--    11000  11999    733 *****************
--    12000  12999    397 *********
--    13000  13999    181 ****
--    14000  14999    109 **
--    15000  15999     38
--    16000  16999      9
--    17000  17999      4
--    18000  18999      2
--    19000  19999      0
--    20000  20999      0
--    21000  21999      0
--    22000  22999      1
--    23000  23999      0
--    24000  24999      0
--    25000  25999      1
----------------------------------------
-- Starting command on Sun Nov  4 22:24:52 2018 with 55232.818 GB free disk space

    cd correction/0-mercounts
    ./meryl-configure.sh \
    > ./meryl-configure.err 2>&1

-- Finished on Sun Nov  4 22:24:52 2018 (fast as lightning) with 55232.818 GB free disk space
----------------------------------------
--  segments   memory batches
--  -------- -------- -------
--
--  For 20365 reads with 140042151 bases, limit to 1 batch.
--  Will count kmers using  jobs, each using  GB and 4 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to find the number of jobs in 'correction/0-mercounts/meryl-count.sh'.
ABORT:
########################### Job Execution History #############################
JobId:66073.flm1
UserName:taylorwass
GroupName:qris-uq
JobName:canu_ecoli5
SessionId:18951
ResourcesRequested:mem=100gb,ncpus=1,place=free,walltime=24:00:00
ResourcesUsed:cpupercent=100,cput=00:00:05,mem=1708kb,ncpus=1,vmem=115256kb,walltime=00:00:06
QueueUsed:Short
AccountString:UQ-AIBN
ExitStatus:1

Thanks!

brianwalenz commented 6 years ago

I've attached a patched file that should fix this, but, unfortunately, I have no way to test. Uncompress it in src/pipelines/canu/, then 'make' in src/ to install it.

To restart, remove the correction/0-mercounts directory (there are only some shell scripts in there right now) and then rerun the same canu command.

Execution.pm.gz
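
The restart step described above (remove correction/0-mercounts, then rerun the same canu command) could be wrapped in a small guard so that nothing else is deleted by accident. This is only a sketch: `reset_mercounts` is a hypothetical helper, not part of Canu, and its argument is assumed to be the assembly directory that was given to canu.

```shell
# reset_mercounts: hypothetical helper, NOT part of Canu.
# Removes only correction/0-mercounts under the given assembly directory,
# so the next canu run regenerates the meryl scripts.
reset_mercounts() {
    asm_dir=$1
    if [ -z "$asm_dir" ]; then
        echo "usage: reset_mercounts <assembly-dir>" >&2
        return 2
    fi
    target="$asm_dir/correction/0-mercounts"
    if [ ! -d "$target" ]; then
        echo "nothing to remove: $target does not exist" >&2
        return 1
    fi
    rm -rf -- "$target"
    echo "removed $target; now rerun the same canu command"
}
```

For the run in this issue it would be called as `reset_mercounts ecoli5-oxford` from the directory containing the assembly, followed by the original canu command.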

manabanana commented 6 years ago

Hi! I have also gotten the same error as c4derpillar ("failed to find the number of jobs in 'correction/0-mercounts/meryl-count.sh'") with v1.8, but I don't get the error with v1.7.1 using the exact same canu commands. I tried to use the Execution.pm you provided above (thank you!), but unfortunately, when I try to install v1.8 from source I get the error "make: *** No rule to make target `install'.", and it creates the Linux-amd64/bin folder but it is empty; I don't get this error when I install v1.7.1 from source though. If this is an unrelated issue and should be listed as a new issue, please let me know.

brianwalenz commented 6 years ago

@manabanana does this occur when running just 'make'? I think you're running 'make install', which isn't needed (or supported). If it's still failing, yes, please do make a new issue. But if it works, and the patch works, hooray!

manabanana commented 6 years ago

@brianwalenz Unfortunately, the installation does not occur when I just use 'make' either. It creates the Linux-amd64/bin folder but the bin folder is empty. I will start a new issue.

c4derpillar commented 6 years ago

Thanks for your help Brian, it is now submitting the jobs, but still fails at a later stage of Meryl.

It is now creating two output files: meryl-count.jobSubmit-01.out and ecoli.ms16.config.01.out

EDIT:

It looks like this is the issue now. This happens even though I delete the 'ecoli-oxford' folder every time before re-running the command. The meryl-count.sh file is there, and the config file has detected the working directory correctly:

########################### Execution Started #############################
JobId:66424.flm1
UserName:taylorwass
GroupName:qris-uq
ExecutionHost:fl018
###############################################################################
pbs_mom, exec of ./meryl-count.sh failed with error: No such file or directory
########################### Job Execution History #############################
JobId:66424.flm1
UserName:taylorwass
GroupName:qris-uq
JobName:meryl_ecoli
SessionId:18926
ResourcesRequested:mem=100gb,ncpus=4,place=free,walltime=20:00:00
ResourcesUsed:cpupercent=0,cput=00:00:00,mem=0kb,ncpus=4,vmem=0kb,walltime=00:00:02
QueueUsed:Short
AccountString:UQ-AIBN
ExitStatus:254
###############################################################################

As far as I can tell, everything is fine in ecoli.ms16.config.01.out, and a third job is started, however it ends after 2 seconds and I am unable to find its output/error files. After waiting an hour, no new files are appearing in canu-logs or in 0-mercounts.

Unfortunately our PBS setup does not allow for searching active/historical jobs by username, so it is a bit hard to keep track of what is running/what stage it is failing at.

Thank you!

cgjosephlee commented 5 years ago

@brianwalenz bf5a93b fixed correction/0-mercounts/meryl-configure.sh, but broke other commands, including the subsequent correction/0-mercounts/meryl-count.sh and the master script canu-scripts/canu-01.sh. It failed to cd to the working directory, and

rm -f canu.out
ln -s canu-scripts/canu.01.out canu.out

created an empty link under my $HOME. I guess that $PBS_ARRAY_INDEX cannot suit every command.
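
The general failure mode here (a script that carries on after a failed cd) can be guarded against as sketched below. This is not Canu's actual code: `relink_log` is a hypothetical helper that refuses to touch canu.out unless the cd succeeded. Under PBS, jobs start in $HOME by default and PBS_O_WORKDIR holds the directory qsub was run from; whether Canu's generated scripts should rely on that variable is not settled in this thread.

```shell
# relink_log: hypothetical helper, NOT taken from Canu.
# The body runs in a subshell, so the caller's working directory is untouched.
relink_log() (
    workdir=$1
    # Stop before rm/ln if the directory is missing or the argument is empty.
    [ -n "$workdir" ] && cd "$workdir" 2>/dev/null || {
        echo "cannot cd to '$workdir'; leaving canu.out alone" >&2
        exit 1
    }
    rm -f canu.out
    ln -s canu-scripts/canu.01.out canu.out
)
```

Inside a PBS job it could be called as `relink_log "${PBS_O_WORKDIR:-$PWD}"`, which falls back to the current directory when the script is run outside the scheduler.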

mmokrejs commented 5 years ago

I have the same

ABORT:   failed to find the number of jobs in 'unitigging/0-mercounts/meryl-count.sh'.

problem with PBSpro here. I start the job from within the working directory so IMO the issue is elsewhere.

In the cwd I have many tt_16D1C3L12.ms22.config.*.out files, each ending with:

Don't know what to do with '../../tt_16D1C3L12.seqStore'.

The files contain just the general help text on how to call meryl, but NOT the actual (broken) command with its arguments, so I am blind.
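
With many *.config.*.out files in one directory, a loop like the following can list the ones that contain the message. It is only a convenience sketch: `find_meryl_usage_errors` is a hypothetical name, and the file-name pattern is assumed from the names quoted in this thread.

```shell
# find_meryl_usage_errors: hypothetical helper, NOT part of Canu.
# Prints the name of every *.config.*.out file in the given directory
# (default: current directory) that contains meryl's
# "Don't know what to do with" message.
find_meryl_usage_errors() {
    dir=${1:-.}
    for f in "$dir"/*.config.*.out; do
        # Skip the unexpanded pattern when no file matches.
        [ -f "$f" ] || continue
        if grep -q "Don't know what to do with" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `find_meryl_usage_errors .` in the working directory prints one matching file name per line and nothing when no file contains the message.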

Also, the ./unitigging/0-mercounts/meryl-count.sh file contains no value as an argument to memory=:

#  And compute.

/scratch/work/project/bio/canu-1.8/Linux-amd64/bin/meryl k=22 threads=8 memory= \
  count \
    segment=$jobid/ ../../tt_16D1C3L12.seqStore \
    output ./tt_16D1C3L12.$jobid.meryl.WORKING \
&& \
mv -f ./tt_16D1C3L12.$jobid.meryl.WORKING ./tt_16D1C3L12.$jobid.meryl

exit 0

I used gridEngine="pbspro" gridOptions="-A xx-xx -q qlong" gridEngineThreadsOption="-l select=1:ncpus=THREADS,walltime=144:00:00" useGrid=True as options to canu-1.8.

BTW, the documentation is insufficient. I allocated 10 nodes with 24 CPUs each. Canu top-level process properly recorded:

-- Detected 24 CPUs and 126 gigabytes of memory.
-- Detecting PBSPro resources.
-- 
-- Found   1 host  with   8 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  28 cores and  504 GB memory under PBSPro control.
-- Found  12 hosts with   8 cores and  247 GB memory under PBSPro control.
-- Found 1007 hosts with  24 cores and  125 GB memory under PBSPro control.

But should the gridEngineThreadsOption option define values to execute a "task" on a single node, or on all 10 of those nodes? How does that correlate with the resources actually assigned to my PBS job?

$ qstat -f 8864969.isrv5
...
    Resource_List.ncpus = 240
    Resource_List.nodect = 10
    Resource_List.select = 10:ncpus=24
    Resource_List.walltime = 144:00:00

spock commented 5 years ago

Same here. Traced the problem back to Meryl.pm, where in this line

    print STDERR "--  Will count kmers using $merylSegments jobs, each using $merylMemory GB and $thr threads.\n";

both $merylMemory and $merylSegments are still undefined. This leads to memory= without value when the meryl-count.sh is generated, probably from here:

        print F "$bin/meryl -C k=$merSize threads=$thr memory=$mem \\\n";

Will give 1.7 a try, thanks for mentioning that it works.
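
To check quickly whether a generated script is affected, one can test for a memory= argument that has no value. A minimal sketch, assuming the script looks like the meryl-count.sh excerpt quoted earlier in this thread; `has_empty_memory` is a hypothetical name, not a Canu tool.

```shell
# has_empty_memory: hypothetical helper, NOT part of Canu.
# Exits 0 when the file contains 'memory=' followed directly by
# whitespace or end of line, i.e. an argument with no value.
has_empty_memory() {
    grep -Eq 'memory=([[:space:]]|$)' "$1"
}
```

For example, `has_empty_memory unitigging/0-mercounts/meryl-count.sh && echo "memory= has no value"` prints the message only for a script generated with the undefined variable.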

brianwalenz commented 5 years ago

Potentially fixed in e13467a0ada171b9f70dd9ea615452cd707ea0ac and df2dbf1df0fa8fd0a98a6f281dafcb131f57dc64.

marbl / canu

Failed to find the number of jobs (Meryl) #1138