marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Enquiry for useGrid=remote #528

Closed zx0223winner closed 7 years ago

zx0223winner commented 7 years ago

Hi Canu support staff,

I followed the Canu FAQ entry on 'Failed to submit batch jobs', but I still have problems running my program.

What I did was try setting useGrid=false and then useGrid=remote to see what would happen. After trying both, it seems our cloud service cannot support useGrid=false, so I have to use the other option and submit the jobs manually, as in the following report:

> Please run the following commands to submit jobs to the grid for execution using 5 gigabytes memory and 4 threads:
>
>     /gwork/xzha25/ecoli-auto/unitigging/0-mercounts/meryl.jobSubmit.sh
>
> When all jobs complete, restart canu as before.

So I did it as follows on the command line: sqsub -r 30m -q threaded --mpp 5g --tpp 4 -o outfile4 sh /gwork/xzha25/ecoli-auto/correction/1-overlapper/precompute.jobSubmit.sh

Nothing happened, so I am wondering whether it was right to execute meryl.jobSubmit.sh by putting sh before the path; I only went by a Google search.

Thanks for debugging!

skoren commented 7 years ago

The shell script itself contains the command to submit the job. Given that your grid engine isn't supported (Canu doesn't use/know about sqsub), I am surprised Canu even tried to use the grid or recognized how to run jobs. What command line did you use to run Canu?

I assume that auto-generated commands are not going to work. You would have to edit the shell script by hand to make it a valid submit command for your grid, then

cd /gwork/xzha25/ecoli-auto/correction/1-overlapper/
sh precompute.jobSubmit.sh

which will submit the jobs using your updated commands. Looking at the online documentation, I don't see a way to submit array jobs, which Canu uses extensively, so you'd have to modify any jobs that have a range into individual jobs for every step. You're better off running on a single large node with useGrid=false, or using one of the supported engines if you have access (SGE, Slurm, PBS, Torque).

zx0223winner commented 7 years ago

OK, thanks for your suggestions; I will try and keep you posted on the results. I used this command line: canu -p ecoli -d ecoli-auto genomeSize=4.8m useGrid=remote -pacbio-raw ecoli_p6_25x.filtered.fastq

skoren commented 7 years ago

What does the Canu preamble report regarding resources/grid engine?

zx0223winner commented 7 years ago

outfile3.txt

zx0223winner commented 7 years ago

outfile4.txt. Another one looks like this:

-- Starting command on Fri Jun 9 16:00:44 2017 with 80803.74 GB free disk space

cd /gwork/xzha25/ecoli-auto
qsub \
  -j oe \
  -d `pwd` \
  -W depend=afteranyarray:10307788[].orc-admin2.orca.sharcnet \
  -l mem=8g \
  -l nodes=1:ppn=1   \
  -N 'canu_ecoli' \
  -o canu-scripts/canu.01.out canu-scripts/canu.01.sh

10307789.orc-admin2.orca.sharcnet

It seems the command file (canu.01.sh) has not been executed, i.e., it failed to submit the batch jobs. So I guess it is due to the grid.

skoren commented 7 years ago

Based on this you also have a PBS grid engine available; does your system support PBS emulation? The failed job is the first use of java, so I'm going to guess your JVM is not properly configured. What is the output in the precompute.000001.out file?

zx0223winner commented 7 years ago

precompute.000001.out.txt

Running job 1 based on command line options. Dumping reads from 1 to 9000 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 6291456KB object heap

Mhap failed.

So what does this mean?

skoren commented 7 years ago

Your second log is the same as the one you posted previously in issue #520, which shows no errors. Have you actually checked whether the jobs are running on your grid? There should be two jobs, one with ID 10307789 and a second with ID 10307788. The job should write to canu-scripts/canu.01.out, so post that file.

zx0223winner commented 7 years ago

The canu-scripts/canu.01.out file is empty.

skoren commented 7 years ago

Are you running these two at the same time? If you run in the same folder simultaneously, they will collide and cause errors. You also seem to have two Canu versions installed at the same time; stick to only one, and run one job at a time, either on grid or off grid. Also, I don't want to track which is which in one issue, so ignoring your grid run (outfile4.txt) because it had no errors: your local run is failing to reserve 6 GB of RAM. The machine you are running it on has 126 GB of RAM, so there should certainly be 6 GB available.

Are you reserving the 6 GB of RAM on this machine? Does the machine have 6 GB of RAM free when you run Canu?

This is most likely a JVM/machine configuration issue not a Canu issue.

zx0223winner commented 7 years ago

OK, thanks. I will do as you said and try again. Will keep you posted if I make any progress.

skoren commented 7 years ago

You need to start from scratch. To get a clean run, remove the current run folder completely (/gwork/xzha25/ecoli-auto), then run qstat and post the output to confirm you do not have jobs pending execution on your grid. Otherwise you will keep having issues from straggler jobs trying to run and competing with a local job, and I can't help if you switch between useGrid=false/useGrid=true/useGrid=remote all in one issue without first completing one run successfully.

Looking through your outfile3.txt, it looks like you requested only 4 GB of RAM for the job but did not tell Canu to stay within 4 GB:

              job id: 10312644
         exit status: 1
            cpu time: 1s / 900s (0 %)
        elapsed time: 14s / 900s (1 %)
      virtual memory: 587.6M / 4.0G (14 %)

Once you have done the above to start from scratch, run the same command you did to generate outfile3.txt but this time set maxMemory=4.

zx0223winner commented 7 years ago

I did have two Canu versions running; now I have deleted them, installed one, and set the new PATH.

I did run the program in /work; now I have changed to the /scratch directory.

I do have some jobs queued like this, but I don't know if it matters or how to kill them all. (screenshot attached, 2017-06-13 4:51 pm)

zx0223winner commented 7 years ago

I used the command sqkill -a and it shows me this:

[xzha25@orc130 xzha25]$ sqkill -a
qdel: Unknown Job Id 10301874.orc-admin2.orca.sharcnet

skoren commented 7 years ago

Yes, you need to kill those jobs. You can see from your outfile4.txt run that the jobs are all waiting in Q, which shows, as I said, that there is no error in that run: Canu properly submitted jobs to your grid, but they are not getting scheduled. Why they aren't getting scheduled is a question for your sysadmin; there may be required parameters or flags you have to pass through Canu to get a job scheduled.

As for sqkill, again that's your sysadmin/grid environment; I haven't ever used your grid engine, so I can only guess based on online docs. I would try sqkill -q and sqkill -r. As long as sqjobs gives you an empty list of jobs (or all in D), you're OK. Now, the jobs like 10313091 look like you submitted Canu yourself; what command did you use for those? We can use that as a template to start a new run on a single machine (with useGrid=false).

zx0223winner commented 7 years ago

This is my job submitting command:

sqsub -r 30m -q threaded --mpp 16g --tpp 16 -o outfile  canu -p ecoli -d ecoli-auto maxMemory=4 genomeSize=4.8m  useGrid=false -pacbio-raw ecoli_p6_25x.filtered.fastq

"there may be required parameters or flags to submit a job you have to pass to Canu" Can you elaborate on this?

skoren commented 7 years ago

Based on the docs I found here: https://www.sharcnet.ca/help/index.php/Sqsub

you probably want:

sqsub -r 3h -q threaded -n 16 --mpp 16G --tpp 16 -o outfile canu -p ecoli -d ecoli-auto maxMemory=8 maxThreads=16 genomeSize=4.8m useGrid=false -pacbio-raw ecoli_p6_25x.filtered.fastq

zx0223winner commented 7 years ago

Ok, hopefully it works; I will keep you posted later on. Now I have to figure out how to empty the job list.

zx0223winner commented 7 years ago

It looks good this time, but no assembly result has been found yet. Maybe it's because the job list has not emptied or the jobs are not all in D. outfile.txt

skoren commented 7 years ago

I am not sure why there is an error in the beginning of the file:

-- Canu release v1.5
-- Detected Java(TM) Runtime Environment '1.8.0_73' (from 'java').
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 31 gigabytes of memory.
-- Limited to 4 gigabytes from maxMemory option.
-- Detected PBS/Torque '' with 'pbsnodes' binary in /opt/sharcnet/torque/2.5.13/bin/pbsnodes.
-- Grid engine disabled per useGrid=false option.
--
-- DEBUG
-- DEBUG  Limited to 4 GB memory via maxMemory option
-- DEBUG
-- DEBUG Have 1 configurations; largest memory size 4 GB; most cores 24:
-- DEBUG   class0 - 1 machines with 24 cores with 4GB memory each.
-- DEBUG
--
-- Task cor can't run on any available machines.
-- It is requesting 6-16 GB memory and 1-2 threads.
-- See above for hardware limits.
--
================================================================================
Don't panic, but a mostly harmless error occurred and Canu stopped.

but the rest of the run looks OK and has been running for about 10 minutes as of 5:57pm based on your log so it's just not done yet:

-- Starting concurrent execution on Tue Jun 13 17:57:57 2017 with 157877.926 GB free disk space (3 processes; 8 concurrently)

    cd correction/2-correction
    ./correctReads.sh 1 > ./correctReads.000001.out 2>&1
    ./correctReads.sh 2 > ./correctReads.000002.out 2>&1
    ./correctReads.sh 3 > ./correctReads.000003.out 2>&1

The job should not be in the D state but in R.

zx0223winner commented 7 years ago

OK, makes sense. I should wait longer before submitting the next command. As for the error you found, I guess it is because I submitted another job with the same outfile.

skoren commented 7 years ago

There is no next command, this will run the full assembly and generate the final outputs. You should not run any other commands until this job is in the D state or not visible in the job scheduler otherwise you will corrupt this run.

zx0223winner commented 7 years ago

Ok, gotcha. By the way, if I have several input fastq files, how can I use Canu to assemble them all at once instead of running every single fastq file separately on the command line?

skoren commented 7 years ago

You can use wildcards (-pacbio-raw *.fastq) or specify multiple -pacbio-raw options. You should not run each file separately as that will not use all the evidence for correction.

zx0223winner commented 7 years ago

Ok, makes sense. Thanks, will let you know when I figure this out.

skoren commented 7 years ago

Your runs on a single machine are working, and jobs were always being submitted (if you had errors due to failure to submit jobs, your logs would report the error "failed to submit batch jobs"), so you should have no need to use useGrid=remote. If you wanted to run a larger genome following this example, you could increase the memory request and tell Canu about the limits:

sqsub -r 3h -q threaded -n 16 --mpp 120G --tpp 16 -o outfile canu -p ecoli -d ecoli-auto maxMemory=112G maxThreads=16 genomeSize=4.8m useGrid=false -pacbio-raw ecoli_p6_25x.filtered.fastq

That should work for medium (100Mb or so) genomes.

zx0223winner commented 7 years ago

Thanks, skoren. This time the output report looks much better than before. outfile.keystep.txt

But it stopped with a failure; I think I am so close.

Don't panic, but a mostly harmless error occurred and Canu stopped.

Canu release v1.5 failed with: can't open 'unitigging/5-consensus/consensus.sh' for writing: Text file busy

brianwalenz commented 7 years ago

Tip: google any and all error messages. It's quite likely someone else had the same problem. The first hit is to: https://stackoverflow.com/questions/16764946/what-generates-the-text-file-busy-message-in-unix which says (in part) "Text file busy error in specific is about trying to modify an executable while it is executing." So, it seems you still have multiple assemblies running in the same place.

zx0223winner commented 7 years ago

Thanks for the tips, I am still working on it. It will take a while to reach a result, though, and I will let you know later if it works.

zx0223winner commented 7 years ago

Thanks skoren and brianwalenz, after terminating the other queued assemblies, Canu is running very well.

Now I have a 250m genome with 40g of PacBio raw data, and I am seeking your advice on the recommended parameters to submit the job. I searched the other issues on the Canu GitHub, but the cloud computing cluster I use only has 24 cores (though unlimited memory). How should I set maxMemory, maxThreads, and mpp in the following?

sqsub -r 5d -q threaded -n 24 --mpp 16G --tpp 24 -o outfile canu -p ecoli -d /orclfs/scratch/xzha25/ecoli-auto maxMemory=8 maxThreads=24 genomeSize=250m useGrid=false -pacbio-raw *.fastq

Attached is the computing cluster info. (screenshot attached, 2017-06-18 12:50 pm)

skoren commented 7 years ago

The info you pasted above seems to indicate the compute nodes have 24 cores and 32 GB of RAM. In that case you need to set --mpp 32G maxMemory=32; otherwise the command would be the same. I am not sure what you mean by unlimited memory.

As for running on the grid like you were trying to do earlier: your cluster uses sqsub/sqbatch/etc. As far as I can tell based on searches, this is a one-off used only on your system (assuming this is your system: https://www.sharcnet.ca/help/index.php/Main_Page). Canu doesn't know how to use sqsub/etc. The sqsub command also seems to lack task support as documented, which means array jobs (required for Canu) aren't supported, so we're not going to support those commands in Canu natively.

From your previous runs it looks like the grid also has PBS/Torque commands/emulation, and Canu was able to submit jobs, though they ended up in the queue forever. This could be because it is submitting multi-core jobs to the default (serial) queue. It's possible the jobs would run if they were submitted to the threaded queue. This might be as simple as specifying the queue (gridOptions="-q threaded"), or there might be something more required to use PBS/Torque. The best option would be to ask your cluster IT group whether this PBS/Torque emulation is officially supported and what minimum job options must be specified for a job to be scheduled. Alternatively, if you have access to this cluster: https://www.sharcnet.ca/help/index.php/Graham, it supports Slurm, which Canu natively recognizes and supports.

zx0223winner commented 7 years ago

Thanks for your detailed reply. Canu works well with the threaded queue, but the job usually sits in the queue for a long time; I guess it's because I request too many resources, which lowers my priority on the cluster.