marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Inquiry on useGrid=remote #1437

Closed ShuChen1986 closed 5 years ago

ShuChen1986 commented 5 years ago

Hi, I was running canu with useGrid=remote as follows:

/public/home/test3/app/canu-1.8/Linux-amd64/bin/canu -correct \
-p 0803 -d canu0803 \
genomeSize=1.2g \
useGrid=remote \
-nanopore-raw /public/home/test3/scsio/Nanopore/Data/N_all_filt.fq

It stopped at the meryl-count.sh step, so I ran meryl-count.sh manually, and it stopped with the following message:

/public/home/test3/scsio/Nanopore/canu0803/correction/0-mercounts/meryl-count.sh: line 105: 146199 Killed /public/home/test3/app/canu-1.8/Linux-amd64/bin/meryl k=16 threads=7 memory=17 count segment=$jobid/01 ../../0803.seqStore output ./0803.$jobid.meryl.WORKING

Then I modified the command above and ran it as follows:

/public/home/test3/app/canu-1.8/Linux-amd64/bin/meryl k=16 threads=28 memory=120 count segment=01/01 ../../0803.seqStore output ./0803.01.meryl.WORKING

It finished with no error message, and I did not know what to run next.

Writing results to './0803.01.meryl.WORKING', using 28 threads.
  wPrefix  10
  wSuffix  22
  nPrefix  1024
  nSuffix  4194304
  sMask    0x00000000003fffff

finishIteration()--
Bye.

So I ran the first canu command line again, and it gave me the following output:

-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'meryl' concurrent execution on Thu Aug  8 10:32:07 2019 with 389303.421 GB free disk space (1 processes; 1 concurrently)

    cd correction/0-mercounts
    ./meryl-count.sh 1 > ./meryl-count.000001.out 2>&1

-- Finished on Thu Aug  8 10:32:08 2019 (one second) with 389303.421 GB free disk space
----------------------------------------
--
-- Kmer counting (meryl-count) jobs failed, tried 2 times, giving up.
--   job 0803.01.meryl FAILED.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

Could you please give me some suggestions on this? Thank you so much!

brianwalenz commented 5 years ago

Quick answer: I'm guessing you didn't rename 0803.01.meryl.WORKING to 0803.01.meryl as the last line of the script does.
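If the counting step was run by hand, that final rename has to be done by hand too. A minimal sketch, with paths taken from the log above (the mkdir lines only recreate the directory layout so the example is self-contained; on the real system only the mv is needed):

```shell
# Recreate the layout from the log for demonstration purposes.
mkdir -p correction/0-mercounts/0803.01.meryl.WORKING

# The rename that the last line of meryl-count.sh performs on success:
cd correction/0-mercounts
mv 0803.01.meryl.WORKING 0803.01.meryl
```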

Longer answer:

In general, you need to run the scripts as-is, no cheating and running the commands by hand. The scripts will get much more complicated.

'Killed' usually indicates the job was killed for exceeding a memory limit imposed by the grid. The original command should have used no more than 17GB memory. How much did you request for the job? How much did canu request (in the JobSubmit script)?

The second time you ran it, you told the command itself to use up to 120GB memory (and more threads), but however you ran this command, it wasn't killed for exceeding memory limits.

ShuChen1986 commented 5 years ago

Hi Brian, thank you very much for your reply. I learned from the documentation that canu automatically detects and computes the resources, so I did not set the number of threads or the memory in the canu command. When I qsub the job, I did not set a memory size; the maximum memory for each node is 125 GB.

#PBS -N canu0803
#PBS -l nodes=1:ppn=28
#PBS -q high

Should I set the thread and memory requests in the canu command and run it again from the start?
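For context, a PBS header that also requests memory explicitly might look like this (the mem value here is hypothetical; pick one at or below the 125 GB node limit):

```shell
#PBS -N canu0803
#PBS -l nodes=1:ppn=28,mem=120gb
#PBS -q high
```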

brianwalenz commented 5 years ago

Yes, it should all be automagic. The JobSubmit scripts will request memory and thread resources via command line options. Do those look appropriate for your grid? If not, you can change them with the gridEngineResourceOption option. The default is:

gridEngineResourceOption="-l nodes=1:ppn=THREADS:mem=MEMORY"

Or set options applied to all grid jobs with gridOptions. You probably need to set:

gridOptions="-q high"
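Put together, the original command with that grid option added might look like this (a sketch reusing the paths from the first post):

```shell
/public/home/test3/app/canu-1.8/Linux-amd64/bin/canu -correct \
  -p 0803 -d canu0803 \
  genomeSize=1.2g \
  useGrid=remote \
  gridOptions="-q high" \
  -nanopore-raw /public/home/test3/scsio/Nanopore/Data/N_all_filt.fq
```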

Finally, and very important: upgrade to the almost-ready-to-be-released v1.9. It has some extremely important fixes for PBS.

> git clone https://github.com/marbl/canu.git
> cd canu/src
> git checkout v1.9
> make -j 8
ShuChen1986 commented 5 years ago

Since the supercomputer I am using does not allow internet access, I will have to build v1.9 on my local computer and upload it to the supercomputer. Hopefully it will work. The new canu -version reports "Canu snapshot v1.8 +299 changes (r9509 8e0c3e911f1af984f0153550eb0faea2379ffa36)" instead of v1.9. Did I get the right version?

brianwalenz commented 5 years ago

That's the correct version. The number doesn't change until I actually make the release. :-(

If you have trouble compiling, you can just tar up the canu source code directory (canu/ in my example), upload that, and compile on the remote machine. Once you've done 'git clone', internet access isn't necessary.
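The transfer itself is just a tar round-trip; a self-contained sketch (using a placeholder directory in place of the real clone):

```shell
# Sketch of the offline route: tar the source tree after cloning, move the
# archive, and unpack it where you want to build.  'canu' here is a stand-in
# directory created for the demo; no network access is needed at this stage.
mkdir -p canu/src
echo 'placeholder source file' > canu/src/main.C

tar -czf canu-src.tar.gz canu

# On the real system this would be an scp/upload; here we just unpack into a
# directory standing in for the remote machine, then build with 'make -j 8'.
mkdir -p remote-machine
tar -xzf canu-src.tar.gz -C remote-machine
ls remote-machine/canu/src
```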

ShuChen1986 commented 5 years ago

I've tried to use v1.9, but it needs glibc 2.14+, and I do not have root access on the server I am using. I tried to build and install glibc 2.14+ in a directory other than root, but it failed with a core dump. Luckily, v1.8 is now working with useGrid=false at an acceptable speed. Strangely, I saw useGrid=false jobs rejected when submitting to the grid, but it is now finishing up the mhap step.

I have a different question, though: I am assembling a plant genome of 1.2 Gb, and I have 30 Gb of PacBio RSII reads and 140 Gb of Nanopore reads. Would you recommend correcting all the reads together, or correcting the PacBio and Nanopore reads separately?

skoren commented 5 years ago

Canu itself has no specific glibc requirement; the compiler/OS determines that. If you're getting glibc errors, it likely means the environment where you compile differs from the one where you run the code. You could try building on your local machine in a virtual environment with an old OS to avoid this (see https://pmelsted.wordpress.com/2015/10/14/building-binaries-for-bioinformatics/), but since you got 1.8 to work, you need not worry about this.

We typically correct all the reads together; given how much more coverage you have of Nanopore than of PacBio, you could also just use the Nanopore reads alone.

ShuChen1986 commented 5 years ago

Thank you for your reply. Indeed, after I compiled v1.9 on the server (with a "make: warning: Clock skew detected. Your build may be incomplete" message), the glibc error went away. But when I submitted the job without useGrid=false, it still failed with the following error:

CRASH: Canu snapshot v1.8 +299 changes (r9509 8e0c3e911f1af984f0153550eb0faea2379ffa36)
CRASH: Please panic, this is abnormal.
ABORT:
CRASH:   Failed to submit compute jobs.
CRASH:
CRASH: Failed at /public/home/test3/app/canu-1.9/Linux-amd64/bin/../lib/site_perl/canu/Execution.pm line 1241.
CRASH:  canu::Execution::submitOrRunParallelJob("ecoli", "meryl", "correction/0-mercounts", "meryl-count", 1) called at /public/home/test3/app/canu-1.9/Linux-amd64/bin/../lib/site_perl/canu/Meryl.pm line 828
CRASH:  canu::Meryl::merylCountCheck("ecoli", "cor") called at /public/home/test3/app/canu-1.9/Linux-amd64/bin/canu line 859
CRASH: 
CRASH: Last 50 lines of the relevant log file (correction/0-mercounts/meryl-count.jobSubmit-01.out):
CRASH:
CRASH: qsub: submit error (Bad UID for job execution MSG=ruserok failed validating test3/test3 from node69)
ShuChen1986 commented 5 years ago

Also, when I submitted jobs using useGrid=false, "exec_host" still shows a node number. Does this mean the job is still executed on a compute node rather than the local machine? Or, as long as the job runs on one node, will useGrid=false work fine?

skoren commented 5 years ago

That error indicates your nodes aren't allowed to submit jobs themselves, which is a requirement for running Canu on the grid (see the FAQ). useGrid=false is the suggested workaround for grids that don't support this feature. See issue #104 for information on how to configure the Torque server to fix the error, assuming your admin allows this.
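For reference, on Torque the server setting usually behind this error can be changed with a one-line qmgr command (admin-only; verify against your site's configuration and issue #104 before applying):

```shell
# Run as the Torque server administrator; allows compute nodes to submit jobs.
qmgr -c 'set server allow_node_submit = true'
```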

useGrid=false is meant to be run on the head node; I'm not sure it will work correctly on a compute node. It won't do heavy compute, though, just bookkeeping: updating state, checking for job success, etc.