marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

qsub, -t option #672

Closed TonyMane closed 6 years ago

TonyMane commented 6 years ago

Greetings, I am using Canu 1.6 on a grid (PBS) to assemble a metagenome. I am not entirely sure if this is a Canu-specific question or one that requires some insight into PBS job submission. Regardless, I continually receive a qsub submission error during the assembly process, in the file 'red.jobSubmit-02.out' located in unitigging/unitigging/3-overlapErrorAdjustment/:

'qsub: submit error (Maximum number of jobs already in queue MSG=Job 3382993.dedicated-sched.pace.gatech.edu violates the global server limit of 3000 jobs queued per user)'

So, I added 'gridEngineArrayMaxJobs=2000' to the PBS script and continued the Canu pipeline, restarting at the assembly stage. It looks like the same error was encountered, though this time it appeared in 'red.jobSubmit-03.out'. I also noticed in the 'canu.out' file that two separate job arrays had been started:

-- 'red.jobSubmit-01.sh' -> job 3386904[].dedicated-sched.pace.gatech.edu tasks 1-2000.
-- 'red.jobSubmit-02.sh' -> job 3386905[].dedicated-sched.pace.gatech.edu tasks 2001-4000.

CRASH:
CRASH: Canu snapshot v1.6 +61 changes (r8473 86f53cff1401ce4229d2a579ed093afe68751e0a)
CRASH: Please panic, this is abnormal.

I was under the impression that the -t option provided a maximum number of submissions, which were then distributed equally.

I am wondering if anyone has had a similar issue and, if so, which parameter they modified.

thanks!

skoren commented 6 years ago

The issue is that Canu always submits all of its array jobs at once, but that exceeds the maximum number of jobs allowed in a queue on your grid. There isn't an easy way to work around this within Canu (see #461). The best option would be to increase both red and oea memory (redMemory=32 oeaMemory=32), which will decrease the number of jobs. Remove the unitigging/unitigging/3-overlapErrorAdjustment folder first; a sketch is below.
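
A minimal sketch of that restart (untested; <assembly-dir>, <prefix>, and <original options> are placeholders for the values from your own run):

# remove the partially finished stage so it is recomputed with the new settings
rm -rf <assembly-dir>/unitigging/unitigging/3-overlapErrorAdjustment
# rerun the same canu command with the larger per-job memory; this yields fewer, larger jobs
canu -p <prefix> -d <assembly-dir> <original options> redMemory=32 oeaMemory=32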

On another note, you're running a tip release from when we were making some large-scale changes, so it may not be stable. I'd suggest using either the release or the latest from tip as of now (rough install steps below). However, you'd have to start the assembly from scratch, because switching versions in the middle of an assembly isn't supported.
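
Roughly, per the Canu README of that era (adjust the -j core count to your machine; release tarballs are at https://github.com/marbl/canu/releases):

git clone https://github.com/marbl/canu.git
cd canu/src
make -j 8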

TonyMane commented 6 years ago

Greetings, I re-installed Canu per your suggestion and then restarted the assembly. The program crashed again, with a memory error in unitigging/4-unitigger/unitigger.err:

==> PARAMETERS.

Resources:
  Memory                32 GB
  Compute Threads       4 (command line)

Lengths:
  Minimum read          0 bases
  Minimum overlap       500 bases

Overlap Error Rates:
  Graph                 0.105 (10.500%)
  Max                   0.105 (10.500%)

Deviations:
  Graph                 6.000
  Bubble                6.000
  Repeat                3.000

Edge Confusion:
  Absolute              2100
  Percent               200.0000

Unitig Construction:
  Minimum intersection  500 bases
  Maxiumum placements   2 positions

Debugging Enabled:
  (none)

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Using 1951114 reads, no minimum read length used.

OverlapCache()-- limited to 32768MB memory (user supplied).

OverlapCache()--      14MB for read data.
OverlapCache()--      74MB for best edges.
OverlapCache()--     193MB for tigs.
OverlapCache()--      52MB for tigs - read layouts.
OverlapCache()--      74MB for tigs - error profiles.
OverlapCache()--    8192MB for tigs - error profile overlaps.
OverlapCache()--       0MB for other processes.
OverlapCache()-- ---------
OverlapCache()--    8638MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()--      37MB for overlap store structure.
OverlapCache()--   24092MB for overlap data.
OverlapCache()-- ---------
OverlapCache()--   32768MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 4096 overlaps/read, based on 2048.23x coverage.
OverlapCache()-- Initial guess at 809 overlaps/read.
OverlapCache()--
OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.
[abertagnolli3@biocluster-6 4-unitigger]$ pwd
/nv/hp10/abertagnolli3/scratch/data/m54200_170628_222337_trap/unitigging/4-unitigger

So, we set 'minMemory=50g' (oeaMemory=32 and redMemory=32 were already at these values for the previous run). However, we got a similar result:
==> PARAMETERS.

Resources:
  Memory                50 GB
  Compute Threads       4 (command line)

Lengths:
  Minimum read          0 bases
  Minimum overlap       500 bases

Overlap Error Rates:
  Graph                 0.105 (10.500%)
  Max                   0.105 (10.500%)

Deviations:
  Graph                 6.000
  Bubble                6.000
  Repeat                3.000

Edge Confusion:
  Absolute              2100
  Percent               200.0000

Unitig Construction:
  Minimum intersection  500 bases
  Maxiumum placements   2 positions

Debugging Enabled:
  (none)

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Using 1951114 reads, no minimum read length used.

OverlapCache()-- limited to 51200MB memory (user supplied).

OverlapCache()--      14MB for read data.
OverlapCache()--      74MB for best edges.
OverlapCache()--     193MB for tigs.
OverlapCache()--      52MB for tigs - read layouts.
OverlapCache()--      74MB for tigs - error profiles.
OverlapCache()--   12800MB for tigs - error profile overlaps.
OverlapCache()--       0MB for other processes.
OverlapCache()-- ---------
OverlapCache()--   13246MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()--      37MB for overlap store structure.
OverlapCache()--   37916MB for overlap data.
OverlapCache()-- ---------
OverlapCache()--   51200MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 4096 overlaps/read, based on 2048.23x coverage.
OverlapCache()-- Initial guess at 1273 overlaps/read.
OverlapCache()--
OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.

It would seem that the memory requirement continues to increase to just above the amount set: 13246 + 37 + 37916 = 51199 MB in the run where minMemory was set to 50g, and 8638 + 37 + 24092 = 32767 MB in the first run. Shouldn't the memory required stay the same? Also, in the tip version used for the previous runs we ran into the same issue; however, that version did produce '.trimmed.fasta.gz' and 'cleaned.fasta.gz' files, and the latter could be used to restart an assembly. For the version we just installed (v1.6), these files are not produced (I looked through several, but not all, directories). What do we point 'pacbio-cleaned' to?
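
Checking the sums from the two logs (values in MB, taken from the output above):

echo $((8638 + 37 + 24092))     # first run, 32768 MB allowed  -> 32767
echo $((13246 + 37 + 37916))    # minMemory=50g run, 51200 MB allowed -> 51199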

Thanks,

-Tony

skoren commented 6 years ago

First, regarding your trimmed.fasta.gz files: Canu never produces a 'cleaned' file; it makes a correctedReads.fasta.gz and a trimmedReads.fasta.gz in the top level of the assembly directory. The 1.6 release will make these. The latest tip will not make them by default, since the reads are stored in the gatekeeper store instead, unless you specify saveReads=true. So I suspect your most recent install is actually the tip, not the release.

As for memory, see issue #634. Because the genome size is set low, Canu defaults to a lower amount of memory for the unitigger as well, which may not be enough to load the overlaps; the minimum you set may not be enough for the unitigger either. I'd recommend setting bogart memory higher than 100gb; batMemory=200 or batMemory=150 should be ok.
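
For example, a rerun might look like this (untested sketch; <prefix>, <assembly-dir>, and <your usual options> are placeholders for the values from your runs, and saveReads=true only matters on the tip, if you also want the read fasta files written out):

canu -p <prefix> -d <assembly-dir> <your usual options> batMemory=200 saveReads=true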

TonyMane commented 6 years ago

Thanks for the support. A few more questions. The 'tip' version will print the statement below when asked for its version:

/nv/hp10/abertagnolli3/scratch/data/canu/Linux-amd64/bin/canu -version
Canu snapshot v1.6 +61 changes (r8473 86f53cff1401ce4229d2a579ed093afe68751e0a)

and the stable version, the one you would suggest using, will look like this:

/nv/hp10/abertagnolli3/scratch/data/canu-master/Linux-amd64/bin/canu -version
Canu 1.6

Is that correct?

Second, we started the pipeline from scratch using what I think is the tip version. I believe it is now having a memory issue in the mhap precompute step. Can we increase this with 'mhapMemory=5g'? I have attached the output of the first precompute job, 'precompute.1.out-1' (there were 9 total, all with similar output; the amount of memory listed differed across the 9).

Thanks again for your support!

canu_metagenome_mostRecent.pbs.txt

precompute.1.out-1.txt

skoren commented 6 years ago

Yes on the versions. The tip was unstable after 1.6, so unless you're on the latest as of this week I wouldn't use it.

Can you provide the canu preamble and the precompute.jobSubmit-01.sh? I want to see how much memory is being requested. Based on your setting of minMemory=150, I'd guess it is trying to use 150gb, while the error reports it only had 50gb in resident memory at the time the allocation failed. So I would guess the machine was overloaded and didn't have enough memory to satisfy the 150gb request, which would be a grid issue. I wouldn't set minMemory that high, or at all.
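
For a quick check, the jobSubmit wrappers embed the qsub call, so something like this should show the resource request (file name from this thread):

grep -- '-l ' precompute.jobSubmit-01.sh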

TonyMane commented 6 years ago

Yes, I believe this is the preamble: canu-most-recent.log.txt. And the precompute.jobSubmit script: precompute.jobSubmit-01.sh.txt

skoren commented 6 years ago

It is requesting 150gb of memory and, based on the output, is significantly below that when it tries to reserve more, so I am not sure why the JVM is failing. Usually a JVM failure like that is caused by memory overload on the system. The tip hasn't changed from 1.6 in the way it runs the mhap step, so if your asm ran earlier it should still work. I don't commonly use PBS, so I'm not sure how its reservation system works: when you specify -l mem=150g -l nodes=4:ppn=20, does it mean that 150gb is spread over 4 nodes? That would be an issue, since Canu's jobs are always single-node multi-core, so you shouldn't need that -l nodes=4 option at all. I'd suggest the following command line:

canu -p recent -d /nv/hp10/abertagnolli3/scratch/data/m54200_170628_222337_recent genomeSize=5m -pacbio-raw /nv/hp10/abertagnolli3/scratch/data/m54200_170628_222337.fastq gnuplotTested=true gridEngineArrayMaxJobs=2000 useGrid=1 gridOptions='-q microbio-1 -t 2000' java='/usr/local/pacerepov1/java/1.8.0_25/bin/java' corMinCoverage=0 corOutCoverage=all corMhapSensitivity=high correctedErrorRate=0.105 oeaMemory=32 redMemory=32 batMemory=200 
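
On the PBS side, if the goal is 150gb on one machine, the request would be single-node, e.g. (Torque/PBS syntax assumed; adjust ppn and mem to your cluster):

#PBS -l nodes=1:ppn=20
#PBS -l mem=150gb
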
TonyMane commented 6 years ago

Yes, based on my rudimentary understanding of PBS submissions, and from following the job submissions with qstat, I believe the previous command was distributing 150gb over 4 nodes.

TonyMane commented 6 years ago

THANKS!

TonyMane commented 6 years ago

Greetings, our assembly finally finished, and we are trying to assess its quality. It would appear that the largest contig is 50,000 bp.