That sounds a whole lot like the machines are out of memory. Can you log in and check swap usage and/or paging rate? While you're there, grab the output of top; I'd like to see the memory usage of these jobs.
If they are paging, the correct solution is to decrease mhapMemory to 30 GB or so, but this will change the layout of the jobs and you'll lose all previous work. Instead, increase mhapThreads to 13 to get only 3 jobs per node.
I can't think of what else could cause this. The mhap jobs aren't heavy consumers of I/O. They do a large data load at the start, then (possibly) stream inputs (these might be loaded at the start) and write outputs. All work is done in correction/1-overlapper, the data loaded comes from blocks/ (via symlinks in queries), and the results are written into results/. After the mhap java compute is done, results are rewritten into a format the rest of canu can understand.
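For example (a sketch only, assuming the canu invocation quoted elsewhere in this thread), re-running canu with the same command plus the option added should make each resubmitted mhap job request 13 CPUs, so only 3 fit on a 40-core node:

canu -p maize_01DKD2 -d maize_01DKD2-auto \
     genomeSize=2400m \
     mhapThreads=13 \
     -pacbio-raw *.fastq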
@brianwalenz Thanks for looking into it.
I went into the node with the longest-running jobs (5+ days now) and grabbed this information.
Swap Info:
ubuntu@ip-10-30-2-49:~$ free -t
             total       used       free     shared    buffers     cached
Mem:     165056384  164568708     487676       5808     158084  136553016
-/+ buffers/cache:   27857608  137198776
Swap:            0          0          0
Total:   165056384  164568708     487676
ubuntu@ip-10-30-2-49:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 85 0 480724 158204 136559456 0 0 0 0 1 1 1 0 79 20 0
2 85 0 480700 158204 136559456 0 0 0 0 117 156 0 0 63 37 0
0 85 0 480732 158204 136559456 0 0 0 0 211 349 0 0 62 37 0
0 85 0 480764 158204 136559456 0 0 0 0 169 141 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 280 150 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 426 173 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 24 612 258 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 294 178 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 207 139 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 154 165 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 195 209 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 275 147 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 298 200 0 0 62 37 0
0 85 0 480888 158204 136559456 0 0 0 228 221 167 0 0 62 38 0
Top Info:
top - 15:44:05 up 5 days, 14:20, 1 user, load average: 86.50, 86.41, 85.97
Tasks: 1172 total, 2 running, 1170 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 62.5 id, 37.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16505638+total, 16458408+used, 472308 free, 158232 buffers
KiB Swap: 0 total, 0 used, 0 free. 13655950+cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31314 ubuntu 20 0 24624 2624 1144 R 1.0 0.0 0:00.08 top
499 root 20 0 0 0 0 S 0.3 0.0 9:36.88 kworker/18:1
1779 root 20 0 19812 1312 516 S 0.3 0.0 4:11.99 irqbalance
17647 ubuntu 20 0 44.614g 0.024t 10252 S 0.3 15.5 516:41.81 java
1 root 20 0 33640 2852 1360 S 0.0 0.0 0:02.99 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u256:0
8 root 20 0 0 0 0 R 0.0 0.0 0:21.72 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:10.67 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:03.97 rcuos/1
11 root 20 0 0 0 0 S 0.0 0.0 0:05.08 rcuos/2
12 root 20 0 0 0 0 S 0.0 0.0 0:03.83 rcuos/3
13 root 20 0 0 0 0 S 0.0 0.0 0:02.66 rcuos/4
14 root 20 0 0 0 0 S 0.0 0.0 0:01.43 rcuos/5
15 root 20 0 0 0 0 S 0.0 0.0 0:01.92 rcuos/6
16 root 20 0 0 0 0 S 0.0 0.0 0:01.03 rcuos/7
17 root 20 0 0 0 0 S 0.0 0.0 0:00.86 rcuos/8
18 root 20 0 0 0 0 S 0.0 0.0 0:01.21 rcuos/9
19 root 20 0 0 0 0 S 0.0 0.0 0:01.89 rcuos/10
20 root 20 0 0 0 0 S 0.0 0.0 0:02.08 rcuos/11
21 root 20 0 0 0 0 S 0.0 0.0 0:02.02 rcuos/12
22 root 20 0 0 0 0 S 0.0 0.0 0:01.68 rcuos/13
23 root 20 0 0 0 0 S 0.0 0.0 0:01.49 rcuos/14
24 root 20 0 0 0 0 S 0.0 0.0 0:01.56 rcuos/15
25 root 20 0 0 0 0 S 0.0 0.0 0:01.34 rcuos/16
26 root 20 0 0 0 0 S 0.0 0.0 0:01.03 rcuos/17
27 root 20 0 0 0 0 S 0.0 0.0 0:04.89 rcuos/18
28 root 20 0 0 0 0 S 0.0 0.0 0:01.42 rcuos/19
29 root 20 0 0 0 0 S 0.0 0.0 0:00.28 rcuos/20
30 root 20 0 0 0 0 S 0.0 0.0 0:00.97 rcuos/21
31 root 20 0 0 0 0 S 0.0 0.0 0:00.23 rcuos/22
32 root 20 0 0 0 0 S 0.0 0.0 0:00.35 rcuos/23
33 root 20 0 0 0 0 S 0.0 0.0 0:00.27 rcuos/24
34 root 20 0 0 0 0 S 0.0 0.0 0:00.18 rcuos/25
35 root 20 0 0 0 0 S 0.0 0.0 0:00.30 rcuos/26
36 root 20 0 0 0 0 S 0.0 0.0 0:00.11 rcuos/27
37 root 20 0 0 0 0 S 0.0 0.0 0:00.13 rcuos/28
38 root 20 0 0 0 0 S 0.0 0.0 0:02.17 rcuos/29
ps Info:
ubuntu@ip-10-30-2-49:~$ ps -ef | grep canu
ubuntu 17647 17641 7 Jul21 ? 08:36:41 java -d64 -server -Xmx39936m -jar /opt/canu/bin/mhap-2.1.2.jar --repeat-weight 0.9 --repeat-idf-scale 10 -k 16 --num-hashes 256 --num-min-matches 3 --threshold 0.8 --filter-threshold 0.000005 --ordered-sketch-size 1000 --ordered-kmer-size 14 --num-threads 10 -s ./blocks/000022.dat -q queries/000086
ubuntu 19683 13394 0 Jul21 ? 00:07:24 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 117000 -q 5031001 -o ./results/000067.mhap.ovb.WORKING ./results/000067.mhap
ubuntu 19730 13390 0 Jul21 ? 00:07:19 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 117000 -q 8073001 -o ./results/000068.mhap.ovb.WORKING ./results/000068.mhap
ubuntu 19754 13389 0 Jul21 ? 00:08:26 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 0 -q 1872001 -o ./results/000066.mhap.ovb.WORKING ./results/000066.mhap
ubuntu 31319 31298 0 15:45 pts/0 00:00:00 grep --color=auto canu
The 3 jobs on this node are 19_[66-68]; it looks like they finished the java compute step on Jul 21, and the reformat step is still running:
-rw-rw-r-- 1 ubuntu ubuntu 48156 Jul 21 16:05 mhap.19_66.out
-rw-rw-r-- 1 ubuntu ubuntu 48092 Jul 21 15:50 mhap.19_68.out
-rw-rw-r-- 1 ubuntu ubuntu 48095 Jul 21 15:47 mhap.19_87.out
-rw-rw-r-- 1 ubuntu ubuntu 48040 Jul 21 15:25 mhap.19_67.out
-rw-rw-r-- 1 ubuntu ubuntu 8555331584 Jul 25 15:52 000066.mhap.ovb.WORKING
-rw-rw-r-- 1 ubuntu ubuntu 32097733339 Jul 21 15:25 000067.mhap
-rw-rw-r-- 1 ubuntu ubuntu 7406092288 Jul 25 13:22 000067.mhap.ovb.WORKING
-rw-rw-r-- 1 ubuntu ubuntu 31871373877 Jul 21 15:50 000068.mhap
-rw-rw-r-- 1 ubuntu ubuntu 7253000192 Jul 25 15:11 000068.mhap.ovb.WORKING
Looking back at the original config, I am not sure why only 5 jobs would be running. The canu config is:
-- Found 5 hosts with 40 cores and 157 GB memory under Slurm control.
--
-- Run under grid control using 39 GB and 10 CPUs for stage 'mhap (cor)'.
That means you can fit 157/39 = 4 jobs per machine, so 20 should be running at a time, not 5. You can see how Canu is submitting these jobs in the 1-overlapper/mhap.jobSubmit.sh file; if you're only getting 5 running, something must be wrong in the scheduling.
Your machines have no swap (based on top)? That generally causes issues under high load. You need to run at about 80% of memory capacity to leave room for OS processes when the machine is fully loaded, so 3 jobs at 39 GB should be fine. Running 4 at a time (at 39 GB each) would use 156 GB plus ~10% OS overhead and would over-subscribe the memory. I think @brianwalenz's suggestion to increase cores should fix the out-of-memory issues.
Also, in your one top example, the machine is running out of memory but the java process is only using 15%. When you sort by memory in top, what else is running/consuming memory on that machine?
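For example, a quick way to get that list outside of top (nothing here is canu-specific):

ps aux --sort=-rss | head -n 20    # processes ordered by resident memory, largest first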
I didn't notice the mhapConvert jobs until now. These are just rewriting the 32 GB ASCII input files into a smaller/different binary format. They should be quick, reading 32 GB and writing (it appears) 8 GB of output. Four days is very, very slow. This could have caused the remaining mhap jobs to stall on writing output.
Are you out of disk space?
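A quick check on one of the nodes might look like this (the path is just wherever the assembly directory lives; iostat needs the sysstat package installed):

df -h /path/to/maize_01DKD2-auto    # free space on the filesystem holding the run
df -i /path/to/maize_01DKD2-auto    # inode usage, occasionally the real culprit
iostat -x 5 3                       # per-device utilization and await times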
@skoren Thanks for taking a look.
There were only 5 jobs because of the default Slurm configuration, which limits each node to running at most one job at a time. That's why I changed the Slurm config, as I mentioned in my original post, to allow more than one job on each node.
This is the output from top sorted by memory:
top - 19:15:13 up 7 days, 17:47, 1 user, load average: 15.02, 14.73, 14.24
Tasks: 669 total, 1 running, 667 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 74.9 id, 25.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16505638+total, 13764060+used, 27415768 free, 156084 buffers
KiB Swap: 0 total, 0 used, 0 free. 11103451+cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23321 ubuntu 20 0 44.614g 0.023t 10708 S 0.0 14.9 829:49.29 java
25895 ubuntu 20 0 263452 249888 193632 D 0.0 0.2 2:29.07 mhapConvert
24910 ubuntu 20 0 235152 221596 193632 D 0.0 0.1 6:51.81 mhapConvert
3716 root 20 0 61524 21516 3964 S 0.0 0.0 3:35.72 nodewatcher
1921 root 20 0 225392 18428 11536 S 0.0 0.0 0:20.88 apache2
3712 root 20 0 50300 13712 1300 S 0.0 0.0 1:21.20 supervisord
28297 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28298 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28299 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28300 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28301 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
38229 root 20 0 105652 4216 3224 S 0.0 0.0 0:00.01 sshd
1531 syslog 20 0 260080 4132 836 S 0.0 0.0 0:01.05 rsyslogd
38336 ubuntu 20 0 21436 3824 1712 S 0.0 0.0 0:00.06 bash
13316 root 20 0 265984 3436 2088 S 0.0 0.0 0:02.77 slurmd
1702 root 20 0 61388 3044 2368 S 0.0 0.0 0:01.14 sshd
1196 root 20 0 10228 2904 600 S 0.0 0.0 0:00.31 dhclient
30417 root 20 0 200604 2728 1860 S 0.0 0.0 0:06.97 slurmstepd
22325 root 20 0 200604 2720 1852 S 0.0 0.0 0:16.13 slurmstepd
23309 root 20 0 200604 2720 1852 S 0.0 0.0 0:15.19 slurmstepd
1 root 20 0 33648 2596 1104 S 0.0 0.0 0:02.96 init
21142 root 20 0 200604 2552 1684 S 0.0 0.0 0:14.56 slurmstepd
1816 nobody 20 0 234620 2348 1148 S 0.0 0.0 2:16.95 gmetad
38356 ubuntu 20 0 24192 2192 1152 R 0.3 0.0 0:04.30 top
3774 munge 20 0 225296 2164 1128 S 0.0 0.0 0:06.53 munged
38334 ubuntu 20 0 105652 1876 896 S 0.0 0.0 0:00.06 sshd
1465 root 20 0 43456 1636 1296 S 0.0 0.0 0:00.00 systemd-logind
3676 nobody 20 0 44964 1528 1060 S 0.0 0.0 3:52.68 gmond
37656 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37698 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37761 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37811 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37864 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37911 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38003 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38068 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38171 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38380 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
937 root 20 0 51240 1284 744 S 0.0 0.0 0:00.12 systemd-udevd
1745 root 20 0 19812 1252 456 S 0.0 0.0 4:46.80 irqbalance
1452 message+ 20 0 39244 1188 808 S 0.0 0.0 0:00.05 dbus-daemon
I noticed that a large chunk of the used memory is cache, which is actually available according to this article: http://www.linuxatemyram.com/. I'm also considering what Brian mentioned about manually increasing the thread requirement so that each node is limited to 3 jobs at a time, but that probably means I'd need to do the same for the other stages as well.
Another solution I have in mind is to configure the scheduler to allocate at most 80% of each node's resources (see the sketch after the quoted line below), but I've had no luck yet, and the bogart stage would request the whole node anyway, so it couldn't be scheduled at all without manually limiting its resource requirement:
-- Run under grid control using 157 GB and 32 CPUs for stage 'bogart'.
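For illustration, one way to approximate the 80% idea would be to advertise less memory to Slurm than the node physically has, so jobs are packed against roughly 128 GB instead of 157 GB (the node name and numbers below are illustrative, and this only takes effect because SelectTypeParameters includes memory):

# slurm.conf fragment -- RealMemory is in MB
NodeName=ip-10-30-2-49 CPUs=40 RealMemory=128000 State=UNKNOWN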
@brianwalenz No, everything reads and writes to AWS EFS, which is an NFS-like service provided by AWS with pretty much unlimited space.
I missed that there was some other slurm config limiting the jobs, so that makes sense.
Given the top output, I don't think memory is an issue, since nothing else on the system is using it. It is strange that one job per machine ran in 7 hours but multiple jobs are taking over 5 days to convert a file. That should be nearly instantaneous since it's just streaming a file and writing it back out, and the same amount of data is generated whether you run 1 job/machine or 4 jobs/machine.
Do these machines have local scratch space? I would edit the mhap.sh file to write output to local scratch if it is available, and perhaps even copy the queries/blocks locally too if there is enough space. See how long jobs take when running on local disk 4 at a time per node. That would eliminate network/NFS issues.
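A rough sketch of the idea (this is not the actual generated mhap.sh; the scratch mount point and the EFS path are assumptions):

SCRATCH=/mnt/local/mhap.$SLURM_JOB_ID       # assumes a local disk is mounted at /mnt/local
mkdir -p $SCRATCH/results
cp -rL blocks queries $SCRATCH/             # optionally stage inputs locally too; -L follows the symlinks in queries/
cd $SCRATCH
# ... run the existing java mhap and mhapConvert commands here, writing into ./results/ ...
cp results/* /efs/assembly/correction/1-overlapper/results/   # copy outputs back to EFS
rm -rf $SCRATCH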
Any update?
Not yet, will try Sergey's suggestion and let you guys know.
Closing, inactive. Please open a new issue if you encounter an error with this assembly/run.
Hi, I'm not sure if this question should be posted here, so please direct me to the correct place if needed.
I'm running Canu release v1.5 on a cluster of 5 AWS EC2 instances with Ubuntu 14.04; each has 40 CPUs and 160 GB memory, and all are connected to AWS EFS. The scheduler is Slurm 16.05.3. The command I ran is basically the following, but with different directories:
canu -p maize_01DKD2 -d maize_01DKD2-auto genomeSize=2400m -pacbio-raw *.fastq
Here is the partial output from canu.*.out:
With the default Slurm configuration, only one job can run at a time on each node, so for my setup 5 jobs run simultaneously. I was at the 'mhap (cor)' stage, and each job took about 7 hours to finish. I'm not sure if that time sounds about right, but it felt slow, and since only one job runs on each node, about 75% of each node's resources go unused. As a result, I changed the Slurm configuration to allow more than one job on each node, basically from
SelectType=select/linear
to
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
Now I have 20 jobs running at the same time, which is expected, but the time each job takes is significantly longer. As you can see from the results, some jobs have been running for almost 5 days.
The CPU usage on these nodes also seems low: 3 of the 5 average around 1%, while the other 2 are at about 5-10%.
Do you have any idea where the problem might be? Also, you mentioned that Canu works pretty well with Slurm, so I'm wondering if you have a recommended Slurm configuration.
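A quick way to see how many jobs Slurm is actually packing onto each node is something like this (plain Slurm commands, not specific to canu):

squeue -t RUNNING -o "%.10i %.8u %.10M %.4C %R"   # job id, user, runtime, CPUs, assigned node
sinfo -N -o "%N %C %e"                            # per-node CPU allocation (A/I/O/T) and free memory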