That sounds a whole lot like the machines are out of memory. Can you log in and check swap usage and/or paging rate? While you're there, grab the output of top; I'd like to see the memory usage of these jobs.
If they are paging, the correct solution is to decrease mhapMemory to 30 GB or so, but this will change the layout of the jobs and you'll lose all previous work. Instead, increase mhapThreads to 13 to get only 3 jobs per node.
I can't think of what else could cause this. The mhap jobs aren't heavy consumers of I/O. They do a large data load at the start, then (possibly) stream inputs (these might be loaded at the start) and write outputs. All work is done in correction/1-overlapper, the data loaded comes from blocks/ (via symlinks in queries), and the results are written into results/. After the mhap java compute is done, results are rewritten into a format the rest of canu can understand.
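For example (a sketch only, assuming the canu invocation quoted elsewhere in this thread), re-running canu with the same command plus the option added should make each resubmitted mhap job request 13 CPUs, so only 3 fit on a 40-core node:

canu -p maize_01DKD2 -d maize_01DKD2-auto \
     genomeSize=2400m \
     mhapThreads=13 \
     -pacbio-raw *.fastq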
@brianwalenz Thanks for looking into it.
I went into the node with the longest-running jobs (5+ days now) and grabbed this information.
Swap Info:
ubuntu@ip-10-30-2-49:~$ free -t
             total       used       free     shared    buffers     cached
Mem:     165056384  164568708     487676       5808     158084  136553016
-/+ buffers/cache:   27857608  137198776
Swap:            0          0          0
Total:   165056384  164568708     487676
ubuntu@ip-10-30-2-49:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 85 0 480724 158204 136559456 0 0 0 0 1 1 1 0 79 20 0
2 85 0 480700 158204 136559456 0 0 0 0 117 156 0 0 63 37 0
0 85 0 480732 158204 136559456 0 0 0 0 211 349 0 0 62 37 0
0 85 0 480764 158204 136559456 0 0 0 0 169 141 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 280 150 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 426 173 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 24 612 258 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 294 178 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 207 139 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 154 165 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 195 209 0 0 62 38 0
0 85 0 480764 158204 136559456 0 0 0 0 275 147 0 0 63 37 0
0 85 0 480764 158204 136559456 0 0 0 0 298 200 0 0 62 37 0
0 85 0 480888 158204 136559456 0 0 0 228 221 167 0 0 62 38 0
Top Info:
top - 15:44:05 up 5 days, 14:20, 1 user, load average: 86.50, 86.41, 85.97
Tasks: 1172 total, 2 running, 1170 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 62.5 id, 37.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16505638+total, 16458408+used, 472308 free, 158232 buffers
KiB Swap: 0 total, 0 used, 0 free. 13655950+cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31314 ubuntu 20 0 24624 2624 1144 R 1.0 0.0 0:00.08 top
499 root 20 0 0 0 0 S 0.3 0.0 9:36.88 kworker/18:1
1779 root 20 0 19812 1312 516 S 0.3 0.0 4:11.99 irqbalance
17647 ubuntu 20 0 44.614g 0.024t 10252 S 0.3 15.5 516:41.81 java
1 root 20 0 33640 2852 1360 S 0.0 0.0 0:02.99 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u256:0
8 root 20 0 0 0 0 R 0.0 0.0 0:21.72 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:10.67 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:03.97 rcuos/1
11 root 20 0 0 0 0 S 0.0 0.0 0:05.08 rcuos/2
12 root 20 0 0 0 0 S 0.0 0.0 0:03.83 rcuos/3
13 root 20 0 0 0 0 S 0.0 0.0 0:02.66 rcuos/4
14 root 20 0 0 0 0 S 0.0 0.0 0:01.43 rcuos/5
15 root 20 0 0 0 0 S 0.0 0.0 0:01.92 rcuos/6
16 root 20 0 0 0 0 S 0.0 0.0 0:01.03 rcuos/7
17 root 20 0 0 0 0 S 0.0 0.0 0:00.86 rcuos/8
18 root 20 0 0 0 0 S 0.0 0.0 0:01.21 rcuos/9
19 root 20 0 0 0 0 S 0.0 0.0 0:01.89 rcuos/10
20 root 20 0 0 0 0 S 0.0 0.0 0:02.08 rcuos/11
21 root 20 0 0 0 0 S 0.0 0.0 0:02.02 rcuos/12
22 root 20 0 0 0 0 S 0.0 0.0 0:01.68 rcuos/13
23 root 20 0 0 0 0 S 0.0 0.0 0:01.49 rcuos/14
24 root 20 0 0 0 0 S 0.0 0.0 0:01.56 rcuos/15
25 root 20 0 0 0 0 S 0.0 0.0 0:01.34 rcuos/16
26 root 20 0 0 0 0 S 0.0 0.0 0:01.03 rcuos/17
27 root 20 0 0 0 0 S 0.0 0.0 0:04.89 rcuos/18
28 root 20 0 0 0 0 S 0.0 0.0 0:01.42 rcuos/19
29 root 20 0 0 0 0 S 0.0 0.0 0:00.28 rcuos/20
30 root 20 0 0 0 0 S 0.0 0.0 0:00.97 rcuos/21
31 root 20 0 0 0 0 S 0.0 0.0 0:00.23 rcuos/22
32 root 20 0 0 0 0 S 0.0 0.0 0:00.35 rcuos/23
33 root 20 0 0 0 0 S 0.0 0.0 0:00.27 rcuos/24
34 root 20 0 0 0 0 S 0.0 0.0 0:00.18 rcuos/25
35 root 20 0 0 0 0 S 0.0 0.0 0:00.30 rcuos/26
36 root 20 0 0 0 0 S 0.0 0.0 0:00.11 rcuos/27
37 root 20 0 0 0 0 S 0.0 0.0 0:00.13 rcuos/28
38 root 20 0 0 0 0 S 0.0 0.0 0:02.17 rcuos/29
ps Info:
ubuntu@ip-10-30-2-49:~$ ps -ef | grep canu
ubuntu 17647 17641 7 Jul21 ? 08:36:41 java -d64 -server -Xmx39936m -jar /opt/canu/bin/mhap-2.1.2.jar --repeat-weight 0.9 --repeat-idf-scale 10 -k 16 --num-hashes 256 --num-min-matches 3 --threshold 0.8 --filter-threshold 0.000005 --ordered-sketch-size 1000 --ordered-kmer-size 14 --num-threads 10 -s ./blocks/000022.dat -q queries/000086
ubuntu 19683 13394 0 Jul21 ? 00:07:24 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 117000 -q 5031001 -o ./results/000067.mhap.ovb.WORKING ./results/000067.mhap
ubuntu 19730 13390 0 Jul21 ? 00:07:19 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 117000 -q 8073001 -o ./results/000068.mhap.ovb.WORKING ./results/000068.mhap
ubuntu 19754 13389 0 Jul21 ? 00:08:26 /opt/canu/bin/mhapConvert -G ../maize_01DKD2.gkpStore -h 1872001 0 -q 1872001 -o ./results/000066.mhap.ovb.WORKING ./results/000066.mhap
ubuntu 31319 31298 0 15:45 pts/0 00:00:00 grep --color=auto canu
The 3 jobs on this node are 19_[66-68]; it looks like they finished the java compute step on Jul 21, and the reformat step is still running:
-rw-rw-r-- 1 ubuntu ubuntu 48156 Jul 21 16:05 mhap.19_66.out
-rw-rw-r-- 1 ubuntu ubuntu 48092 Jul 21 15:50 mhap.19_68.out
-rw-rw-r-- 1 ubuntu ubuntu 48095 Jul 21 15:47 mhap.19_87.out
-rw-rw-r-- 1 ubuntu ubuntu 48040 Jul 21 15:25 mhap.19_67.out
-rw-rw-r-- 1 ubuntu ubuntu 8555331584 Jul 25 15:52 000066.mhap.ovb.WORKING
-rw-rw-r-- 1 ubuntu ubuntu 32097733339 Jul 21 15:25 000067.mhap
-rw-rw-r-- 1 ubuntu ubuntu 7406092288 Jul 25 13:22 000067.mhap.ovb.WORKING
-rw-rw-r-- 1 ubuntu ubuntu 31871373877 Jul 21 15:50 000068.mhap
-rw-rw-r-- 1 ubuntu ubuntu 7253000192 Jul 25 15:11 000068.mhap.ovb.WORKING
Looking back at the original config, I am not sure why only 5 jobs would be running. The canu config is:
-- Found 5 hosts with 40 cores and 157 GB memory under Slurm control.
--
-- Run under grid control using 39 GB and 10 CPUs for stage 'mhap (cor)'.
That means you can fit 157/39 = 4 jobs per machine, so 20 should be running at a time, not 5. You can see how Canu is submitting these jobs in the 1-overlapper/mhap.jobSubmit.sh file; if you're only getting 5 running, something must be wrong in the scheduling.
Your machines have no swap (based on top)? That generally causes issues under high load. You need to run at about 80% of memory capacity to leave room for OS processes when the machine is fully loaded, so 3 jobs at 39 GB should be fine. Running 4 at a time (at 39 GB each) would use 156 GB plus ~10% OS overhead and would over-subscribe the memory. I think @brianwalenz's suggestion to increase cores should fix the out-of-memory issues.
Also, in your one top example, the machine is running out of memory but the java process is only using 15%. When you sort by memory in top, what else is running/consuming memory on that machine?
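For example, a quick way to get that list outside of top (nothing here is canu-specific):

ps aux --sort=-rss | head -n 20    # processes ordered by resident memory, largest first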
I didn't notice the mhapConvert jobs until now. These are just rewriting the 32 GB ASCII input files into a smaller/different binary format. They should be quick, reading 32 GB and writing (it appears) 8 GB of output. Four days is very, very slow. This could have caused the remaining mhap jobs to stall on writing output.
Are you out of disk space?
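A quick check on one of the nodes might look like this (the path is just wherever the assembly directory lives; iostat needs the sysstat package installed):

df -h /path/to/maize_01DKD2-auto    # free space on the filesystem holding the run
df -i /path/to/maize_01DKD2-auto    # inode usage, occasionally the real culprit
iostat -x 5 3                       # per-device utilization and await times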
@skoren Thanks for taking a look.
There were only 5 jobs because of the default Slurm configuration, which limits each node to running at most one job at a time. That's why I changed the Slurm config, as I mentioned in my original post, to allow more than one job on each node.
This is the output from top sorted by memory:
top - 19:15:13 up 7 days, 17:47, 1 user, load average: 15.02, 14.73, 14.24
Tasks: 669 total, 1 running, 667 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 74.9 id, 25.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16505638+total, 13764060+used, 27415768 free, 156084 buffers
KiB Swap: 0 total, 0 used, 0 free. 11103451+cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23321 ubuntu 20 0 44.614g 0.023t 10708 S 0.0 14.9 829:49.29 java
25895 ubuntu 20 0 263452 249888 193632 D 0.0 0.2 2:29.07 mhapConvert
24910 ubuntu 20 0 235152 221596 193632 D 0.0 0.1 6:51.81 mhapConvert
3716 root 20 0 61524 21516 3964 S 0.0 0.0 3:35.72 nodewatcher
1921 root 20 0 225392 18428 11536 S 0.0 0.0 0:20.88 apache2
3712 root 20 0 50300 13712 1300 S 0.0 0.0 1:21.20 supervisord
28297 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28298 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28299 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28300 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
28301 www-data 20 0 225416 7300 396 S 0.0 0.0 0:00.00 apache2
38229 root 20 0 105652 4216 3224 S 0.0 0.0 0:00.01 sshd
1531 syslog 20 0 260080 4132 836 S 0.0 0.0 0:01.05 rsyslogd
38336 ubuntu 20 0 21436 3824 1712 S 0.0 0.0 0:00.06 bash
13316 root 20 0 265984 3436 2088 S 0.0 0.0 0:02.77 slurmd
1702 root 20 0 61388 3044 2368 S 0.0 0.0 0:01.14 sshd
1196 root 20 0 10228 2904 600 S 0.0 0.0 0:00.31 dhclient
30417 root 20 0 200604 2728 1860 S 0.0 0.0 0:06.97 slurmstepd
22325 root 20 0 200604 2720 1852 S 0.0 0.0 0:16.13 slurmstepd
23309 root 20 0 200604 2720 1852 S 0.0 0.0 0:15.19 slurmstepd
1 root 20 0 33648 2596 1104 S 0.0 0.0 0:02.96 init
21142 root 20 0 200604 2552 1684 S 0.0 0.0 0:14.56 slurmstepd
1816 nobody 20 0 234620 2348 1148 S 0.0 0.0 2:16.95 gmetad
38356 ubuntu 20 0 24192 2192 1152 R 0.3 0.0 0:04.30 top
3774 munge 20 0 225296 2164 1128 S 0.0 0.0 0:06.53 munged
38334 ubuntu 20 0 105652 1876 896 S 0.0 0.0 0:00.06 sshd
1465 root 20 0 43456 1636 1296 S 0.0 0.0 0:00.00 systemd-logind
3676 nobody 20 0 44964 1528 1060 S 0.0 0.0 3:52.68 gmond
37656 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37698 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37761 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37811 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37864 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
37911 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38003 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38068 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38171 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
38380 root 20 0 59644 1496 1044 S 0.0 0.0 0:00.00 cron
937 root 20 0 51240 1284 744 S 0.0 0.0 0:00.12 systemd-udevd
1745 root 20 0 19812 1252 456 S 0.0 0.0 4:46.80 irqbalance
1452 message+ 20 0 39244 1188 808 S 0.0 0.0 0:00.05 dbus-daemon
I noticed that a large chunk of the used memory is cache, which is actually available according to this article: http://www.linuxatemyram.com/. I'm also considering what Brian mentioned about manually increasing the thread requirement so that each node is limited to 3 jobs at a time, but that probably means I'd need to do the same for the other stages as well.
Another solution I have in mind is to configure the scheduler to allocate at most 80% of each node's resources (see the sketch after the quoted line below), but I've had no luck yet, and the bogart stage would request the whole node anyway, so it couldn't be scheduled at all without manually limiting its resource requirement:
-- Run under grid control using 157 GB and 32 CPUs for stage 'bogart'.
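For illustration, one way to approximate the 80% idea would be to advertise less memory to Slurm than the node physically has, so jobs are packed against roughly 128 GB instead of 157 GB (the node name and numbers below are illustrative, and this only takes effect because SelectTypeParameters includes memory):

# slurm.conf fragment -- RealMemory is in MB
NodeName=ip-10-30-2-49 CPUs=40 RealMemory=128000 State=UNKNOWN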
@brianwalenz No, everything reads and writes to AWS EFS, which is an NFS-like service provided by AWS with pretty much unlimited space.
I missed that there was some other slurm config limiting the jobs, so that makes sense.
Given the top output, I don't think memory is an issue, since nothing else on the system is using it. It is strange that one job per machine ran in 7 hours but multiple jobs are taking over 5 days to convert a file. That should be nearly instantaneous since it's just streaming a file and writing it back out, and the same amount of data is generated whether you run 1 job/machine or 4 jobs/machine.
Do these machines have local scratch space? I would edit the mhap.sh file to write output to local scratch if it is available, and perhaps even copy the queries/blocks locally too if there is enough space. See how long jobs take when running on local disk 4 at a time per node. That would eliminate network/NFS issues.
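A rough sketch of the idea (this is not the actual generated mhap.sh; the scratch mount point and the EFS path are assumptions):

SCRATCH=/mnt/local/mhap.$SLURM_JOB_ID       # assumes a local disk is mounted at /mnt/local
mkdir -p $SCRATCH/results
cp -rL blocks queries $SCRATCH/             # optionally stage inputs locally too; -L follows the symlinks in queries/
cd $SCRATCH
# ... run the existing java mhap and mhapConvert commands here, writing into ./results/ ...
cp results/* /efs/assembly/correction/1-overlapper/results/   # copy outputs back to EFS
rm -rf $SCRATCH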
Any update?
Not yet, will try Sergey's suggestion and let you guys know.
Closing, inactive. Please open a new issue if you encounter an error with this assembly/run.
Hi, I'm not sure if this question should be posted here, so please direct me to the correct place if needed.
I'm running Canu release v1.5 on a cluster of 5 AWS EC2 instances with Ubuntu 14.04; each has 40 CPUs and 160 GB memory, and all are connected to AWS EFS. The scheduler is Slurm 16.05.3. The command I ran is basically the following, but with different directories:
canu -p maize_01DKD2 -d maize_01DKD2-auto genomeSize=2400m -pacbio-raw *.fastq
Here is the partial output from canu.*.out:
With the default Slurm configuration, only one job can run at a time on each node, so for my setup 5 jobs run simultaneously. I was at the 'mhap (cor)' stage, and each job took about 7 hours to finish. I'm not sure if that time sounds about right, but it felt slow, and since only one job runs on each node, about 75% of each node's resources go unused. As a result, I changed the Slurm configuration to allow more than one job on each node, basically from
SelectType=select/linear
to
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
Now I have 20 jobs running at the same time, which is expected, but the time each job takes is significantly longer. As you can see from the results, some jobs have been running for almost 5 days.
The CPU usage on these nodes also seems low: 3 of the 5 average around 1%, while the other 2 are at about 5-10%.
Do you have any idea where the problem might be? Also, you mentioned that Canu works pretty well with Slurm, so I'm wondering if you have a recommended Slurm configuration.
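A quick way to see how many jobs Slurm is actually packing onto each node is something like this (plain Slurm commands, not specific to canu):

squeue -t RUNNING -o "%.10i %.8u %.10M %.4C %R"   # job id, user, runtime, CPUs, assigned node
sinfo -N -o "%N %C %e"                            # per-node CPU allocation (A/I/O/T) and free memory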