blokhin opened this issue 3 years ago
This continues to occur on roughly every 20th calculation on Hetzner:
```
[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
```
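One mitigation sometimes suggested for `btl_tcp` resets on cloud nodes is to pin Open MPI's TCP transport to a single known-good network interface, so it cannot pick a virtual or flaky one. The interface name `eth0` and the rank count below are assumptions; check the actual names with `ip addr` on the node:

```shell
# Restrict Open MPI's TCP BTL and out-of-band channel to one interface.
# btl_tcp_if_include / oob_tcp_if_include are standard Open MPI MCA flags;
# eth0 and -np 8 are assumptions for a typical Hetzner node.
mpirun --mca btl_tcp_if_include eth0 \
       --mca oob_tcp_if_include eth0 \
       -np 8 Pcrystal
```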
And again now:
```
root@aiida9:~# yastatus
319 RUNNING
root@aiida9:~# yastatus -v
..................................................ID319 aiida-4727 at root@65.108.215.129:hetzner:data/tasks/20221202_194808_319
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
```
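Since the failure shows up so distinctly in the logs, a small check like the following (a sketch; the threshold is an assumption) could let the scheduler flag such runs for resubmission instead of leaving them to spin:

```python
import re

# Matches Open MPI's TCP BTL recv failure lines, as seen in the logs above.
RESET_RE = re.compile(r"mca_btl_tcp_recv_blocking\] recv\(\d+\) failed: "
                      r"Connection reset by peer")

def count_tcp_resets(log_text: str) -> int:
    """Count 'Connection reset by peer' events in an Open MPI log."""
    return sum(1 for line in log_text.splitlines() if RESET_RE.search(line))

def looks_broken(log_text: str, threshold: int = 3) -> bool:
    """Heuristic: several resets usually mean the run will never finish.

    The threshold of 3 is an assumption, not a tuned value.
    """
    return count_tcp_resets(log_text) >= threshold
```

A cron job or a yascheduler hook could run this over each task's log and kill-and-requeue anything that trips the heuristic.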
So it's just consuming resources without doing any useful work :cry: :cry: :cry:
```
$ top
Tasks: 130 total, 1 running, 129 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 1.6 sy, 0.0 ni, 96.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31413.5 total, 29556.7 free, 246.1 used, 1610.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30519.9 avail Mem

  PID USER  PR  NI   VIRT   RES   SHR S %CPU %MEM   TIME+ COMMAND
10669 root  20   0 267228 20420 13640 S 13.3  0.1 1148:21 Pcrystal
10672 root  20   0 267228 20464 13684 S 13.3  0.1 1158:01 Pcrystal
10675 root  20   0 267228 22340 13520 S 13.3  0.1 1148:37 Pcrystal
10667 root  20   0 267228 20284 13504 S  6.7  0.1 1170:54 Pcrystal
10668 root  20   0 267228 22044 13228 S  6.7  0.1 1156:48 Pcrystal
10670 root  20   0 267228 20356 13576 S  6.7  0.1 1150:24 Pcrystal
    1 root  20   0 170568 10444  7956 S  0.0  0.0  0:43.95 systemd
    2 root  20   0      0     0     0 S  0.0  0.0  0:00.21 kthreadd
    3 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 rcu_gp
    4 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 rcu_par_gp
    6 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 kworker/0:0H-events_highpri
    8 root   0 -20      0     0     0 I  0.0  0.0  0:25.24 kworker/0:1H-events_highpri
    9 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 mm_percpu_wq
   10 root  20   0      0     0     0 S  0.0  0.0  0:01.86 ksoftirqd/0
   11 root  20   0      0     0     0 I  0.0  0.0  0:31.81 rcu_sched
   12 root  20   0      0     0     0 I  0.0  0.0  0:00.00 rcu_bh
   13 root  rt   0      0     0     0 S  0.0  0.0  0:01.55 migration/0
   15 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/0
   16 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/1
   17 root  rt   0      0     0     0 S  0.0  0.0  0:01.72 migration/1
   18 root  20   0      0     0     0 S  0.0  0.0  0:01.92 ksoftirqd/1
   19 root  20   0      0     0     0 I  0.0  0.0  0:03.73 kworker/1:0-mm_percpu_wq
   20 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 kworker/1:0H-kblockd
   21 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/2
   22 root  rt   0      0     0     0 S  0.0  0.0  0:01.76 migration/2
   23 root  20   0      0     0     0 S  0.0  0.0  0:02.78 ksoftirqd/2
   25 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 kworker/2:0H-events_highpri
   26 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/3
   27 root  rt   0      0     0     0 S  0.0  0.0  0:01.74 migration/3
   28 root  20   0      0     0     0 S  0.0  0.0  0:01.81 ksoftirqd/3
   29 root  20   0      0     0     0 I  0.0  0.0  0:03.60 kworker/3:0-mm_percpu_wq
   30 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 kworker/3:0H-events_highpri
   31 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/4
   32 root  rt   0      0     0     0 S  0.0  0.0  0:01.76 migration/4
   33 root  20   0      0     0     0 S  0.0  0.0  0:01.84 ksoftirqd/4
   35 root   0 -20      0     0     0 I  0.0  0.0  0:00.00 kworker/4:0H-events_highpri
   36 root  20   0      0     0     0 S  0.0  0.0  0:00.00 cpuhp/5
   37 root  rt   0      0     0     0 S  0.0  0.0  0:01.94 migration/5
   38 root  20   0      0     0     0 S  0.0  0.0  0:01.66 ksoftirqd/5
   39 root  20   0      0     0     0 I  0.0  0.0  9:32.20 kworker/5:0-mm_percpu_wq
```
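The `top` output shows the orphaned `Pcrystal` ranks idling at a few percent CPU for hours after the run effectively died. A simple watchdog heuristic is to check whether the job's output file is still being written to; this is a sketch, where the staleness threshold is an assumption to be tuned per workload:

```python
import os
import time

def is_stalled(output_path, max_idle_s=3600.0, now=None):
    """Treat a run as stalled if its output file has not been modified
    for max_idle_s seconds (default 1 hour, an assumed threshold).

    A missing output file also counts as suspicious once the job
    has been running for a while.
    """
    if now is None:
        now = time.time()
    try:
        return now - os.path.getmtime(output_path) > max_idle_s
    except FileNotFoundError:
        return True
```

A scheduler-side loop could call this per task and terminate the MPI processes of anything flagged, so the node is freed instead of billing idle hours.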
This is a serious issue causing real money losses; it should be addressed ASAP :fire_engine:
I'm having a similar issue. I got an error while the Pcrystal process was still running.
```
..................................................ID37 aiida-825 at root@...with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
```
This may be caused by using the wrong Open MPI version. The developers recommend openmpi-2.1.* here and here.
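If a version mismatch is the suspect, a quick sanity check against the recommended series could look like this. The recommended prefix `2.1.` comes from the comment above; parsing the `mpirun --version` banner format is an assumption (it typically reads `mpirun (Open MPI) X.Y.Z`):

```python
import re

RECOMMENDED_PREFIX = "2.1."  # openmpi-2.1.* per the recommendation above

def parse_ompi_version(version_output):
    """Extract the version number from `mpirun --version` output,
    e.g. 'mpirun (Open MPI) 2.1.6' -> '2.1.6'."""
    m = re.search(r"\(Open MPI\)\s+([\d.]+)", version_output)
    if not m:
        raise ValueError("could not find an Open MPI version string")
    return m.group(1)

def is_recommended(version):
    """True if the installed version is in the recommended 2.1.* series."""
    return version.startswith(RECOMMENDED_PREFIX)

# To check a live node (not run here):
# out = subprocess.run(["mpirun", "--version"],
#                      capture_output=True, text=True).stdout
# print(is_recommended(parse_ompi_version(out)))
```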