tilde-lab / yascheduler

Yet another cloud computing scheduler for the high-throughput cloud scientific simulations
https://mpds.io/search/ab%20initio%20calculations
MIT License

Hetzner: calc engine falls asleep and outputs nothing in approx. 10% of cases #11

Open blokhin opened 3 years ago

blokhin commented 3 years ago

Manifested as:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1872 root      20   0  269344  20472  13680 S   7.0   0.1 174:17.89 Pcrystal
 1874 root      20   0  269344  20172  13380 S   6.7   0.1 173:49.32 Pcrystal
 1875 root      20   0  269344  20128  13336 S   6.7   0.1 174:53.19 Pcrystal
 1876 root      20   0  269344  20388  13600 S   6.7   0.1 173:32.07 Pcrystal
 1886 root      20   0  269344  20204  13412 S   6.3   0.1 174:08.48 Pcrystal
 1877 root      20   0  269344  20132  13340 S   6.0   0.1 174:18.18 Pcrystal
 1881 root      20   0  269344  20404  13612 S   5.7   0.1 175:58.19 Pcrystal

and

root@node-dwsxhftb:~# cat /data/20201212_194555_dury/OUTPUT
[node-dwsxhftb][[36298,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-dwsxhftb][[36298,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-dwsxhftb][[36298,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
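
To see which ranks are affected, the OUTPUT file can be scanned for these Open MPI TCP BTL errors. Below is a minimal sketch, not part of yascheduler: the function name `count_btl_tcp_errors` is hypothetical, and it assumes the last number in the bracketed process name (e.g. the `6` in `[[36298,1],6]`) is the MPI rank.

```python
import re
from collections import Counter

# Matches lines such as:
# [node-dwsxhftb][[36298,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
BTL_TCP_ERROR = re.compile(
    r"^\[(?P<host>[^\]]+)\]"                       # node name
    r"\[\[(?P<jobfam>\d+),(?P<job>\d+)\],(?P<rank>\d+)\]"  # process name
    r"\[btl_tcp\.c:\d+:(?P<func>\w+)\]\s+"         # source location
    r"recv\((?P<fd>\d+)\) failed: (?P<reason>.+?) \((?P<errno>\d+)\)\s*$"
)


def count_btl_tcp_errors(text):
    """Return a Counter mapping rank (assumed) -> number of recv errors."""
    counts = Counter()
    for line in text.splitlines():
        match = BTL_TCP_ERROR.match(line.strip())
        if match:
            counts[int(match.group("rank"))] += 1
    return counts
```

Running it on the three lines above would attribute one error to rank 6 and two to rank 7, which at least shows whether the resets cluster on particular ranks.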

blokhin commented 2 years ago

Continues to occur in about every 20th calculation on Hetzner:

[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],5][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-kcjlndpb][[7780,1],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)

blokhin commented 1 year ago

And again now:

root@aiida9:~# yastatus
319   RUNNING
root@aiida9:~# yastatus -v
..................................................ID319 aiida-4727 at root@65.108.215.129:hetzner:data/tasks/20221202_194808_319
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],2][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(19) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],7][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(16) failed: Connection reset by peer (104)
[node-gusrkxef][[30587,1],6][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)

So the task is just eating resources without doing any useful work :cry: :cry: :cry:
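
One possible mitigation would be a watchdog that flags a task whose output file has not changed for too long. This is only a sketch under assumptions: `is_stalled` and the idle threshold are hypothetical, nothing like this is confirmed to exist in yascheduler, and a healthy calculation may also stay silent for a while, so the threshold would need tuning.

```python
import os
import time


def is_stalled(output_path, max_idle_seconds, now=None):
    """Return True if the output file was last modified more than
    max_idle_seconds ago. A missing file is reported as not stalled,
    since the task may simply not have started writing yet."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(output_path)
    except FileNotFoundError:
        return False
    return (now - mtime) > max_idle_seconds
```

For example, `is_stalled("/data/20201212_194555_dury/OUTPUT", 3600)` would report whether that OUTPUT file has been untouched for over an hour; what to do next (kill, resubmit, release the node) is left to the caller.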

blokhin commented 1 year ago

$ top
Tasks: 130 total,   1 running, 129 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  1.6 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31413.5 total,  29556.7 free,    246.1 used,   1610.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30519.9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                  
10669 root      20   0  267228  20420  13640 S  13.3   0.1   1148:21 Pcrystal                                                                                                                 
10672 root      20   0  267228  20464  13684 S  13.3   0.1   1158:01 Pcrystal                                                                                                                 
10675 root      20   0  267228  22340  13520 S  13.3   0.1   1148:37 Pcrystal                                                                                                                 
10667 root      20   0  267228  20284  13504 S   6.7   0.1   1170:54 Pcrystal                                                                                                                 
10668 root      20   0  267228  22044  13228 S   6.7   0.1   1156:48 Pcrystal                                                                                                                 
10670 root      20   0  267228  20356  13576 S   6.7   0.1   1150:24 Pcrystal                                                                                                                 
    1 root      20   0  170568  10444   7956 S   0.0   0.0   0:43.95 systemd                                                                                                                  
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.21 kthreadd                                                                                                                 
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                                                                                   
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                                                                               
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri                                                                                              
    8 root       0 -20       0      0      0 I   0.0   0.0   0:25.24 kworker/0:1H-events_highpri                                                                                              
    9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                                                                             
   10 root      20   0       0      0      0 S   0.0   0.0   0:01.86 ksoftirqd/0                                                                                                              
   11 root      20   0       0      0      0 I   0.0   0.0   0:31.81 rcu_sched                                                                                                                
   12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_bh                                                                                                                   
   13 root      rt   0       0      0      0 S   0.0   0.0   0:01.55 migration/0                                                                                                              
   15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                                                                                                                  
   16 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                                                                                                                  
   17 root      rt   0       0      0      0 S   0.0   0.0   0:01.72 migration/1                                                                                                              
   18 root      20   0       0      0      0 S   0.0   0.0   0:01.92 ksoftirqd/1                                                                                                              
   19 root      20   0       0      0      0 I   0.0   0.0   0:03.73 kworker/1:0-mm_percpu_wq                                                                                                 
   20 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-kblockd                                                                                                     
   21 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2                                                                                                                  
   22 root      rt   0       0      0      0 S   0.0   0.0   0:01.76 migration/2                                                                                                              
   23 root      20   0       0      0      0 S   0.0   0.0   0:02.78 ksoftirqd/2                                                                                                              
   25 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/2:0H-events_highpri                                                                                              
   26 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/3                                                                                                                  
   27 root      rt   0       0      0      0 S   0.0   0.0   0:01.74 migration/3                                                                                                              
   28 root      20   0       0      0      0 S   0.0   0.0   0:01.81 ksoftirqd/3                                                                                                              
   29 root      20   0       0      0      0 I   0.0   0.0   0:03.60 kworker/3:0-mm_percpu_wq                                                                                                 
   30 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/3:0H-events_highpri                                                                                              
   31 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/4                                                                                                                  
   32 root      rt   0       0      0      0 S   0.0   0.0   0:01.76 migration/4                                                                                                              
   33 root      20   0       0      0      0 S   0.0   0.0   0:01.84 ksoftirqd/4                                                                                                              
   35 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/4:0H-events_highpri                                                                                              
   36 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/5                                                                                                                  
   37 root      rt   0       0      0      0 S   0.0   0.0   0:01.94 migration/5                                                                                                              
   38 root      20   0       0      0      0 S   0.0   0.0   0:01.66 ksoftirqd/5                                                                                                              
   39 root      20   0       0      0      0 I   0.0   0.0   9:32.20 kworker/5:0-mm_percpu_wq
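
The same symptom can be checked programmatically from `top` output (for instance from `top -b -n 1`). The sketch below is an illustration only: `summarize_top` is a hypothetical helper, and it assumes the default column layout shown above (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND).

```python
def summarize_top(top_text, command="Pcrystal"):
    """Summarize the processes named `command` in default-layout top output."""
    rows = []
    for line in top_text.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID and have at least 12 columns;
        # this skips the summary header and the column title line.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        if fields[11] != command:
            continue
        rows.append({
            "pid": int(fields[0]),
            "state": fields[7],       # S = sleeping, R = running, ...
            "cpu": float(fields[8]),  # %CPU of a single core
        })
    return {
        "count": len(rows),
        "all_sleeping": bool(rows) and all(r["state"] == "S" for r in rows),
        "total_cpu": sum(r["cpu"] for r in rows),
    }
```

On the listing above it would report six Pcrystal processes, all in state S, with a combined %CPU far below what six busy ranks would be expected to use.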

blokhin commented 8 months ago

This is a very bad issue causing severe money losses; it should be addressed ASAP :fire_engine:

akvatol commented 3 months ago

I'm having a similar issue. I got an error while the Pcrystal process was still running.

..................................................ID37 aiida-825 at root@...with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795
[node-cnsxffah:02513] PMIX ERROR: UNREACHABLE in file ../../../src/server/pmix_server.c at line 2795

This may be caused by using the wrong Open MPI version. The developers recommend openmpi-2.1.*.
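
A quick way to check which Open MPI is installed on a node is to parse the output of `mpirun --version`. The sketch below is an assumption-laden example: the helper names are hypothetical, and it relies on the first line looking like `mpirun (Open MPI) 2.1.1`.

```python
import re
import subprocess


def parse_open_mpi_version(version_output):
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", version_output)
    return tuple(int(part) for part in match.groups()) if match else None


def is_recommended_series(version, series=(2, 1)):
    """True if the version belongs to the given major.minor series (2.1.* by default)."""
    return version is not None and version[:2] == series


def installed_open_mpi_version(mpirun="mpirun"):
    """Run `mpirun --version` on this machine and parse the result."""
    result = subprocess.run(
        [mpirun, "--version"], capture_output=True, text=True, check=True
    )
    return parse_open_mpi_version(result.stdout)
```

Calling `is_recommended_series(installed_open_mpi_version())` on a node would then tell whether it runs a 2.1.* release; whether that actually fixes the hang is not confirmed in this thread.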