radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK does not terminate after user programs are complete #127

Closed: Weiming-Hu closed this issue 3 years ago

Weiming-Hu commented 3 years ago

I have encountered this problem once while using EnTK on NCAR Cheyenne. My radical stack is as follows:

(venv) wuh20@cheyenne4:~> radical-stack 

  python               : 3.7.5
  pythonpath           : 
  virtualenv           : /glade/u/home/wuh20/venv

  radical.entk         : 1.5.5
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.7-v1.5.7-70-g03956fe3b@fix-mpiexec_mpt
  radical.saga         : 1.5.7
  radical.utils        : 1.5.7

My workflow consists of two stages. I requested 6480 CPUs for 480 minutes (8 hours). However, when the second stage completed (roughly after 2 hours), EnTK did not terminate as expected; instead it hung.
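For context, my driver script follows the usual EnTK shape. The following is only a minimal sketch, not my actual script: the executable, arguments, per-task core split, project, and queue are placeholders, and it assumes the EnTK 1.5 API with the RabbitMQ endpoint taken from environment variables.

import os
from radical.entk import Pipeline, Stage, Task, AppManager

p = Pipeline()

# Second ('power') stage; the first (analog generation) stage is built the
# same way and was skipped entirely here because its outputs already existed.
s = Stage()
s.name = 'stage-power'

for i in range(1, 31):                    # one task per domain chunk
    t = Task()
    t.name = 'power_domain-USA_chunk-{:02d}.cfg'.format(i)
    t.executable = 'power_simulator'      # placeholder executable
    t.arguments  = ['--config', 'chunk-{:02d}.cfg'.format(i)]   # placeholder
    t.cpu_reqs   = {'cpu_processes'   : 216,   # placeholder split of 6480 cores
                    'cpu_process_type': 'MPI',
                    'cpu_threads'     : 1,
                    'cpu_thread_type' : None}
    s.add_tasks(t)

p.add_stages(s)

amgr = AppManager(hostname=os.environ.get('RMQ_HOSTNAME', 'localhost'),
                  port=int(os.environ.get('RMQ_PORT', '5672')))
amgr.resource_desc = {'resource': 'ncar.cheyenne_mpt',
                      'walltime': 480,         # minutes (the 8-hour request)
                      'cpus'    : 6480,
                      'project' : 'XXXXXXXX',  # placeholder allocation
                      'queue'   : 'regular'}   # placeholder queue
amgr.workflow = [p]
amgr.run()

The console output from the hung run: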

(venv) wuh20@cheyenne4:~/github/pv-workflow/02_simulate/use-entk> python python_me.py                            
EnTK session: re.session.cheyenne4.wuh20.018589.0005                                  
Creating AppManagerSetting up RabbitMQ system                                 ok                                           
                                                                              ok      
Validating and assigning resource manager                                     ok                                              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-01.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-02.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-03.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-04.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-05.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-06.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-07.nc exists. Skip generating this file.                       
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-08.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-09.nc exists. Skip generating this file.         
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-10.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-11.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-12.nc exists. Skip generating this file.                      
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-13.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-14.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-15.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-16.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-17.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-18.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-19.nc exists. Skip generating this file.      
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-20.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-21.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-22.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-23.nc exists. Skip generating this file.                       
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-24.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-25.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-26.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-27.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-28.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-29.nc exists. Skip generating this file.              
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-30.nc exists. Skip generating this file.              
Nothing to be done for the analog generation stage!                                                                   
Setting up RabbitMQ system                                                   n/a                                      
new session: [re.session.cheyenne4.wuh20.018589.0005]                          \                                      
database   : [mongodb://hpcw-psu:****@129.114.17.185:27017/hpcw-psu]          ok                                         
create pilot manager                                                          ok                                               
submit 1 pilot(s)                                                                                                              
        pilot.0000   ncar.cheyenne_mpt      6480 cores       0 gpus           ok             
All components created                                                                       
create unit managerUpdate: pipeline.0000 state: SCHEDULING                                                                 
Update: pipeline.0000.stage-power state: SCHEDULING                                                     
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-06.cfg state: SCHEDULING                                         
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-27.cfg state: SCHEDULING                  
[...]

Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-16.cfg state: SCHEDULED
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-11.cfg state: SCHEDULED
Update: pipeline.0000.stage-power state: SCHEDULED
/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/topology.py:162: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe "MongoClient opened before fork. Create MongoClient only " ok
submit: ########################################################################
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-06.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-27.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-01.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-22.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-12.cfg state: SUBMITTING

[Ctrl+C pressed here]

close unit manager                                                            ok                                             
Exception in thread Thread-12:                                                                                                 
Traceback (most recent call last):                                                                                             
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 318, in _process_ta
sks                                                                                                                            
    mq_connection.close()                                                                                                      
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 730, in close         
    self._flush_output(self._closed_result.is_ready)                                                                           
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 443, in _flush_output 
    raise exceptions.ConnectionClosed()                                                                                        
pika.exceptions.ConnectionClosed                                                                                               

The above exception was the direct cause of the following exception:                                                           

Traceback (most recent call last):                                                                                             
  File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/threading.py", line 926, in _bootstrap_inner                 
    self.run()                                                                                                                 
  File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/threading.py", line 870, in run                              
    self._target(*self._args, **self._kwargs)                                                                                  
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 326, in _process_ta
sks                                                                                                                            
    raise EnTKError(e) from e                                                                                                  
radical.entk.exceptions.EnTKError                                                                                              

Process task-manager:                                                                                                          
Traceback (most recent call last):                                                                                             
  File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap         
    self.run()                                                                                                                 
  File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/multiprocessing/process.py", line 99, in run                 
    self._target(*self._args, **self._kwargs)                                                                                  
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 177, in _tmgr      
    heartbeat_response(mq_channel)                                                                                             
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 113, in heartbeat_r
esponse                                                                                                                        
    mq_channel.basic_get(queue=self._hb_request_q)                                                                             
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get    
    self._basic_getempty_result.is_ready)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
    self._impl.ioloop.poll()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
    self._poller.poll()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/select_connection.py", line 1102, in poll
    events = self._poll.poll(self._get_max_wait())
KeyboardInterrupt
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.cheyenne4.wuh20.018589.0005                         \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.cheyenne4.wuh20.018589.0005 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 26373.7s                                                    ok
All components terminated
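
For what it's worth, both tracebacks above point into pika: the task manager's heartbeat thread keeps polling a BlockingConnection while another thread closes it, and pika's BlockingConnection is not thread-safe. Below is a minimal sketch of that failure mode, not EnTK code: it assumes a RabbitMQ broker on localhost, and the queue name is made up. Depending on timing and pika version, the polling thread either raises ConnectionClosed (as in the first traceback) or blocks in poll() until interrupted (as in the second).

import threading
import time

import pika
from pika.exceptions import ConnectionClosed

conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = conn.channel()
channel.queue_declare(queue='hb-request')      # hypothetical queue name

def heartbeat_loop():
    # Mimics heartbeat_response() in task_manager.py: poll the queue
    # until the connection disappears underneath us.
    try:
        while True:
            channel.basic_get(queue='hb-request')
            time.sleep(1)
    except ConnectionClosed:
        print('heartbeat thread saw ConnectionClosed')

t = threading.Thread(target=heartbeat_loop)
t.start()
time.sleep(3)
conn.close()     # close from the main thread while the other thread polls
t.join()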

EnTK was stuck at the submitting stage, but the tasks were complete with all the output ready. When I went back on Cheyenne to check the progress, I found that my data had already been generated while EnTK was still running. When I sshed into one of the compute nodes, I noticed several rp.* processes sleeping. Please see the following output from top on one of the compute nodes while EnTK was stuck:

top - 21:45:05 up 41 days,  7:57,  1 user,  load average: 0.07, 0.07, 0.04
Tasks: 738 total,   1 running, 737 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65507008 total, 15499832 used, 50007176 free,        0 buffers
KiB Swap:        0 total,        0 used,        0 free.  3747756 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND                                     
 3162 wuh20     20   0  598648  19904   7476 S  0.664 0.030   0:32.48 rp.agent_stagin                             
 3096 wuh20     20   0 1297296  39660  10192 S  0.332 0.061   2:21.38 rp.agent.0                                  
 3154 wuh20     20   0  598648  19812   7604 S  0.332 0.030   0:30.41 rp.agent_schedu                             
 3182 wuh20     20   0  598648  19380   7192 S  0.332 0.030   0:33.35 rp.funcs_res_qu                             
 3309 wuh20     20   0 1224428  36188   7676 S  0.332 0.055   0:34.67 rp.agent_schedu                             
 3357 wuh20     20   0 1086900  36468   8280 S  0.332 0.056   0:38.32 rp.update.0000                              
45669 wuh20     20   0   16564   2400   1312 R  0.332 0.004   0:00.07 top                                         
 2447 wuh20     20   0   13128   2000   1648 S  0.000 0.003   0:00.00 bash                                        
 2453 wuh20     20   0   10708    472    356 S  0.000 0.001   0:00.02 pbs_demux                                   
 2457 wuh20     20   0   13260   1868   1512 S  0.000 0.003   0:00.00 bash                                        
 2459 wuh20     20   0   13928   2848   1740 S  0.000 0.004   0:02.54 bash                                        
 2552 wuh20     20   0   38724   5792   5000 S  0.000 0.009   0:01.37 ssh                                         
 3150 wuh20     20   0  603512  25532   7492 S  0.000 0.039   0:33.67 rp.agent_execut                             
 3158 wuh20     20   0  598648  19656   7272 S  0.000 0.030   0:33.18 rp.agent_schedu                             
 3166 wuh20     20   0  598904  19868   7368 S  0.000 0.030   0:32.62 rp.agent_stagin                             
 3170 wuh20     20   0  598648  19656   7472 S  0.000 0.030   0:29.89 rp.agent_unsche                             
 3174 wuh20     20   0  598648  20156   7668 S  0.000 0.031   0:30.26 rp.control_pubs                             
 3178 wuh20     20   0  598648  19740   7524 S  0.000 0.030   0:33.52 rp.funcs_req_qu                             
 3186 wuh20     20   0  598648  19824   7604 S  0.000 0.030   0:29.88 rp.state_pubsub                             
 3293 wuh20     20   0 1163204  41516   7644 S  0.000 0.063   0:45.16 rp.agent_execut                             
 3325 wuh20     20   0 1020596  35220   7732 S  0.000 0.054   0:32.64 rp.agent_stagin                             
 3341 wuh20     20   0  951212  35368   7840 S  0.000 0.054   0:32.65 rp.agent_stagin                             
 3431 wuh20     20   0  806604  36244   7504 S  0.000 0.055   0:20.43 rp.agent_schedu                             
45598 wuh20     20   0   75032   3196   2404 S  0.000 0.005   0:00.00 sshd                                        
45599 wuh20     20   0   17204   5640   2092 S  0.000 0.009   0:00.04 bash                                        
45693 wuh20     20   0    5764   1336   1244 S  0.000 0.002   0:00.00 sleep                                       

I then terminated EnTK manually.
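
If it happens again, I may automate that manual step with a watchdog around amgr.run(). This is only a rough sketch (amgr is the AppManager from the sketch above, and the timeout mirrors the 480-minute request); note that signal.alarm() only interrupts the main thread, so it will not necessarily unwind EnTK's worker threads and processes cleanly.

import signal

class RunTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise RunTimeout('EnTK did not finish within the requested walltime')

signal.signal(signal.SIGALRM, _on_alarm)     # install the handler
signal.alarm(480 * 60)                       # arm a 480-minute timer
try:
    amgr.run()                               # the possibly-hanging call
finally:
    signal.alarm(0)                          # disarm the timer on any exit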

There are no error messages in the client sandbox when I run grep -R Error *, and none in the server sandbox when I run grep -R Error *.log.

I'm not sure whether this issue is related to another connection issue I opened in the EnTK repository. This is the second time I have run an entire workflow to completion; the first time, I did not have this problem.

Thank you all for your help.

lee212 commented 3 years ago

@Weiming-Hu, I will be following this up along with the open ticket https://github.com/radical-cybertools/radical.entk/issues/509, as this needs to be addressed in EnTK's use of pika. We anticipate this will be resolved no later than the January release.

Weiming-Hu commented 3 years ago

Sounds good to me. I will be actively monitoring this thread. Thanks.