radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

not able to pull any units #101

Closed euhruska closed 5 years ago

euhruska commented 5 years ago

The pilot is not able to pull any units. It happened again after several submissions that ran without issues. The same issue is documented at the bottom of https://github.com/radical-collaboration/extasy-grlsd/issues/98

andre-merzky commented 5 years ago

So far, the instances where the pilot did not pull units indicated that EnTK did not submit units. @vivek-bala, is there a quick way to confirm that without checking the pilot logfiles? @euhruska: on the RP layer, you can check by running `grep unit.000000 *.log` in the client session dir. If that comes up empty, no units have been submitted to RP.
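For repeated checks across several sessions, the same test can be scripted. The following is only a minimal sketch of the `grep` check above, not part of RP or EnTK: the function name `find_unit_mentions` is hypothetical, and it assumes the client session directory holds its logs as `*.log` files. Unlike `grep`, it matches `unit.000000` as a literal substring, so the `.` is not treated as a wildcard.

```python
from pathlib import Path
import sys


def find_unit_mentions(session_dir, needle="unit.000000"):
    """Return {log file name: number of lines containing `needle`}.

    An empty dict corresponds to the grep coming up empty, i.e. no
    units appear to have been submitted to RP in this session.
    """
    hits = {}
    for log in sorted(Path(session_dir).glob("*.log")):
        # errors="replace" keeps the scan going on odd bytes in a log
        with log.open(errors="replace") as fh:
            count = sum(1 for line in fh if needle in line)
        if count:
            hits[log.name] = count
    return hits


if __name__ == "__main__":
    # Hypothetical usage: pass the client session dir as first argument
    session = sys.argv[1] if len(sys.argv) > 1 else "."
    found = find_unit_mentions(session)
    if found:
        for name, count in found.items():
            print(f"{name}: {count} matching lines")
    else:
        print("no mention of unit.000000: no units were submitted to RP")
```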

euhruska commented 5 years ago

`grep unit.000000 *.log` did indeed come up empty. In other, successful client sessions it gave many results.

euhruska commented 5 years ago

Currently, the situation is: the 1st iteration works, but instead of running the 2nd iteration, the remote agent_0.log shows the previously experienced endless `units pulled: 0` messages, and then the agent stops doing anything. This has happened several times, so it prevents me from running more than one iteration at a time. Remote logs: https://drive.google.com/file/d/1NxU27PanDHSyGhg_3nlH31sNuRRkur41/view?usp=sharing

The end of agent_0.log:

2018-10-21 02:13:34,947: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:35,963: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:36,979: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:37,995: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:39,014: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:40,210: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:40,481: agent_0             : update.0                        : MainThread     : DEBUG   : update.0.child stop called
2018-10-21 02:13:40,989: agent_0             : update.0                        : MainThread     : DEBUG   : update.0.child stop called
2018-10-21 02:13:41,212: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : update.0 stop called
2018-10-21 02:13:41,534: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : update.0 stop called
2018-10-21 02:13:41,536: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: WARNING : sub component update.0 is invalid
2018-10-21 02:13:41,536: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: WARNING : component agent_0 is invalid
2018-10-21 02:13:41,538: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : stop agent_0 (24496 : None : agent_0.idler._check_units_cb) [radical.pilot.utils.component.Agent_0.is_valid]
2018-10-21 02:13:41,661: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : parent stops child  24496 -> None [agent_0]
2018-10-21 02:13:41,679: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : publish "terminate" cmd
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister idler agent_0.idler._check_units_cb
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : child calls stop()
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregistered idler agent_0.idler._check_units_cb
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister output AGENT_STAGING_INPUT_PENDING
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: WARNING : input AGENT_STAGING_INPUT_PENDING is not registered
2018-10-21 02:13:41,680: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister idler agent_0.idler._agent_command_cb
2018-10-21 02:13:41,681: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:41,681: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregistered idler agent_0.idler._agent_command_cb
2018-10-21 02:13:41,681: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : stop    lrms <radical.pilot.agent.rm.torque.Torque object at 0x7f816b398a90>
2018-10-21 02:13:41,681: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : no LRMS shutdown hook defined for LaunchMethod APRUN
2018-10-21 02:13:41,681: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : lrms shutdown hook succeeded (APRUN)
2018-10-21 02:13:41,682: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : stopped lrms <radical.pilot.agent.rm.torque.Torque object at 0x7f816b398a90>
2018-10-21 02:13:41,682: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : final state: FAILED (None)
2018-10-21 02:13:41,682: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : ru_finalize_common()
2018-10-21 02:13:41,682: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister publisher log_pubsub
2018-10-21 02:13:41,695: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : unregistered publisher log_pubsub
2018-10-21 02:13:41,695: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister publisher state_pubsub
2018-10-21 02:13:41,708: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : unregistered publisher state_pubsub
2018-10-21 02:13:41,708: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : TERM : agent_0 unregister publisher control_pubsub
2018-10-21 02:13:41,717: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : unregistered publisher control_pubsub
2018-10-21 02:13:41,717: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : agent_0 close prof
2018-10-21 02:13:42,813: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : update.0 stop called
2018-10-21 02:13:42,930: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : child calls stop()
2018-10-21 02:13:42,930: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:42,931: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:42,945: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-21 02:13:42,945: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : ru_finalize_child (NOOP)
2018-10-21 02:13:42,946: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : ru_finalize_common (NOOP)
2018-10-21 02:13:42,946: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : put message: [agent_0.idler._check_units_cb.thread] terminating
2018-10-21 02:13:42,960: agent_0             : MainProcess                     : agent_0.idler._agent_command_cb: DEBUG   : ru_finalize_child (NOOP)
2018-10-21 02:13:42,960: agent_0             : MainProcess                     : agent_0.idler._agent_command_cb: DEBUG   : ru_finalize_common (NOOP)
2018-10-21 02:13:42,960: agent_0             : MainProcess                     : agent_0.idler._agent_command_cb: INFO    : put message: [agent_0.idler._agent_command_cb.thread] terminating
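A long log like the excerpt above is easier to triage when condensed to a few numbers. The following is a rough sketch, assuming the line layout seen in this excerpt (a log level between colons, followed by the message); the function name `summarize_agent_log` and the keys of the returned dict are hypothetical and not part of RADICAL-Pilot. It counts the `units pulled` polls, sums the units they report, and collects WARNING/ERROR messages and the final state.

```python
import re

# Patterns derived from the excerpt above; other RP versions may log differently.
PULLED = re.compile(r"units pulled:\s*(\d+)")
LEVEL = re.compile(r":\s*(DEBUG|INFO|WARNING|ERROR)\s*:\s*(.*)$")


def summarize_agent_log(lines):
    """Condense agent log lines into poll counts, warnings and final state."""
    polls, units, warnings, final_state = 0, 0, [], None
    for line in lines:
        match = PULLED.search(line)
        if match:
            polls += 1
            units += int(match.group(1))
            continue
        match = LEVEL.search(line)
        if not match:
            continue
        level, msg = match.group(1), match.group(2).strip()
        if level in ("WARNING", "ERROR"):
            warnings.append(msg)
        elif msg.startswith("final state:"):
            final_state = msg.split(":", 1)[1].strip()
    return {"polls": polls, "units_pulled": units,
            "warnings": warnings, "final_state": final_state}


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "2018-10-21 02:13:40,210: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0",
    "2018-10-21 02:13:41,536: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: WARNING : sub component update.0 is invalid",
    "2018-10-21 02:13:41,536: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: WARNING : component agent_0 is invalid",
    "2018-10-21 02:13:41,682: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : final state: FAILED (None)",
    "2018-10-21 02:13:42,945: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0",
]

if __name__ == "__main__":
    print(summarize_agent_log(SAMPLE))
```

Many polls with zero units pulled, followed by a `final state: FAILED` line, matches the pattern described in this thread: the agent keeps polling but never receives units, then shuts down.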