Closed by euhruska 5 years ago
So far, the instances where the pilot did not pull units indicated that EnTK did not submit any units. @vivek-bala, is there a quick way to confirm that without checking the pilot logfiles?
@euhruska: on the RP layer, you can check by running `grep unit.000000 *.log` in the client session dir. If that comes up empty, no units have been submitted to RP.
`grep unit.000000 *.log` did indeed come up empty. In other, successful client sessions it gave many results.
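For anyone who wants to script the check above, here is a minimal Python sketch of the same idea. The function name `logs_mentioning` and its parameters are hypothetical and not part of RP or EnTK. It treats the unit ID as a literal substring, whereas `grep` treats the `.` as a wildcard, so results can differ slightly in edge cases.

```python
import glob
import os


def logs_mentioning(session_dir, needle="unit.000000"):
    """Return the *.log files in session_dir that contain `needle`.

    Hypothetical helper: it mirrors `grep -l unit.000000 *.log`,
    but matches the needle as a plain substring, not as a regex.
    """
    hits = []
    for path in sorted(glob.glob(os.path.join(session_dir, "*.log"))):
        # errors="replace" keeps the scan going on undecodable bytes
        with open(path, errors="replace") as fh:
            if any(needle in line for line in fh):
                hits.append(path)
    return hits
```

An empty list would correspond to the empty `grep` result, i.e. no trace of that unit in the client session logs.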
Currently, the situation is: the 1st iteration works, but instead of running the 2nd iteration, the remote agent_0.log shows the previously experienced endless `units pulled: 0` messages, and then the agent stops doing anything.
This has happened several times, so it prevents me from running more than one iteration at a time.
remote logs:
https://drive.google.com/file/d/1NxU27PanDHSyGhg_3nlH31sNuRRkur41/view?usp=sharing
the end of the agent_0.log:
2018-10-21 02:13:34,947: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:35,963: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:36,979: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:37,995: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:39,014: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:40,210: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:40,481: agent_0 : update.0 : MainThread : DEBUG : update.0.child stop called
2018-10-21 02:13:40,989: agent_0 : update.0 : MainThread : DEBUG : update.0.child stop called
2018-10-21 02:13:41,212: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : update.0 stop called
2018-10-21 02:13:41,534: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : update.0 stop called
2018-10-21 02:13:41,536: agent_0 : MainProcess : agent_0.idler._check_units_cb: WARNING : sub component update.0 is invalid
2018-10-21 02:13:41,536: agent_0 : MainProcess : agent_0.idler._check_units_cb: WARNING : component agent_0 is invalid
2018-10-21 02:13:41,538: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : stop agent_0 (24496 : None : agent_0.idler._check_units_cb) [radical.pilot.utils.component.Agent_0.is_valid]
2018-10-21 02:13:41,661: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : parent stops child 24496 -> None [agent_0]
2018-10-21 02:13:41,679: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : publish "terminate" cmd
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister idler agent_0.idler._check_units_cb
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : child calls stop()
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregistered idler agent_0.idler._check_units_cb
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister output AGENT_STAGING_INPUT_PENDING
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: WARNING : input AGENT_STAGING_INPUT_PENDING is not registered
2018-10-21 02:13:41,680: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister idler agent_0.idler._agent_command_cb
2018-10-21 02:13:41,681: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:41,681: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregistered idler agent_0.idler._agent_command_cb
2018-10-21 02:13:41,681: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : stop lrms <radical.pilot.agent.rm.torque.Torque object at 0x7f816b398a90>
2018-10-21 02:13:41,681: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : no LRMS shutdown hook defined for LaunchMethod APRUN
2018-10-21 02:13:41,681: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : lrms shutdown hook succeeded (APRUN)
2018-10-21 02:13:41,682: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : stopped lrms <radical.pilot.agent.rm.torque.Torque object at 0x7f816b398a90>
2018-10-21 02:13:41,682: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : final state: FAILED (None)
2018-10-21 02:13:41,682: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : ru_finalize_common()
2018-10-21 02:13:41,682: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister publisher log_pubsub
2018-10-21 02:13:41,695: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : unregistered publisher log_pubsub
2018-10-21 02:13:41,695: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister publisher state_pubsub
2018-10-21 02:13:41,708: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : unregistered publisher state_pubsub
2018-10-21 02:13:41,708: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : TERM : agent_0 unregister publisher control_pubsub
2018-10-21 02:13:41,717: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : unregistered publisher control_pubsub
2018-10-21 02:13:41,717: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : agent_0 close prof
2018-10-21 02:13:42,813: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : update.0 stop called
2018-10-21 02:13:42,930: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : child calls stop()
2018-10-21 02:13:42,930: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:42,931: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : signal stop for agent_0.idler._check_units_cb - do not join
2018-10-21 02:13:42,945: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-21 02:13:42,945: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : ru_finalize_child (NOOP)
2018-10-21 02:13:42,946: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : ru_finalize_common (NOOP)
2018-10-21 02:13:42,946: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : put message: [agent_0.idler._check_units_cb.thread] terminating
2018-10-21 02:13:42,960: agent_0 : MainProcess : agent_0.idler._agent_command_cb: DEBUG : ru_finalize_child (NOOP)
2018-10-21 02:13:42,960: agent_0 : MainProcess : agent_0.idler._agent_command_cb: DEBUG : ru_finalize_common (NOOP)
2018-10-21 02:13:42,960: agent_0 : MainProcess : agent_0.idler._agent_command_cb: INFO : put message: [agent_0.idler._agent_command_cb.thread] terminating
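To summarize an agent log like the excerpt above without reading it line by line, a small script can count the empty pulls, collect the warnings, and pick out the final state. This is only a sketch: the line format is inferred from the excerpt shown here, and `summarize_agent_log` is a hypothetical helper, not an RP utility.

```python
import re
from collections import Counter

# Line layout assumed from the excerpt:
#   <timestamp>: <source fields> : <LEVEL> : <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}): "
    r"(?P<src>.*?)\s*: (?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s*: "
    r"(?P<msg>.*)$"
)


def summarize_agent_log(lines):
    """Summarize an iterable of agent log lines (hypothetical helper)."""
    levels = Counter()
    empty_pulls = 0
    warnings = []
    final_state = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a log line
        level, msg = m.group("level"), m.group("msg")
        levels[level] += 1
        if re.fullmatch(r"units pulled:\s*0", msg):
            empty_pulls += 1
        if level in ("WARNING", "ERROR", "CRITICAL"):
            warnings.append((m.group("ts"), msg))
        state = re.match(r"final state: (\w+)", msg)
        if state:
            final_state = state.group(1)
    return {
        "levels": levels,
        "empty_pulls": empty_pulls,
        "warnings": warnings,
        "final_state": final_state,
    }
```

Calling it as `summarize_agent_log(open("agent_0.log"))` returns a dictionary. The first entry in `warnings` gives the timestamp at which the agent started to report a problem.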
The pilot is not able to pull any units; this happened again after several submissions without issues. For documentation, see the same issue described at the bottom of https://github.com/radical-collaboration/extasy-grlsd/issues/98