radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Fix/subagents #3065

Closed mtitov closed 10 months ago

codecov[bot] commented 11 months ago

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (89df66b) 43.97% compared to head (a74077f) 44.00%.

| Files | Patch % | Lines |
|---|---|---|
| src/radical/pilot/agent/agent_n.py | 9.09% | 10 Missing :warning: |
| src/radical/pilot/utils/component_manager.py | 0.00% | 4 Missing :warning: |
| src/radical/pilot/session.py | 0.00% | 3 Missing :warning: |
| src/radical/pilot/agent/agent_0.py | 66.66% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##            devel    #3065      +/-   ##
==========================================
+ Coverage   43.97%   44.00%   +0.02%
==========================================
  Files          96       96
  Lines       10578    10569       -9
==========================================
- Hits         4652     4651       -1
+ Misses       5926     5918       -8
```

:umbrella: View full report in Codecov by Sentry.

andre-merzky commented 11 months ago

@mtitov : can you please give this version a try? Thanks!

mtitov commented 11 months ago

@andre-merzky it runs locally, but on Frontier it seems to behave the same as before:

this starts https://github.com/radical-cybertools/radical.pilot/blob/56b9c3ce9f99b4729ddfa779f6446e16dd6c93de/bin/radical-pilot-agent_n#L65-L66

but it never reaches `main(..`

frontier_client.tar.gz frontier_agent.tar.gz

mtitov commented 10 months ago

@andre-merzky removing the `ru.daemonize(` call in `bin/radical-pilot-agent_n` made the sub-agents start

```python
main(sid, reg_addr, uid)

# ru.daemonize(main=main, args=[sid, reg_addr, uid],
#              stdout='%s.out' % uid, stderr='%s.err' % uid)
```

but when I tested with 4 sub-agents, only the executor of the 1st sub-agent processed all the tasks; for the others this was the last log message:

```
1699296639.354 : agent_executing.0001 : 41718 : 140731773671168 : WARNING  : === hb agent_executing.0001 inval rp.session.login13.matitov.019667.0002
```

```
$ grep " to heartbeat:" ./agent_executing.000*
./agent_executing.0000.log:1699296638.248 : agent_executing.0000 : 29115 : 140737179933568 : INFO : connect pub to heartbeat: tcp://10.128.73.60:46483
./agent_executing.0000.log:1699296638.349 : agent_executing.0000 : 29115 : 140737179933568 : INFO : connect sub to heartbeat: tcp://10.128.73.60:35813
./agent_executing.0001.log:1699296638.251 : agent_executing.0001 : 41718 : 140737179933568 : INFO : connect pub to heartbeat: tcp://10.128.73.60:46483
./agent_executing.0001.log:1699296638.352 : agent_executing.0001 : 41718 : 140737179933568 : INFO : connect sub to heartbeat: tcp://10.128.73.60:35813
./agent_executing.0002.log:1699296638.245 : agent_executing.0002 : 74231 : 140737179933568 : INFO : connect pub to heartbeat: tcp://10.128.73.60:46483
./agent_executing.0002.log:1699296638.346 : agent_executing.0002 : 74231 : 140737179933568 : INFO : connect sub to heartbeat: tcp://10.128.73.60:35813
./agent_executing.0003.log:1699296638.245 : agent_executing.0003 : 46622 : 140737179933568 : INFO : connect pub to heartbeat: tcp://10.128.73.60:46483
./agent_executing.0003.log:1699296638.346 : agent_executing.0003 : 46622 : 140737179933568 : INFO : connect sub to heartbeat: tcp://10.128.73.60:35813
```

and here is the full log for one of the executors that didn't process any tasks:

```
$ cat ./agent_executing.0001.log
1699296635.567 : agent_executing.0001 : 41718 : 140735520786176 : INFO : connect pub to state_pubsub: tcp://10.128.73.60:37099
1699296635.669 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : registered publisher for state_pubsub
1699296635.671 : agent_executing.0001 : 41718 : 140735520786176 : INFO : connect pub to control_pubsub: tcp://10.128.73.60:41159
1699296635.771 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : registered publisher for control_pubsub
1699296635.772 : agent_executing.0001 : 41718 : 140735520786176 : INFO : connect sub to control_pubsub: tcp://10.128.73.60:39019
1699296635.880 : agent_executing.0001 : 41718 : 140735520786176 : INFO : Configuring ResourceManager Slurm
1699296635.881 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : RM init from registry
1699296635.882 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : prepare lm SRUN
1699296635.894 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : get input ep: agent_executing_queue
1699296637.015 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : registered input agent_executing.0001.work.AGENT_EXECUTING_PENDING [agent_executing_queue] [None]
1699296637.015 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : agent_executing.0001 register input AGENT_EXECUTING_PENDING: agent_executing.0001.work.AGENT_EXECUTING_PENDING
1699296637.015 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : registered worker work [AGENT_EXECUTING_PENDING]
1699296637.015 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : agent_executing.0001 register output AGENT_STAGING_OUTPUT_PENDING:agent_staging_output_queue
1699296638.142 : agent_executing.0001 : 41718 : 140735520786176 : INFO : connect pub to agent_unschedule_pubsub: tcp://10.128.73.60:45143
1699296638.242 : agent_executing.0001 : 41718 : 140735520786176 : DEBUG : registered publisher for agent_unschedule_pubsub
1699296638.251 : agent_executing.0001 : 41718 : 140737179933568 : INFO : connect pub to heartbeat: tcp://10.128.73.60:46483
1699296638.352 : agent_executing.0001 : 41718 : 140737179933568 : DEBUG : hb agent_executing.0001 create
1699296638.352 : agent_executing.0001 : 41718 : 140737179933568 : DEBUG : hb agent_executing.0001 start
1699296638.352 : agent_executing.0001 : 41718 : 140737179933568 : INFO : connect sub to heartbeat: tcp://10.128.73.60:35813
1699296639.354 : agent_executing.0001 : 41718 : 140731773671168 : WARNING : === hb agent_executing.0001 inval rp.session.login13.matitov.019667.0002
```
andre-merzky commented 10 months ago

the hb inval message is a red herring - we should probably remove it altogether. It is only meant to indicate that the hb callback received a heartbeat it does not care about...

EDITED: no idea yet why the other executors don't kick in. Would you mind attaching the pilot sandbox, please?

mtitov commented 10 months ago

@andre-merzky added the corresponding sandboxes

> no idea yet why the other executors don't kick in

to me it seems that the sub-agent executor (0000), which was ready first, collected all available tasks from the queue. All sub-agent executors came up only after the agent-0 components, and only after all tasks had been pushed into the executing queue (even though the scheduler pushed them task by task)
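This first-come-takes-all pattern can be illustrated with a toy model (plain Python, not RP code; `distribute` and its parameters are made up for illustration): when a ready consumer may drain the entire available bulk, whoever connects first starves the rest, while a capped bulk size spreads the work.

```python
from collections import deque

def distribute(n_tasks, n_consumers, bulk_size):
    """Toy model: ready consumers take turns pulling up to
    `bulk_size` tasks each from a shared queue (not an RP API)."""
    queue  = deque(range(n_tasks))
    counts = [0] * n_consumers
    turn   = 0
    while queue:
        consumer = turn % n_consumers
        for _ in range(min(bulk_size, len(queue))):
            queue.popleft()
            counts[consumer] += 1
        turn += 1
    return counts

# unbounded bulk: the first ready consumer drains everything
print(distribute(560, 4, bulk_size=560))   # [560, 0, 0, 0]

# capped bulk: the work is spread across all consumers
print(distribute(560, 4, bulk_size=64))    # [176, 128, 128, 128]
```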

frontier_4sa_pilot.tar.gz frontier_4sa_client.tar.gz

andre-merzky commented 10 months ago

> @andre-merzky added the corresponding sandboxes
>
> > no idea yet why the other executors don't kick in
>
> to me it seems that the sub-agent executor (0000), which was ready first, collected all available tasks from the queue. All sub-agent executors came up only after the agent-0 components, and only after all tasks had been pushed into the executing queue (even though the scheduler pushed them task by task)
>
> frontier_4sa_pilot.tar.gz frontier_4sa_client.tar.gz

The state transitions to AGENT_EXECUTING all have the same timestamp:

```
$ cat *prof | grep 'advance' | grep ',AGENT_EXECUTING,' | cut -f 1 -d , | sort | uniq -c
    560 1699296638.3370194
```

so the tasks do indeed seem to arrive in one bulk. Either use more tasks, so that the scheduler cannot place them all at once, or set a smaller bulk size for the executing queue (say, 64). Would you give that a try?

mtitov commented 10 months ago

my latest run with 2 active nodes, 4 sub-agents (each containing an executor) and 1200 tasks (multiple generations of tasks) shows the following launch distribution:

```
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0000.log | wc -l
306
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0001.log | wc -l
210
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0002.log | wc -l
212
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0003.log | wc -l
184
```

The first bulk of tasks (112 tasks: 1 core per task on 2 nodes with 56 cores each) was collected by the first executor; after that, each executor was able to collect the subsequent bulks of 1-3 tasks each.
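As a quick sanity check on that bulk size (simple arithmetic, assuming 1 core per task as stated): the 112-task first bulk exactly matches the number of concurrently available cores.

```python
nodes          = 2    # active compute nodes in the run
cores_per_node = 56   # cores used per node
cores_per_task = 1    # each task requests a single core

# tasks that can start concurrently == size of the first observed bulk
first_bulk = (nodes * cores_per_node) // cores_per_task
print(first_bulk)  # 112
```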

p.s. not all tasks finished, but that could be caused by the SRUN limitation on Frontier; I didn't investigate that part, since I was only checking the sub-agents

andre-merzky commented 10 months ago

Great! We may still want to consider changing the bulk size though, otherwise we won't get executor scaling for the first task generation. What do you think?

mtitov commented 10 months ago

@andre-merzky yeah, that would be good to have, but do we want it in this PR or in a separate one? And would it be a feature of the queue or of the component?

andre-merzky commented 10 months ago

Setting the queue property would be best - I don't think the component bulk size would make any difference in this context. And indeed, I would not mind including it in this PR, as it also relates to the sub-agent problems discussed here.

mtitov commented 10 months ago

ah, ok, then `bulk_size` makes sense only when we use sub-agents, and only for the corresponding queue - in our case the executing queue. Should I then add `"bulk_size": 50` for `agent_executing_queue` in `agent_default_sa.json`? (since that is our default example of using a sub-agent with an executor in it)

andre-merzky commented 10 months ago

> ah, ok, then `bulk_size` makes sense only when we use sub-agents, and only for the corresponding queue - in our case the executing queue. Should I then add `"bulk_size": 50` for `agent_executing_queue` in `agent_default_sa.json`? (since that is our default example of using a sub-agent with an executor in it)

I would probably pick 64 :-) But otherwise yes, it only makes sense for the sa config.
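For reference, the discussed change might look roughly like this in `agent_default_sa.json` - a sketch only: the nesting under a `bridges` section is an assumption on my part, and the actual config schema may differ.

```json
{
    "bridges": {
        "agent_executing_queue": {
            "bulk_size": 64
        }
    }
}
```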