Closed: mtitov closed this 10 months ago
@mtitov: can you please give this version a try? Thanks!
@andre-merzky it runs locally, but it seems to behave the same as it did on Frontier: it doesn't reach `main(..`
@andre-merzky removing the `ru.daemonize(` call in `bin/radical-pilot-agent_n` made the sub-agents start:

```python
main(sid, reg_addr, uid)
# ru.daemonize(main=main, args=[sid, reg_addr, uid],
#              stdout='%s.out' % uid, stderr='%s.err' % uid)
```
but when I tested with 4 sub-agents, only the Executor of the 1st sub-agent processed all the tasks; for the others, the last log message was:

```
1699296639.354 : agent_executing.0001 : 41718 : 140731773671168 : WARNING : === hb agent_executing.0001 inval rp.session.login13.matitov.019667.0002
```
The `hb inval` message is a red herring; we should actually remove it, I guess. It only means that the heartbeat callback received a heartbeat it does not care about...
EDITED: no idea yet why the other executors don't kick in. Would you mind attaching the pilot sandbox, please?
@andre-merzky added the corresponding sandboxes

> no idea yet why the other executors don't kick in

To me it looks like the sub-agent executor (`0000`), which was ready first, collected all available tasks from the queue. All sub-agent executors came up after the agent-0 components, and only after all tasks had been pushed into the executing queue (even though the scheduler pushed them task by task).
The state transitions to `AGENT_EXECUTING` all have the same timestamp:

```
$ cat *prof | grep 'advance' | grep ',AGENT_EXECUTING,' | cut -f 1 -d , | sort | uniq -c
    560 1699296638.3370194
```
so the tasks do indeed seem to arrive in one bulk. Either use more tasks so that the scheduler can't place them all at once, or set a smaller bulk size for the executing queue (like 64 or so). Would you give that a try?
my latest run with 2 active nodes, 4 sub-agents (each containing an executor) and 1200 tasks (multiple generations of tasks) has the following launch distribution:

```
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0000.log | wc -l
306
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0001.log | wc -l
210
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0002.log | wc -l
212
[matitov@login08.frontier pilot.0000]$ grep "Launching task task" agent_executing.0003.log | wc -l
184
```
The first bulk of tasks (112 tasks: 1 core per task, 2 nodes with 56 cores each) was collected by the first executor; after that, each executor was able to collect each following bulk of 1-3 tasks.

p.s. not all tasks finished, but that could be affected by the SRUN limitation on Frontier; I didn't investigate that part since I was checking the sub-agents only
Great! We may still want to consider changing the bulk size though, otherwise we won't get executor scaling for the first task generation. What do you think?
@andre-merzky yeah, that would be good to have, but do we want it in this PR or in a separate one? And would it be a feature of the queue or of the component?
Setting the queue property would be best; I don't think the component bulk size would make any difference in this context. And indeed, I would not mind including it in this PR, as it also relates to the sub-agent problems discussed here.
ah, ok, then `bulk_size` would make sense only if we use sub-agents, and only for the corresponding queue, which in our case is the executing queue. Should I then add `"bulk_size": 50` for `agent_executing_queue` in `agent_default_sa.json`? (since that is our default example of using a sub-agent with the executor in it)
I would probably pick 64 :-) But otherwise yes, it makes sense only for the sa config.
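For reference, the change under discussion would amount to something like the following fragment in `agent_default_sa.json`; the field layout here is assumed from the discussion, not copied from the actual config file:

```json
{
    "agent_executing_queue": {
        "bulk_size": 64
    }
}
```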
Codecov Report
Attention: 18 lines in your changes are missing coverage. Please review.

Additional details and impacted files:
```diff
@@            Coverage Diff             @@
##            devel    #3065      +/-   ##
==========================================
+ Coverage   43.97%   44.00%    +0.02%
==========================================
  Files          96       96
  Lines       10578    10569        -9
==========================================
- Hits         4652     4651        -1
+ Misses       5926     5918        -8
```

:umbrella: View full report in Codecov by Sentry.