Closed jdakka closed 6 years ago
Hi Jumana,
there are actually some checks in place, and are running whenever an analytics session is created. It should warn when profiles are not correctly closed, when states appear out of order, etc. The set of tests is limited though, and definitely would benefit from some attention, but the idea is indeed that analytics does run some sanity checks on the data by default.
As for the attached session: at first glance, that actually looks good. I'll try to reproduce your analysis script, to dig a little deeper into how the durations which don't seem to add up are obtained.
I am not sure if I am doing the right thing here, but this is what I get:
# ./jumana.py
rp : 108.02
execute: 104.11
partial: 34.70
partial: 47.43
partial: 50.20
partial: 38.01
and those values seem consistent? The 'partial' durations don't add up to the total duration - but that would just indicate that those durations overlap, i.e. that units of those subsets did execute concurrently. Is that not the expected behavior then?
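To make that point concrete, here is a toy illustration (synthetic numbers, plain Python - not the radical.analytics API): when two subsets of units execute concurrently, the per-subset durations can sum to more than the overall duration, because the same wallclock time is counted once per subset.

```python
# Toy illustration: per-subset durations computed from start/end
# timestamps sum to more than the overall duration when the subsets
# execute concurrently.

def duration(intervals):
    """Length of the hull [min(start), max(end)] over a set of (start, end) intervals."""
    return max(e for _, e in intervals) - min(s for s, _ in intervals)

# two subsets of units whose execution windows overlap in time
subset_a = [(0.0, 10.0), (2.0, 12.0)]   # spans [0, 12]
subset_b = [(8.0, 20.0), (9.0, 18.0)]   # spans [8, 20]

total = duration(subset_a + subset_b)            # hull over all units
parts = duration(subset_a) + duration(subset_b)  # sum of subset durations

print(total, parts)  # 20.0 24.0 -- parts > total because the windows overlap
```

The interval [8, 12] is covered by both subsets, so it is counted twice in `parts` but only once in `total`.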
I used the following script; please help me out and check whether it correctly reflects the intent of the linked notebook:
#!/usr/bin/env python

import pprint
import radical.analytics as ra
import radical.pilot as rp
import radical.utils as ru

def get_rp_info(pipeline):
    src = 'null_workload_128_replicas_trial_1/rp.session.two.jdakka.017492.0004'
    session = ra.Session(stype='radical.pilot', src=src)
    units = session.filter(etype='unit', inplace=False)
    pilots = session.filter(etype='pilot', inplace=False)
    exec_dur = units.duration([rp.AGENT_EXECUTING, rp.AGENT_STAGING_OUTPUT_PENDING])
    rp_dur = units.duration([rp.UMGR_SCHEDULING_PENDING, rp.DONE])
    partial_dur = []
    sorted_units = sorted(units.list('uid'))
    # print(sorted_units)
    for x in range(0, pipeline * 4, pipeline):
        # print(x)
        # print(x + pipeline)
        # print(sorted_units[x:x + pipeline])
        subset = units.filter(uid=sorted_units[x:x + pipeline], inplace=False)
        part = subset.duration([rp.AGENT_EXECUTING,
                                rp.AGENT_STAGING_OUTPUT_PENDING])
        partial_dur.append(part)
    return rp_dur, partial_dur, exec_dur

rp_dur, partial_dur, exec_dur = get_rp_info(128)
print('rp : %10.2f' % rp_dur)
print('execute: %10.2f' % exec_dur)
for part in partial_dur:
    print('partial: %10.2f' % part)
Hey Andre, just to be sure: could you specify the stack that you used? Also, I think the sum of the partials needs to be less than the 'rp' duration, since there are barriers between each partial segment (@jdakka, correct me if I am wrong).
@andre-merzky @vivek-bala correct, the sum of the task_exec_dur should be less than the rp dur. The stack was:
radical.analytics : v0.45.2-86-g99480a1@rc-v0.46.3
radical.pilot : 0.47-v0.46.2-183-g2c92e51@rc-v0.46.3
radical.utils : 0.47-v0.46-73-gd580ab1@rc-v0.46.3
saga : 0.47-v0.46-32-ga2f9ded@HEAD-detached-at-origin-rc-v0.46.3
But the ranges do overlap:
$ ./jumana.py
rp : 108.02
execute: 104.11
partial: 34.70 (15296.3 - 15331.0)
partial: 47.43 (15311.3 - 15358.7)
partial: 50.20 (15334.8 - 15385.0)
partial: 38.01 (15362.4 - 15400.4)
script:
#!/usr/bin/env python

import radical.analytics as ra
import radical.pilot as rp
import radical.utils as ru

def get_rp_info(pipeline):
    src = 'null_workload_128_replicas_trial_1/rp.session.two.jdakka.017492.0004'
    session = ra.Session(stype='radical.pilot', src=src)
    units = session.filter(etype='unit', inplace=False)
    pilots = session.filter(etype='pilot', inplace=False)
    exec_dur = units.duration([rp.AGENT_EXECUTING, rp.AGENT_STAGING_OUTPUT_PENDING])
    rp_dur = units.duration([rp.UMGR_SCHEDULING_PENDING, rp.DONE])
    partial_dur = []
    sorted_units = sorted(units.list('uid'))
    for x in range(0, pipeline * 4, pipeline):
        subset = units.filter(uid=sorted_units[x:x + pipeline], inplace=False)
        part = subset.duration([rp.AGENT_EXECUTING,
                                rp.AGENT_STAGING_OUTPUT_PENDING])
        starts = sorted(subset.timestamps([rp.AGENT_EXECUTING]))
        ends = sorted(subset.timestamps([rp.AGENT_STAGING_OUTPUT_PENDING]))
        partial_dur.append([part, starts[0], ends[-1]])
    return rp_dur, partial_dur, exec_dur

rp_dur, partial_dur, exec_dur = get_rp_info(128)
print('rp : %10.2f' % rp_dur)
print('execute: %10.2f' % exec_dur)
for part, start, end in partial_dur:
    print('partial: %10.2f (%5.1f - %5.1f)' % (part, start, end))
So either the unit selection in the for loop does not correctly distinguish pipelines, or your pipelines do overlap - I can't judge which one it is. Or the timestamps are off, of course - do you see any indication that that is the case?
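As a sanity check on the printed ranges above (plain Python on the printed numbers, not part of the analysis script), merging the four 'partial' windows shows that they chain-overlap into a single contiguous span, and that the span's length matches the overall 'execute' duration:

```python
# The four (start, end) windows printed by the script above.
intervals = [(15296.3, 15331.0),
             (15311.3, 15358.7),
             (15334.8, 15385.0),
             (15362.4, 15400.4)]

def merge(intervals):
    """Merge overlapping intervals into disjoint spans."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)   # extend the current span
        else:
            merged.append([s, e])                   # start a new span
    return merged

spans = merge(intervals)
covered = sum(e - s for s, e in spans)

print(len(spans), round(covered, 1))  # 1 104.1 -- matches 'execute: 104.11'
```

All four windows collapse into one span, so the partials cover the execution phase end to end while overlapping pairwise - consistent with concurrently running pipelines.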
# Generate pipelines
for replica in range(replicas):
    for ld in lambdas:
        p = Pipeline()
        for step in workflow:
            s, t = Stage(), NamdTask(name=step, cores=cores_per_pipeline)
            # t.arguments = ['replica_{}/lambda_{}/{}.conf'.format(replica, ld, step), '&>', 'replica_{}/lambda_{}/{}.log'.format(replica, ld, step)]
            s.add_tasks(t)
            p.add_stages(s)
        pipelines.add(p)
It is indeed an ensemble of pipelines. It makes sense that the timings overlap. Jumana, do you agree? I hope I'm looking at the correct script (null_workload_128_replicas_trial_1.zip/ties_barrier_gpuStack.py).
Yes, this is a PoE; that makes more sense. I thought the profiles would distinguish between the two patterns. No problem, I'll fix it. Thanks both for looking into this!
It's EoP, not PoE :) The profiles do distinguish the patterns: in an EoP, the durations will overlap and their sum will be more than the total runtime, whereas in a PoE the durations will not overlap and their sum will be less than the runtime.
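The distinction can be sketched with toy timelines (synthetic numbers, not from the session): in the EoP case the pipelines run concurrently, so their windows overlap; in the PoE case barriers serialize the stages and leave gaps between the windows.

```python
# EoP (Ensemble of Pipelines): pipelines run concurrently, so their
# execution windows overlap and the summed durations exceed the runtime.
eop = [(0, 10), (2, 12), (4, 14)]

# PoE (Pipeline of Ensembles): stages run one after another with
# barriers in between, so windows are disjoint and their sum stays
# below the runtime.
poe = [(0, 10), (12, 22), (24, 34)]

def runtime(windows):
    """Wallclock span from the first start to the last end."""
    return max(e for _, e in windows) - min(s for s, _ in windows)

def summed(windows):
    """Sum of the individual window lengths."""
    return sum(e - s for s, e in windows)

print('EoP: sum %d > runtime %d' % (summed(eop), runtime(eop)))  # sum 30 > runtime 14
print('PoE: sum %d < runtime %d' % (summed(poe), runtime(poe)))  # sum 30 < runtime 34
```

The same three 10-unit tasks give a sum above the runtime in one pattern and below it in the other, which is exactly the signature described above.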
Thanks Andre.
@vivek-bala sorry, you're right - I switched it to PoE afterwards, and that registered to me as the default.
I'm running analytics to measure RP overhead and total execution duration (see cell 7).
I noticed that for the last sessions I get execution durations that are greater than the rp durations.
Could you check the "health" of the profiles for one of these sessions in the attachment? Btw, is there a convenient way to check that the profiles look as they should?
null_workload_128_replicas_trial_1.zip