radical-cybertools / radical.analytics

Analytics for RADICAL-Cybertools

Checking the "health" of .prof files #59

Closed: jdakka closed this issue 6 years ago

jdakka commented 6 years ago

I'm running analytics to measure RP overhead and total execution duration (see cell 7 of the linked notebook).

I noticed that for the last few sessions I get execution durations that are greater than the RP durations.

Could you check the "health" of the .prof files for one of these sessions (attached)? By the way, is there a convenient way to verify that the profiles look as they should?

null_workload_128_replicas_trial_1.zip

andre-merzky commented 6 years ago

Hi Jumana,

there are actually some checks in place, and they run whenever an analytics session is created. They should warn when profiles are not correctly closed, when states appear out of order, etc. The set of tests is limited and would definitely benefit from some attention, but the idea is indeed that analytics runs some sanity checks on the data by default.

As for the attached session: at first glance it actually looks good. I'll try to reproduce your analysis script and dig a little deeper into how the durations that don't seem to add up are obtained.
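For reference, a minimal sketch of loading the attached session so that those checks fire (warnings get printed during construction); the explicit consistency() call below is only an assumption about the RA API and is left commented out:

#!/usr/bin/env python

# Load the session; the built-in sanity checks run during construction and
# warn about problems such as unclosed profiles or out-of-order states.
import radical.analytics as ra

src     = 'null_workload_128_replicas_trial_1/rp.session.two.jdakka.017492.0004'
session = ra.Session(stype='radical.pilot', src=src)

# Assumed API (not confirmed in this thread): explicitly re-run the checks.
# session.consistency(mode=['state_model', 'event_model', 'timestamps'])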

andre-merzky commented 6 years ago

I am not sure if I am doing the right thing here, but this is what I get:

# ./jumana.py 
rp     :     108.02
execute:     104.11
partial:      34.70
partial:      47.43
partial:      50.20
partial:      38.01

and those values seem consistent? The 'partial' durations don't add up to the total duration, but that would just indicate that those durations overlap, i.e. that units of those subsets executed concurrently. Is that not the expected behavior then?
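To make the overlap point concrete, here is a small standalone sketch with made-up numbers (plain Python, not tied to the RA session): when execution windows overlap, the sum of their durations exceeds the span they jointly cover.

#!/usr/bin/env python

# Standalone illustration: one (start, end) execution window per unit subset.
# The numbers are made up and only roughly mimic the values printed above.
windows = [(0.0, 35.0), (15.0, 62.0), (38.0, 88.0), (66.0, 104.0)]

sum_of_durations = sum(end - start for start, end in windows)
total_span       = max(end for _, end in windows) - min(start for start, _ in windows)

print('sum of partial durations: %6.2f' % sum_of_durations)   # 170.00
print('total span covered      : %6.2f' % total_span)         # 104.00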

I used the following script; please help me out and check whether it correctly reflects the intent of the linked notebook:

#!/usr/bin/env python 

import pprint

import radical.analytics as ra
import radical.pilot     as rp
import radical.utils     as ru

def get_rp_info(pipeline):

    src      ='null_workload_128_replicas_trial_1/rp.session.two.jdakka.017492.0004'
    session  = ra.Session(stype='radical.pilot', src=src)
    units    = session.filter(etype='unit',  inplace=False)
    pilots   = session.filter(etype='pilot', inplace=False)
    exec_dur = units.duration([rp.AGENT_EXECUTING, rp.AGENT_STAGING_OUTPUT_PENDING])
    rp_dur   = units.duration([rp.UMGR_SCHEDULING_PENDING, rp.DONE])

    partial_dur  = []
    sorted_units = sorted(units.list('uid'))
  # print sorted_units

    for x in range(0, pipeline*4, pipeline):
      # print x
      # print x + pipeline
      # print sorted_units[x:x+pipeline]
        subset = units.filter(uid=sorted_units[x:x + pipeline], inplace=False)
        part   = subset.duration([rp.AGENT_EXECUTING, 
                                  rp.AGENT_STAGING_OUTPUT_PENDING])
        partial_dur.append(part)

    return rp_dur, partial_dur, exec_dur

rp_dur, partial_dur, exec_dur = get_rp_info(128)

print('rp     : %10.2f' % rp_dur)
print('execute: %10.2f' % exec_dur)
for part in partial_dur:
    print('partial: %10.2f' % part)

vivek-bala commented 6 years ago

Hey Andre, just to be sure, could you specify the stack that you used? Also, I think the sum of the partials needs to be less than the 'rp' duration, since there are barriers between each partial segment (@jdakka, correct me if I am wrong).

jdakka commented 6 years ago

@andre-merzky @vivek-bala correct, the sum of the task_exec_dur values should be less than the rp duration. The stack was:

radical.analytics    : v0.45.2-86-g99480a1@rc-v0.46.3
radical.pilot        : 0.47-v0.46.2-183-g2c92e51@rc-v0.46.3
radical.utils        : 0.47-v0.46-73-gd580ab1@rc-v0.46.3
saga                 : 0.47-v0.46-32-ga2f9ded@HEAD-detached-at-origin-rc-v0.46.3

andre-merzky commented 6 years ago

But the ranges do overlap:

$ ./jumana.py 
rp     :     108.02
execute:     104.11
partial:      34.70 (15296.3 - 15331.0)
partial:      47.43 (15311.3 - 15358.7)
partial:      50.20 (15334.8 - 15385.0)
partial:      38.01 (15362.4 - 15400.4)
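The overlap can also be confirmed from the printed start/end values alone; a small standalone check (plain Python, using only the numbers above):

#!/usr/bin/env python

# Pairwise overlap check on the execution windows printed above (start, end).
windows = [(15296.3, 15331.0),
           (15311.3, 15358.7),
           (15334.8, 15385.0),
           (15362.4, 15400.4)]

for i in range(len(windows)):
    for j in range(i + 1, len(windows)):
        start = max(windows[i][0], windows[j][0])
        end   = min(windows[i][1], windows[j][1])
        if end > start:
            print('windows %d and %d overlap by %5.1f s' % (i, j, end - start))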

script:

#!/usr/bin/env python 

import radical.analytics as ra
import radical.pilot     as rp
import radical.utils     as ru

def get_rp_info(pipeline):

    src      ='null_workload_128_replicas_trial_1/rp.session.two.jdakka.017492.0004'
    session  = ra.Session(stype='radical.pilot', src=src)
    units    = session.filter(etype='unit',  inplace=False)
    pilots   = session.filter(etype='pilot', inplace=False)
    exec_dur = units.duration([rp.AGENT_EXECUTING, rp.AGENT_STAGING_OUTPUT_PENDING])
    rp_dur   = units.duration([rp.UMGR_SCHEDULING_PENDING, rp.DONE])

    partial_dur  = []
    sorted_units = sorted(units.list('uid'))

    for x in range(0, pipeline*4, pipeline):
        subset = units.filter(uid=sorted_units[x:x + pipeline], inplace=False)
        part   = subset.duration([rp.AGENT_EXECUTING, 
                                  rp.AGENT_STAGING_OUTPUT_PENDING])
        starts = sorted(subset.timestamps([rp.AGENT_EXECUTING]))
        ends   = sorted(subset.timestamps([rp.AGENT_STAGING_OUTPUT_PENDING]))
        partial_dur.append([part, starts[0], ends[-1]])

    return rp_dur, partial_dur, exec_dur

rp_dur, partial_dur, exec_dur = get_rp_info(128)

print('rp     : %10.2f' % rp_dur)
print('execute: %10.2f' % exec_dur)
for part, start, end in partial_dur:
    print('partial: %10.2f (%5.1f - %5.1f)' % (part, start, end))

So either the unit selection in the for loop does not correctly distinguish pipelines, or your pipelines do overlap; I can't judge which one it is. Or the timestamps are off, of course. Do you see any indication that that is the case?

vivek-bala commented 6 years ago

    # Generate pipelines
    for replica in range(replicas):
        for ld in lambdas:
            p = Pipeline()

            for step in workflow:
                s, t = Stage(), NamdTask(name=step, cores=cores_per_pipeline)
              # t.arguments = ['replica_{}/lambda_{}/{}.conf'.format(replica, ld, step), '&>', 'replica_{}/lambda_{}/{}.log'.format(replica, ld, step)]
                s.add_tasks(t)
                p.add_stages(s)

            pipelines.add(p)

It is indeed an ensemble of pipelines. It makes sense that the timings overlap. Jumana, do you agree? I hope I'm looking at the correct script (null_workload_128_replicas_trial_1.zip/ties_barrier_gpuStack.py).

jdakka commented 6 years ago

Yes, this is a PoE, so that makes more sense; I thought the profiles would distinguish between the two patterns. No problem, I'll fix it. Thanks to both of you for looking into this!

vivek-bala commented 6 years ago

It's EoP, not PoE :) The profiles do distinguish the patterns. In EoP the durations will overlap and their sum will be more than the total runtime, whereas in PoE the durations will not overlap and their sum will be less than the runtime.
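To illustrate the distinction with toy numbers (a standalone sketch, not derived from the session data):

#!/usr/bin/env python

# Toy numbers only: compare per-subset execution windows (start, end) for the
# two patterns.  EoP lets pipelines advance independently; PoE puts a barrier
# after each stage, so the windows cannot overlap.
def report(name, windows):
    dur_sum = sum(e - s for s, e in windows)
    runtime = max(e for _, e in windows) - min(s for s, _ in windows)
    print('%s: sum of durations = %5.1f, total runtime = %5.1f' % (name, dur_sum, runtime))

report('EoP', [(0, 40), (20, 65), (45, 90), (70, 100)])    # sum 160 > runtime 100
report('PoE', [(0, 35), (40, 80), (85, 120), (125, 160)])  # sum 145 < runtime 160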

Thanks Andre.

jdakka commented 6 years ago

@vivek-bala sorry, you're right. I switched it to PoE afterwards, and that registered in my mind as the default.