radical-cybertools / radical.owms

Tiered Resource OverlaY

Timings: Events for pilot are partial/redundant #76

Open mturilli opened 10 years ago

mturilli commented 10 years ago

Here is the annotated JSON from the relevant portion of a session.

I believe this refers to the creation of the pilot description within TROY. Correct?

[
    {
        "tags" : [],
        "event" : "timed_create",
        "name" : "p.0001",
        "time" : ISODate("2014-03-07T04:12:14.159Z")
    },

Start of the queuing time of the job on the remote DCI? I would join 'sagapilot' and 'submission_time' and use them as the value of "name". Something like: "name": "sagapilot-submission_time".

    {
        "tags" : [ 
            "sagapilot", 
            "submission_time"
        ],
        "event" : "monitor",
        "name" : "submission",
        "time" : ISODate("2014-03-07T04:12:15.318Z")
    },

End of the queuing time and beginning of the job runtime? I would join 'sagapilot' and 'start_time' and use them as the value of "name". Something like: "name": "sagapilot-start_time".

    {
        "tags" : [ 
            "sagapilot", 
            "start_time"
        ],
        "event" : "monitor",
        "name" : "start",
        "time" : ISODate("2014-03-07T06:23:41.155Z")
    },

The following have no timings. Luckily, they should be irrelevant for any practical use within TROY and should probably be removed.

    {
        "tags" : [ 
            "sagapilot", 
            "Created agent sandbox 'sftp://stampede.tacc.utexas.edu/home1/02855/mturilli/troy_agents/pilot-5319471f8ddf87526eaad625/'."
        ],
        "event" : "monitor",
        "name" : "state_detail",
        "time" : null
    },
    {
        "tags" : [ 
            "sagapilot", 
            "Copied 'file://localhost//N/u/mturilli/Virtualenvs/AIMES/bin/bootstrap-and-run-agent' script to agent sandbox."
        ],
        "event" : "monitor",
        "name" : "state_detail",
        "time" : null
    },
    {
        "tags" : [ 
            "sagapilot", 
            "Copied 'file://localhost//N/u/mturilli/Virtualenvs/AIMES/lib/python2.7/site-packages/sagapilot/agent/sagapilot-agent.py' script to agent sandbox."
        ],
        "event" : "monitor",
        "name" : "state_detail",
        "time" : null
    },
    {
        "tags" : [ 
            "sagapilot", 
            "ComputePilot agent successfully submitted with JobID '[slurm+ssh://stampede.tacc.utexas.edu]-[2939542]'"
        ],
        "event" : "monitor",
        "name" : "state_detail",
        "time" : null
    }
]

We are missing a timestamp for the end of the pilot runtime. AFAIK, sagapilot does log such an event; can we add it to the event list?
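
For illustration, a minimal sketch (not TROY code) of the kind of derived metric these events enable, with the two timestamps above hard-coded; an end event would give the pilot runtime the same way:

    from datetime import datetime

    # timestamps copied from the 'submission' and 'start' monitor events above
    submission = datetime(2014, 3, 7, 4, 12, 15, 318000)
    start      = datetime(2014, 3, 7, 6, 23, 41, 155000)

    print("queuing time:", start - submission)   # 2:11:25.837000
    # runtime would be (end - start), but no end event is recorded yet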

andre-merzky commented 10 years ago

timed_create should be gone by now -- does it still appear?

merzky@cameo:~/saga/troy (devel %=) $ grep -ilr timed_create troy
merzky@cameo:~/saga/troy (devel %=) $ 

It refers to the creation of the troy.Pilot object instance.

"tags" : [ 
            "sagapilot", 
            "submission_time"
        ],

Those tags are separate on purpose -- eventually, I would like to grep for tags from different sources (sagapilot, bigjob, troy, saga-python, ...), and would also like to be able to grep for their original name (submission_time in this case is what sagapilot names it), without the need for string splitting or aggregation etc. For most practical purposes though, i.e. on troy level, the 'tags' should not be relevant. I hope.
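
For illustration, a minimal sketch (not TROY code) of how the separate tags can be filtered without any string splitting, assuming the events are loaded into a Python list of dicts shaped like the JSON above:

    events = [
        {"tags": [],                               "event": "timed_create",
         "name": "p.0001",     "time": "2014-03-07T04:12:14.159Z"},
        {"tags": ["sagapilot", "submission_time"], "event": "monitor",
         "name": "submission", "time": "2014-03-07T04:12:15.318Z"},
        {"tags": ["sagapilot", "start_time"],      "event": "monitor",
         "name": "start",      "time": "2014-03-07T06:23:41.155Z"},
    ]

    def events_by_tags(events, *wanted):
        # keep events that carry every requested tag
        return [e for e in events if set(wanted) <= set(e["tags"])]

    print(events_by_tags(events, "sagapilot"))                     # all sagapilot events
    print(events_by_tags(events, "sagapilot", "submission_time"))  # by original source name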

no timings: ah, that is a remnant -- state_details does not exist anymore on saga-pilot level. I'll remove them...

pilot end times: I think the overlay has not been cancelled at this point, thus no end times. Should I change the strategy in that respect?

andre-merzky commented 10 years ago

state_detail is gone now.

mturilli commented 10 years ago

Thank you, Andre. Please feel free to keep whatever you find useful in the tag 'space'. As long as we have a reasonable 'name' set to identify the (type of) event we are logging, I have no issues.

pilot end times: I think the overlay has not been cancelled at this point, thus no end times. Should I change the strategy in that respect?

I am not sure I understand what you are saying here. In my mental model, an overlay is a set of pilots. Pilots may live and die within the timespan of their overlay(s). Dynamically - i.e., on the timeline - an overlay starts before any of its pilots' start times and ends after any of its pilots' end times. Is that consistent with your representation?

andre-merzky commented 10 years ago

Hi Matteo,

re pilot end times: yes, we are in agreement about the relation of overlay and pilot life times. In this case, however, the workload has finished and the pilots are still alive when the strategy plugin decides to store the timings to the database. To get an end time for the pilots, we need to either wait 'til the pilots die of old age, or need to perform a mercy killing, ahem, overlay.cancel()... Or are you happy with the information that the pilots simply are alive at this point, when the workload is done, and get eventually killed during troy shutdown?

mturilli commented 10 years ago

I see what you mean.

  • Are we going to assume that pilots can survive an overlay?
  • Do we want to shift scheduled/running pilots across overlays (both at the same time or sequentially, one overlay after another overlay)?
  • Do we want a manager for each overlay or a 'persistent' manager that manages all the overlays (and their pilots)?

All questions that came to mind when considering whether overlay.cancel() should set the end time of all the pilots of that overlay. I am sure there are more :)

Pragmatically, for now I will take the end time of an overlay as the end time of all its pilots. Can I have the end of an overlay recorded as an event? Looking at the following snippet, I guess it would be the ISODate of "stop" for the event "cancel" under 'durations'. Oddly, that event is not added to the list of events (alongside schedule and provision):

{
    "_id" : "ol.0001",
    "durations" : [
        {
            "duration" : 0.009608,
            "start" : ISODate("2014-03-07T04:12:14.154Z"),
            "stop" : ISODate("2014-03-07T04:12:14.164Z"),
            "event" : "translate",
            "tags" : []
        },
        {
            "duration" : 0.002075,
            "start" : ISODate("2014-03-07T04:12:14.165Z"),
            "stop" : ISODate("2014-03-07T04:12:14.167Z"),
            "event" : "schedule",
            "tags" : []
        },
        {
            "duration" : 2.186032,
            "start" : ISODate("2014-03-07T04:12:14.174Z"),
            "stop" : ISODate("2014-03-07T04:12:16.360Z"),
            "event" : "provision",
            "tags" : []
        },
        {
            "duration" : 0.087283,
            "start" : ISODate("2014-03-07T06:30:02.806Z"),
            "stop" : ISODate("2014-03-07T06:30:02.893Z"),
            "event" : "cancel",
            "tags" : []
        }
    ],
    "type" : "troy.Overlay",
    "events" : [
        {
            "tags" : [],
            "event" : "timed_create",
            "name" : "ol.0001",
            "time" : ISODate("2014-03-07T04:12:13.356Z")
        },
        {
            "tags" : [],
            "event" : "state",
            "name" : "Described",
            "time" : ISODate("2014-03-07T04:12:13.359Z")
        }
    ],
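
For illustration, a minimal sketch (not TROY code) of reading that value off the 'durations' list, assuming the overlay document above is available as a Python dict:

    def overlay_end_time(ol):
        # take the 'stop' of the 'cancel' duration as the overlay end time
        for d in ol["durations"]:
            if d["event"] == "cancel":
                return d["stop"]
        return None   # overlay never canceled: no end time recorded

    # usage: end = overlay_end_time(ol)   # with ol = the "ol.0001" document above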

andre-merzky commented 10 years ago

Are we going to assume that pilots can survive an overlay?

At the moment, pilots can survive an overlay. I think your mental model though was that pilots should get canceled when the overlay object gets (explicitly or implicitly) deleted, right?

Do we want to shift scheduled/running pilots across overlays (both at the same time or sequentially, one overlay after another overlay)?

This is troy version 2 or 3 IMHO :) But yes, that would imply that pilots can survive overlays.

Do we want a manager for each overlay or a 'persistent' manager that manages all the overlays (and their pilots)?

I am not sure I understand that part :/ At the moment, an overlay manager can manage any number of overlays -- but is not persistent. (session) persistency was planned for a later version, right?

Re timings: let me just change the strategy so that the overlay is canceled and the pilots are dead when we store timings - then you'll see all the states... :)

A.

mturilli commented 10 years ago

Here is a first analysis of the timings we get from TROY (and sagapilot). I think we are getting there :) Warning: This post must be read on the GitHub website due to formatting requirements.

Details of the run of the AIMES experiment I will use to illustrate the analysis:

X000ST00SP00-L000WRR0ORR0100-000100010512-GROM0001

Resources   : XSEDE, Stampede
Pilot system: sagapilot
Strategy    : Late Binding; Round Robin Workload scheduler;
              Round Robin Overlay scheduler; Concurrency 100%; 1 pilot
Application : 1 workload, bag of 512 tasks, gromacs
Run #       : 1

We look at the following events within the session, logged as an (enormous and über-nested) JSON file. Here is a purified snippet:

Events logs
#------------------------------------------------------------------------------
"_id" : "X000ST00SP00-L000WRR0ORR0100-000100010512-GROM0001-2014-03-08.06:12:46.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:12:46.638Z")

            "name" : "Dumped",
            "time" : ISODate("2014-03-08T06:37:07.210Z")

#------------------------------------------------------------------------------
"_id" : "planner.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:12:47.533Z")

#------------------------------------------------------------------------------
"_id" : "wlm.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:12:47.534Z")

#------------------------------------------------------------------------------
"_id" : "wl.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:12:47.631Z")

            "name" : "Described",
            "time" : ISODate("2014-03-08T06:12:47.632Z")

            "name" : "Translated",
            "time" : ISODate("2014-03-08T06:13:18.340Z")

#------------------------------------------------------------------------------
"_id" : "wl.0002",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:13:00.597Z")

            "name" : "Described",
            "time" : ISODate("2014-03-08T06:13:00.598Z")

            "name" : "Translated",
            "time" : ISODate("2014-03-08T06:13:18.341Z")

            "name" : "Bound",
            "time" : ISODate("2014-03-08T06:13:20.491Z")

            "name" : "Dispatched",
            "time" : ISODate("2014-03-08T06:16:14.095Z")

            "name" : "Done",
            "time" : ISODate("2014-03-08T06:19:43.130Z")

#------------------------------------------------------------------------------
"_id" : "ol.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:13:02.052Z")

            "name" : "Described",
            "time" : ISODate("2014-03-08T06:13:02.054Z")

            "name" : "Canceled",
            "time" : ISODate("2014-03-08T06:37:07.209Z")

#------------------------------------------------------------------------------
"_id" : "p.0001",
            "name" : "Created",
            "time" : ISODate("2014-03-08T06:13:02.108Z")

            "name" : "Described",
            "time" : ISODate("2014-03-08T06:13:02.111Z")

            "name" : "Bound",
            "time" : ISODate("2014-03-08T06:13:02.114Z")

            "name" : "submission",
            "time" : ISODate("2014-03-08T06:13:02.724Z")

            "name" : "start",
            "time" : ISODate("2014-03-08T06:14:00.041Z")

            "name" : "Canceled",
            "time" : ISODate("2014-03-08T06:37:07.208Z")

From here, we order all the events by their timestamps and get a temporally ordered sequence of events (a minimal sketch of this ordering follows the list):

Events   : {Create session, create planner, create workload manager,
sequence |  create workload (container), describe workload (container),
         |  create workload (application), describe workload (application),
         |  create overlay, describe overlay, create pilot, describe pilot,
         |  bound pilot, submit pilot, translate workload (container),
         |  translate workload (application), bound workload (application),
         |  start pilot, dispatch workload (application),
         |  complete workload (application), cancel pilot, cancel overlay,
         |  dump session}
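
For illustration, a minimal sketch (not TROY code) of that ordering, assuming the session documents keep their 'events' lists (as in the ol.0001 document shown earlier), each event being a dict with 'name' and 'time':

    def ordered_events(docs):
        # flatten all per-object events into one time-ordered sequence
        timeline = []
        for doc in docs:
            for ev in doc.get("events", []):
                if ev.get("time") is not None:
                    timeline.append((ev["time"], doc["_id"], ev["name"]))
        return sorted(timeline)

    # usage:
    #   for time, oid, name in ordered_events(session_docs):
    #       print(time, oid, name)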

We can then (manually) plot the data. Scroll to the right to see the whole plot. Note that the time resolution of the first plot is 2s while that of the second is 20s:

  |
  | session created                                                                                                                (6:12:46.638)
  |
  | planner created                                                                                                                (6:12:47.533)
  | workload manager created                                                                                                       (6:12:47.534)
  |
  | worload 0001 created                                                                                                           (6:12:47.631)
  | worload 0001 described                                                                                                         (6:12:47.632)
  |
  |       workload 0002 created                                                                                                    (6:13:00.597)
  |       workload 0002 described                                                                                                  (6:13:00.598)
  |
E |        overlay created                                                                                                         (6:13:02.052)
V |        overlay described                                                                                                       (6:13:02.054)
E |
N |        pilot created                                                                                                           (6:13:02.108)
T |        pilot Described                                                                                                         (6:13:02.111)
S |        pilot Bound                                                                                                             (6:13:02.114)
  |        pilot submission                                                                                                        (6:13:02.724)
  |
  |                worload 0001 translated                                                                                         (6:13:18.340)
  |                workload 0002 translated                                                                                        (6:13:18.341)
  |                 workload 0002 bound                                                                                            (6:13:20.491)
  |
  |                                     pilot start                                                                                (6:14:00.041)
  |
  |                                                                                                          workload 0002 dsptchd (6:16:14.095)
  |
--|-|------|-------+--------------------|--------------------------------------------------------------------|-----------------------------------
      TROY     W                                sagapilot agent setup + TROY|sagapilot scheduling +                            TIME (1step = 2s)
      setup  binding                                             data stage in (134s)
      (16s)   (18s)
                  job queing time (58s)

  | workload 0002 dispatched                                                       (6:16:14.095)
  |           workload 0002 done                                                   (6:19:43.130)
  |
  |                                                               pilot Canceled   (6:37:07.208)
  |                                                               overlay canceled (6:37:07.209)
  |                                                               session dumped   (6:37:07.210)
  |
--|-|---------|---------------------------------------------------|------------------------------
        W                   data stage out (1044s)                            TIME (1step = 20s)
    execution
      (209s)

Some considerations and questions:

Subevents: {Set up TROY, Bind Workload, {setup agent, stage data in,
sequence |  schedule CUs}, execute workload, stage data out}

TROY setup (i.e. overhead) down to 16 seconds. Great! Queuing time very short this time, yay! Do you agree with these aggregations? Do you see other potentially interesting aggregations?

I noticed that the following durations can be used to check/derive some of the aggregated events. I am not sure where to place the binding of the workload to a pilot though. I am pretty sure it is 'schedule', but what does 'dispatch' stand for?

subevents logs:

#------------------------------------------------------------------------------
"_id" : "wl.0002",
            "duration" : 0.354249,          BINDING?
            "event" : "schedule",

            "duration" : 136.664604,
            "event" : "stage_in_workload",  DATA STAGE IN  (> DCI)?

            "duration" : 18.523276,
            "event" : "dispatch",

            "duration" : 1044.063511,       DATA STAGE OUT (< DCI)?
            "event" : "stage_out_workload",

That's all for now and thank you for reading all this!

andre-merzky commented 10 years ago

Hi Matteo,

let me reply via mail -- that makes interleaved responses easier. We'll see how that shows up in the ticket...

On Sat, Mar 8, 2014 at 7:32 PM, mturilli notifications@github.com wrote:

Here is a first analysis of the timings we get from TROY (and sagapilot). I think we are getting there :) Warning: This post must be read on the GitHub website due to formatting requirements.

Reads fine via mail for me, FWIW. Nice graphs / analysis :)

Details of the run of the AIMES experiment I will use to illustrate the analysis:

[...]

Some considerations and questions:

  • Data staging dominates the time to completion of this run (~20m vs ~3.5m execution time). I don't know what the ratio [compute time/output file size] is for gromacs. Were it constant, the time currently spent moving data would become irrelevant (10% of the run time) for runs of about ~3.3 hours. I doubt that ratio is constant, so, for gromacs, I am not sure we can assume that data staging is always irrelevant compared to runtime.
  • Data staging to the DCI costs around 134s while stage out takes around 1044s. Interestingly, these two numbers tell us nothing without knowing at least the amount of data staged in both directions (for this run those amounts are: > 4.75MB | < 1GB). Shall we add this to the TROY logs?

While logging those things seems sensible, we need to get some clarification on what that is about. At the moment, this whole system is really only looking at timing, and does not try to create a trace of what is happening and why. If we want to have something like this, the structure would be very different I think. If it is just timing we want (at this point), then adding file size as metadata to a timing event might be possible, but I am not sure how easy that would be.

  • Data staging towards the DCI can be cut to 1/512th by transferring the file only once. See the ticket opened on the sagapilot tracker about copying vs linking.
  • Data staging should probably be recorded as two separate chapters in the JSON logs: stage_in and stage_out, each with their own durations and events. At the moment, we probably have just start/end, but we might soon have "replicated" and much more.

I am not sure I understand that part; maybe I am missing what 'chapter' means. From the troy perspective, data staging is at this point an operation which happens on workload level. On workload level, though, we do have 'stage_in_workload' and 'stage_out_workload' recorded as separate metrics... Can you clarify?

  • workload/overlay/pilot create events are redundant. I can easily disregard them, but I would keep them in the logs because they could be useful for other analyses. For example, if something goes wrong with the code, it might be possible to pinpoint the bottleneck thanks to discrepancies between creating and describing.

What do you mean with redundant? They are recorded only once as far as I can see?

  • Cancel events are missing for: planner, workload manager, wl0001. I do not need them, but I would add them for the same reason I would keep the create events.

This is because the strategy does not bother to cancel planner and workload manager -- there is no state model behind the managers. So, at best you would see when the object gets destroyed, but that tells you more about python's garbage collection than about troy. Workloads are not canceled when they are Done or Failed -- Done and Failed are final states.

  • The separation between wl0001 and wl0002 is spurious. The application has only one workload. wl0002 should probably be renamed to par.0001.

If you remember, we introduced workload partitions as an afterthought, and decided not to make them primary objects -- that is why those are workloads within workloads, and the ID schema is not easily changed (we could hack it, but that would obfuscate the code to some extent).

  • Events should be derived for wl.0001 by aggregating the events of each partition - i.e., a workload is described/translated/bound/dispatched/done when all its partitions are described/translated/bound/dispatched/done.

I would rather see this done as a post-mortem aggregation. During runtime, the relation of a partition to its 'parent' workload is not very relevant, and not really visible to a method which operates on that partition, so moving the events upwards is not easy. For example, the 'schedule_workload' will have a duration and a state transition recorded for the scheduled partition -- but it does not know if that is a partition or a parent workload or a grandparent workload etc. It quacks like a workload ;)

We can keep the structural information around though (we keep many already, I'll add workload/partition relation), so that re-ordering events will be possible.

But we have to be clear about what this means. Assume a workload has two partitions, and we first translate/schedule/dispatch part_1, then translate/schedule/dispatch part_2. Now you would have all states and durations doubled on the parent workload... Also, it is not easy to define the aggregated state -- what if partition_1 is done and partition_2 is scheduled: what is the aggregated state? 'scheduled' seems... strange?
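
For illustration only, one hypothetical aggregation rule (not TROY's state model): take the least advanced partition state, using the state names that appear in the logs above. It reproduces exactly the strangeness described -- a parent with one Done and one Bound partition would report Bound:

    # hypothetical sketch only, with an assumed ordering of the logged states
    ORDER = ["Described", "Translated", "Bound", "Dispatched", "Done"]

    def aggregated_state(partition_states):
        # the parent lags its slowest partition
        return min(partition_states, key=ORDER.index)

    print(aggregated_state(["Done", "Bound"]))   # -> Bound, arguably 'strange'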

  • Does the stage-in of the input file(s) start only after the pilot has started, or is it done concurrently? These data seem to indicate the former.

stage_in starts when stage_in_workload() is called by the strategy ;) We don't even do state checks at this point, so the strategy can be very flexible -- we just expect the transfer directives to be fully specified, and that may mean that we need a target host, which implies the need for scheduling of both overlay and workload (unless the directives have fully qualified URLs, which usually won't be the case). The current strategies both stage-in after the workload is bound. For both that means that the pilots are also bound -- otherwise we would not have a stage target. For the late binding, stage-in happens before the overlay is dispatched.

  • Binding happens after the pilot has been submitted (to the remote DCI). I am running an early binding sample of this run but, while I wait, is the early binding going to happen right after "pilot described"?

I am not sure I understand the above... The late binding strategy does everything on the overlay, up to dispatching, before even translating the workload into CUs. The early binding strategy is, naturally, interleaving overlay and workload activities, and the workload is scheduled before the pilots are dispatched -- but the pilots are scheduled at that point. You may want to look into the strategy plugin code, it should be relatively straight forward in terms of call ordering...

  • The timings of the events can be aggregated so as to derive the timings for some untracked events. In the graph we see:

Subevents: {Set up TROY, Bind Workload, {setup agent, stage data in,
sequence |  schedule CUs}, execute workload, stage data out}

TROY setup (i.e. overhead) down to 16 seconds. Great! Queuing time very short this time, yay! Do you agree with these aggregations? Do you see other potentially interesting aggregations?

Looks good! I am sure there will be others later, but can't think of any specific right now...

I noticed that the following durations can be used to check/derive some of the aggregated events. I am not sure where to place the binding of the workload to a pilot though. I am pretty sure it is 'schedule', but what does 'dispatch' stand for?

Dispatch is to push the CUs to the overlay, i.e. to the pilot system's control.

subevents logs:

------------------------------------------------------------------------------

"_id" : "wl.0002", "duration" : 0.354249, BINDING? "event" : "schedule",

        "duration" : 136.664604,
        "event" : "stage_in_workload",  DATA STAGE IN  (> DCI)?

        "duration" : 18.523276,
        "event" : "dispatch",

        "duration" : 1044.063511,       DATA STAGE OUT (< DCI)?
        "event" : "stage_out_workload",
  • Should we create new events instead of fishing for durations?

I would not mind scrapping the durations and only recording events. That would replace, for example:

"duration" : 136.664604, "event" : "stage_in_workload"

with "event" : 100.0, "tag" : "start", "name : "stage_in_workload" "event" : 236,664604, "tag" : "stop", "name : "stage_in_workload"

Durations can always be calculated from that, but it would simplify the data structure and make it easier to create timelines.
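
For illustration, a minimal sketch (not TROY code) of recovering a duration from such paired start/stop records, using the numbers from the example above:

    def durations_from_events(records):
        # pair 'start'/'stop' records by name and compute the elapsed times
        starts, durations = {}, {}
        for r in records:
            if r["tag"] == "start":
                starts[r["name"]] = r["event"]
            elif r["tag"] == "stop":
                durations[r["name"]] = r["event"] - starts[r["name"]]
        return durations

    records = [
        {"event": 100.0,      "tag": "start", "name": "stage_in_workload"},
        {"event": 236.664604, "tag": "stop",  "name": "stage_in_workload"},
    ]
    print(durations_from_events(records))   # -> stage_in_workload: ~136.66 s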

  • More generally, do you see any other event we should record?

I think we should allow plugins to add custom events.

That's all for now and thank you for reading all this!

Phew, this was long indeed :D

Thanks, interesting feedback!

Andre.

Nothing is really difficult.

mturilli commented 10 years ago

Moved to the troy-devel mailing list.

Dr Matteo Turilli
Department of Electrical and Computer Engineering
Rutgers University