radical-cybertools / radical.owms

Tiered Resource OverlaY

Catching/acting upon errors caught by the pilot layer #49

Open mturilli opened 10 years ago

mturilli commented 10 years ago

Currently, TROY does not catch/act upon errors caught by the pilot layer. Examples:

In both cases, TROY reports pilots in state dispatched and happily runs forever.

andre-merzky commented 10 years ago

The reason is that we happily poll for the workload state, which is 'DISPATCHED' (all CUs are PENDING) -- but never bother to check the overlay state. This is a performance issue, and an issue of programming paradigms. Ideally, we would love to have notifications on pilot state changes, so that troy gets informed when the overlay goes MIA. That will come in saga pilot eventually (is getting closer). W/o notifications, we could alternate between polling pilot state and workload state -- but that makes the examples more complex, and adds latency.

Either way, I agree that this needs addressing...
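
For illustration, the alternating-polling fallback mentioned above could look roughly like the sketch below. It is not TROY or sagapilot code: the state names, the `OverlayError` exception, and the two getter callables (`get_workload_state`, `get_pilot_states`) are hypothetical stand-ins for whatever the real layers expose. Each loop iteration checks the overlay first, then the workload, so a dead overlay is detected instead of waiting forever on a workload that never leaves its dispatched state.

```python
import time

# Hypothetical state names, for illustration only; these are assumptions,
# not the actual constants used by TROY or sagapilot.
PILOT_FAILED_STATES = {"FAILED", "CANCELED"}
WORKLOAD_FINAL_STATES = {"DONE", "FAILED", "CANCELED"}


class OverlayError(RuntimeError):
    """Raised when every pilot in the overlay has reached a failed state."""


def wait_for_workload(get_workload_state, get_pilot_states,
                      interval=1.0, sleep=time.sleep):
    """Alternate between polling the overlay and the workload.

    get_workload_state: callable returning the current workload state (str).
    get_pilot_states:   callable returning a list of current pilot states.
    Both callables are placeholders for the real state queries.
    """
    while True:
        # 1. Check the overlay: if no pilot can ever run a CU, give up.
        pilot_states = list(get_pilot_states())
        if pilot_states and all(s in PILOT_FAILED_STATES for s in pilot_states):
            raise OverlayError("all pilots failed: %s" % pilot_states)

        # 2. Check the workload: return once it reaches a final state.
        workload_state = get_workload_state()
        if workload_state in WORKLOAD_FINAL_STATES:
            return workload_state

        # 3. Wait before the next round; this interval is the added latency.
        sleep(interval)
```

The `interval` parameter makes the trade-off explicit: a shorter interval detects overlay failures sooner but issues twice as many state queries per round as polling the workload alone.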

mturilli commented 10 years ago

Hi Andre, many thanks for the insightful details. Do you have a timeline for the implementation of notifications in sagapilot? If the timeline is long, we may want to evaluate in detail the latency and complexity overhead of the alternative, pull-based approach.

Dr Matteo Turilli Department of Electrical and Computer Engineering Rutgers University

andre-merzky commented 10 years ago

Thinking about it, we might need the pull-based approach anyway, for bigjob. Will think of something - but probably not over the next week or so. In our call in ~10 days, can we go over the open tickets and prioritize (e.g. sort them into milestones)?

mturilli commented 10 years ago

OK, thank you. Re milestones: sure, this is what I am doing and I would be more than happy to do this all together.
