radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Backfilling scheduler backfills erratically on 3 pilots #313

Closed mturilli closed 10 years ago

mturilli commented 10 years ago

Running a 1024 BoT on stampede, trestles, and gordon. Session UID: 5402008c20a6417570206001

Observed behavior: the trestles pilot comes online almost immediately; 341 CUs are executed but no stage-out happens. RP waits until gordon comes up, then executes 480 CUs (?!), all of which fail (expected due to the MPI issue). RP then waits for stampede to come up; 203 CUs are executed, and they complete successfully.

Notes:

andre-merzky commented 10 years ago

Actually, I think the scheduler does ok -- the pilots are not looking good, nor does the CU execution.

From the perspective of the BF scheduler, at ~200s the trestles pilot becomes available and gets its share of units assigned. As those units never complete within the pilot lifetime, no new units get scheduled to it in that time. At ~3000s, the gordon pilot becomes active and also gets its share of units. Again, no unit seems to make it through to a final state; they seem to hang in execution. At ~4300s (and this is hard to see, but you'll find it if you look closely), some of gordon's units go into FAILED, and gordon (being the only active pilot) immediately gets new units assigned -- but those seem to go into FAILED immediately, too. At ~4700s, that pilot dies, and the stampede pilot becomes available -- it gets the remaining units and executes them nicely. Those are the ones which get staged out. Stampede gets a smaller share than gordon because gordon had already taken more than a third of the CUs.
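The share accounting described above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not RP's scheduler code: the per-pilot counts are the ones reported in this ticket, while the equal split (`total // n_pilots`) and the helper name `share_report` are assumptions made for illustration.

```python
# Sketch only: checks the unit counts reported in this ticket against an
# assumed equal per-pilot share. This is not RADICAL-Pilot scheduler code.

TOTAL_UNITS = 1024  # size of the bag of tasks
EXECUTED = {"trestles": 341, "gordon": 480, "stampede": 203}  # from the ticket


def share_report(total, executed):
    """Return (fair_share, deviation per pilot from that share).

    fair_share is an assumed equal split of `total` across all pilots.
    """
    fair_share = total // len(executed)
    deviation = {name: count - fair_share for name, count in executed.items()}
    return fair_share, deviation


fair_share, deviation = share_report(TOTAL_UNITS, EXECUTED)

# Every unit is accounted for by exactly one pilot.
assert sum(EXECUTED.values()) == TOTAL_UNITS

print("fair share per pilot:", fair_share)
for name, delta in deviation.items():
    print("%-9s executed %4d (%+d vs. fair share)" % (name, EXECUTED[name], delta))
```

Under that assumption, trestles got exactly one share, gordon got one share plus the units it was handed while it was the only active pilot, and stampede got whatever was left.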

So, scheduling-wise that's ok. Trestles and Gordon are not... :(

mturilli commented 10 years ago

OK. Please feel free to change the title of this ticket accordingly.

I am afraid I will have to stop my tests on XSEDE. At the moment RP seems to be working only (and not always) with stampede.

andre-merzky commented 10 years ago

I am going to close this ticket then -- we have separate tickets for gordon (#309) and trestles (#301). I did not see a ticket on blacklight, and also have no login (or at least no access). If there are problems there, too, would you mind opening a separate ticket with some details, please? Also, I wasn't aware that stampede sometimes fails for you, and also could not find a specific ticket. Any details on what's up here?

Ole, Mark, I will need your input on those tickets, I'm afraid. I think we all agree that there won't be a release before we have stable operation across XSEDE. I know it's the weekend, and Mark is offline anyway -- but let's please focus on those issues on Monday...

mturilli commented 10 years ago

I understand it is difficult to keep track of all the tickets/issues we discussed, so here are some clarifications that might be useful:

andre-merzky commented 10 years ago

Thanks for the clarification on blacklight and stampede. Now I see what you meant with stampede -- that seemed indeed unrelated to RP, fortunately, and by now stampede seems to be stable again.

Re the stalling run: if the minimal sandboxes existed (only containing the bootstrapper and pilot agent), then the transfer of those files via SAGA worked. It's hard to say what (did not) happen(ed) after that.

If RP hangs and you don't see any log activity, please interrupt via CTRL-C and post the complete resulting stacktrace and the application code (if we can run that) -- we may be able to track it down from there.
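If the CTRL-C stacktrace only shows the main thread, a small wrapper around the application's entry point can capture the other threads as well. The following is a generic Python sketch, not part of RP: `dump_all_stacks` and `run_guarded` are hypothetical helper names, and `main` stands for whatever function starts your application.

```python
import sys
import threading
import traceback


def dump_all_stacks():
    """Return a string with the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (names.get(ident, "unknown"), ident))
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def run_guarded(main):
    """Run main(); on CTRL-C, write all thread stacks to stderr, then re-raise."""
    try:
        return main()
    except KeyboardInterrupt:
        sys.stderr.write(dump_all_stacks() + "\n")
        raise
```

Calling `run_guarded(main)` instead of `main()` leaves normal runs unchanged; on interrupt, the per-thread stacks are printed before the usual traceback and can be pasted into the ticket.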

Thanks!