Open GoogleCodeExporter opened 9 years ago
Some insights:
In core.py:1240 there is the loop:
for task_index, resource_name in sched:
# OMISSIS
try:
self._core.submit(task, targets=[resource])
if self._store:
self._store.save(task)
self._in_flight.append(task)
transitioned.append(task_index)
if isinstance(task, Application):
currently_submitted += 1
currently_in_flight += 1
sched.send(task.execution.state)
The `first_come_first_serve` function in core.py:922 the loop however is:
for task_idx, task in enumerate(tasks):
# OMISSIS
targets = matchmaker.rank(task, compatible_resources)
# now try submission of the task to each resource until one succeeds
for target in targets:
result = yield (task_idx, target.name)
if result != Run.State.NEW:
# task accepted by resource, continue with next task
break
Assuming the first task is a TaskCollection, which cannot be submitted
by itself, and assuming 3 resources are defined, the scheduler will
yield:
* (0, resource1)
* (0, resource2)
* (0, resource3)
unless it is stopped by a `send`, which happens only if the task is
not in NEW state anymore.
In warholize however this is the result:
the `transitioned` list will be equal to
[0, 0, 0]
while `self._in_flight` will contain:
[self._new[0], self._new[0], self._new[0]]
This because after submitting the first ParallelTaskCollection, for
some reason, the state is still NEW. I guess this is because the tasks
in the ParallelTaskCollection are already in `queue`, which in this
case is exactly `self._new`. This is done in core.py:1068
if not _contained(task, queue):
queue.append(task)
task.attach(self)
If this analysis is correct, the only thing to change is to ensure
that we do not add a task in self._in_flight if it is already there,
and not appending the index to transitioned if it is already there.
I would change, therefore, the lines core.py:1248-1249 from:
self._core.submit(task, targets=[resource])
if self._store:
self._store.save(task)
self._in_flight.append(task)
transitioned.append(task_index)
to
self._core.submit(task, targets=[resource])
if self._store:
self._store.save(task)
if task_index not in transitioned:
self._in_flight.append(task)
transitioned.append(task_index)
Original comment by antonio....@gmail.com
on 21 Oct 2013 at 8:15
This patch makes the warholize example work.
Original comment by antonio....@gmail.com
on 21 Oct 2013 at 8:39
Attachments:
Thanks for the code analysis and the patch! I have no objections to
applying it as an interim solution.
I think the patch works around the problem, but the root issue is a
bit deeper and lies in a different contract for `Core.submit()`
assumed by the `Engine` and `Scheduler` objects.
The `Engine`, as it was before applying the new scheduler code and
partially is now, assumes that `Core.submit()` will *always* throw an
exception if something failed; in other words, exception-less submit()
resulted in a submitted job, regardless of job state.
However, the scheduler code currently assumes that a successful submit
must result in `task.execution.state != NEW`.
I think the former interface (`submit()` is successful unless an
exception is thrown) is simpler and I should adapt the scheduler
accordingly. But I'm open to discussion: what's your opinion here?
Original comment by riccardo.murri@gmail.com
on 21 Oct 2013 at 10:57
On 22 October 2013 00:57, <gc3pie@googlecode.com> wrote:
| However, the scheduler code currently assumes that a successful submit
| must result in `task.execution.state != NEW`.
Must be precise here: the code in "first_come_first_serve()" assumes
that. The `Scheduler` class interface is quite agnostic to what
`Core.submit()` does and needs no be changed.
Original comment by riccardo.murri@gmail.com
on 21 Oct 2013 at 11:11
This issue was closed by revision r3738.
Original comment by antonio....@gmail.com
on 22 Oct 2013 at 6:30
I'm not sure I understand your objection. There is no problem, nothing failed,
it's just that the applications composing a collection are enqueued by the
engine when submitted, instead of actually submitted on the remote resource.
Original comment by antonio....@gmail.com
on 22 Oct 2013 at 6:39
I'm re-opening this: please report if the patch in SVN r3739 works. I'll close
this after confirmation.
Original comment by riccardo.murri@gmail.com
on 22 Oct 2013 at 10:18
Original issue reported on code.google.com by
antonio....@gmail.com
on 21 Oct 2013 at 4:55Attachments: