IndexError in the Engine while running warholize

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run warholize using a torque cluster (the problem didn't appear with 
localhost, I didn't check other resource types)

What is the expected output? What do you see instead?

The first application is run as expected, but an error is raised (see attachd 
log)

It affects the *master* branch, revision 3607

Original issue reported on code.google.com by antonio....@gmail.com on 21 Oct 2013 at 4:55

Attachments:

issue.log

GoogleCodeExporter commented 9 years ago

Some insights:

In core.py:1240 there is the loop:

    for task_index, resource_name in sched:
        # OMISSIS
        try:
            self._core.submit(task, targets=[resource])
            if self._store:
                self._store.save(task)
            self._in_flight.append(task)
            transitioned.append(task_index)
            if isinstance(task, Application):
                currently_submitted += 1
                currently_in_flight += 1
            sched.send(task.execution.state)

The `first_come_first_serve` function in core.py:922 the loop however is:

    for task_idx, task in enumerate(tasks):
        # OMISSIS
        targets = matchmaker.rank(task, compatible_resources)
        # now try submission of the task to each resource until one succeeds
        for target in targets:
            result = yield (task_idx, target.name)
            if result != Run.State.NEW:
                # task accepted by resource, continue with next task
                break

Assuming the first task is a TaskCollection, which cannot be submitted
by itself, and assuming 3 resources are defined, the scheduler will
yield:

* (0, resource1)
* (0, resource2)
* (0, resource3)

unless it is stopped by a `send`, which happens only if the task is
not in NEW state anymore.

In warholize however this is the result:

the `transitioned` list will be equal to

    [0, 0, 0]

while `self._in_flight` will contain:

    [self._new[0], self._new[0], self._new[0]]

This because after submitting the first ParallelTaskCollection, for
some reason, the state is still NEW. I guess this is because the tasks
in the ParallelTaskCollection are already in `queue`, which in this
case is exactly `self._new`. This is done in core.py:1068

    if not _contained(task, queue):
        queue.append(task)
        task.attach(self)

If this analysis is correct, the only thing to change is to ensure
that we do not add a task in self._in_flight if it is already there,
and not appending the index to transitioned if it is already there.

I would change, therefore, the lines core.py:1248-1249 from:

    self._core.submit(task, targets=[resource])
    if self._store:
        self._store.save(task)
    self._in_flight.append(task)
    transitioned.append(task_index)

to

    self._core.submit(task, targets=[resource])
    if self._store:
        self._store.save(task)
    if task_index not in transitioned:
        self._in_flight.append(task)
        transitioned.append(task_index)

Original comment by antonio....@gmail.com on 21 Oct 2013 at 8:15

Changed state: Started

GoogleCodeExporter commented 9 years ago

This patch makes the warholize example work.

Original comment by antonio....@gmail.com on 21 Oct 2013 at 8:39

Attachments:

issue_419.patch

GoogleCodeExporter commented 9 years ago

Thanks for the code analysis and the patch!  I have no objections to
applying it as an interim solution.

I think the patch works around the problem, but the root issue is a
bit deeper and lies in a different contract for `Core.submit()`
assumed by the `Engine` and `Scheduler` objects.

The `Engine`, as it was before applying the new scheduler code and
partially is now, assumes that `Core.submit()` will *always* throw an
exception if something failed; in other words, exception-less submit()
resulted in a submitted job, regardless of job state.

However, the scheduler code currently assumes that a successful submit
must result in `task.execution.state != NEW`.

I think the former interface (`submit()` is successful unless an
exception is thrown) is simpler and I should adapt the scheduler
accordingly.  But I'm open to discussion: what's your opinion here?

Original comment by riccardo.murri@gmail.com on 21 Oct 2013 at 10:57

GoogleCodeExporter commented 9 years ago

On 22 October 2013 00:57,  <gc3pie@googlecode.com> wrote:
| However, the scheduler code currently assumes that a successful submit
| must result in `task.execution.state != NEW`.

Must be precise here: the code in "first_come_first_serve()" assumes
that.  The `Scheduler` class interface is quite agnostic to what
`Core.submit()` does and needs no be changed.

Original comment by riccardo.murri@gmail.com on 21 Oct 2013 at 11:11

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r3738.

Original comment by antonio....@gmail.com on 22 Oct 2013 at 6:30

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I'm not sure I understand your objection. There is no problem, nothing failed, 
it's just that the applications composing a collection are enqueued by the 
engine when submitted, instead of actually submitted on the remote resource.

Original comment by antonio....@gmail.com on 22 Oct 2013 at 6:39

GoogleCodeExporter commented 9 years ago

I'm re-opening this: please report if the patch in SVN r3739 works. I'll close 
this after confirmation.

Original comment by riccardo.murri@gmail.com on 22 Oct 2013 at 10:18

Changed state: Started

goerz / gc3pie

IndexError in the Engine while running warholize #419