labgrid-project / labgrid

Embedded systems control library for development, testing and installation
https://labgrid.readthedocs.io/

Strategy timeouts cause inconsistent state #1123

Open jpuderer opened 1 year ago

jpuderer commented 1 year ago

When running pytest tests using the strategies shipped with labgrid, the strategy's state can become inconsistent if a timeout occurs while bringing up the target.

I ran into this issue when doing some simple endurance testing.

By way of an example, consider what happens if you run the following simple tests using the UBootStrategy (although others have the same problem).

@pytest.mark.parametrize("iteration", [1, 2, 3, 4, 5])
def test_boot(target, strategy, iteration):
    strategy.transition("shell")

Now suppose that a timeout occurs on iteration 1 while waiting for a login prompt. This causes the first test to fail, and leaves our strategy in the "u-boot" state.

At this point, we're in an inconsistent state. The target is almost certainly not at the u-boot prompt... It could be hung, or crashed, or it could simply have taken too long to boot and is now at the login prompt.

Unfortunately, this will cause test 2 and all following tests to fail.

First, the u-boot driver was deactivated when the shell driver was activated in test #1, so when the strategy tries to run:

        elif status == Status.shell:
            # transition to uboot
            self.transition(Status.uboot)
            self.uboot.boot("")
            self.uboot.await_boot()

It skips over the transition to uboot (since it is already there), but fails with a StateError when it tries to run uboot.boot(""):

E           labgrid.binding.StateError: UBootDriver(target=Target(name='main', env=Environment(config_file='test.yaml')), name=None, state=<BindingState.bound: 1>, prompt='=> ', autoboot='stop autoboot', password='', interrupt='\n', init_commands=(), password_prompt='enter Password:', boot_expression='', bootstring='Linux version \\d', boot_command='run bootcmd', boot_commands={}, login_timeout=30, boot_timeout=30) has not been activated, UBootDriver.boot cannot be called in state "bound"

Even if that weren't the case and we reactivated the uboot driver, our strategy would get a tiny bit farther, but would then time out waiting for the kernel to start booting in uboot.await_boot().

I think what needs to happen is for the Strategy to catch (and later re-raise) any exceptions when it performs a transition, and set the state to "unknown".
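
Roughly like this in the shipped strategies (only a sketch, untested; _do_transition() is a made-up name standing in for the existing per-state if/elif logic):

def transition(self, status):
    if not isinstance(status, Status):
        status = Status[status]
    try:
        # _do_transition() stands in for the existing per-state if/elif logic
        self._do_transition(status)
    except Exception:
        # the target could be hung, crashed, or already past the point we
        # expected, so stop pretending we know where it is before re-raising
        self.status = Status.unknown
        raise
    self.status = status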

Bastian-Krause commented 1 year ago

Tests should generally leave the system in a known state. If a test is known to be possibly disruptive, the strategy state can be reset to "unknown" in an except block before re-raising the exception. If a test always disrupts the system state intentionally, resetting the strategy's state can also be done during teardown.

To reset the strategy's state, there is the GraphStrategy's invalidate() method. For regular strategies, you could implement something similar in your strategy. We should improve the example strategies in labgrid like this as well.

This should be used with care, though: If the strategy fails on its own, you probably don't want to reset to the "unknown" state on every failing call to transition(). Imagine a strategy bootstrapping a system by flashing its eMMC on every test in a large test suite with hundreds of tests. A possible solution could be to introduce an error counter. Under a certain threshold it could reset to "unknown". Once it reaches the threshold, it could enter an error state. Subsequent transition() calls would then raise an exception immediately.
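
For example, a fixture along these lines (only a sketch; it assumes the same strategy/target fixtures as in the examples here, and invalidate() is the GraphStrategy method mentioned above):

import pytest

@pytest.fixture
def shell(strategy, target):
    try:
        strategy.transition("shell")
    except Exception:
        # we no longer know which state the target is in, so force the
        # strategy to start from scratch on the next transition()
        strategy.invalidate()  # GraphStrategy; a regular strategy would reset its status attribute instead
        raise
    return target.get_driver("ShellDriver")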

jpuderer commented 1 year ago

I agree with everything you said, but it doesn't address my concern. I may not be explaining it clearly. My main point is:

Anytime a strategy throws an exception during a transition, it is by definition in an unknown state.

None of the sample strategies included in labgrid try to capture any of the exceptions that could be thrown. I think exceptions should be caught inside the strategy, the status set to unknown, and the exception re-raised.

Your point about being stuck in a loop causing eMMC to be flashed seems beside the point. I agree that you should try to avoid wearing out things like SD or eMMC accidentally, but even in your own video tutorial on writing a custom strategy (https://www.youtube.com/watch?v=APIA8aUH98Q), this doesn't apply since you wisely use a separate boolean ('bootstrapped') to check if the target has been flashed or not. For the reasons you outlined above, it seems obvious to me that the target's "flashing state" and "running state" should be managed separately.

I don't understand how an error counter helps. :/

Consider the following simple (and not completely hypothetical) scenario...

  1. I have a target that is slightly flaky when booted a particular way, and will sometimes hang on boot.
  2. I'm not sure exactly how flaky it is, or when it fails, so I'm trying to get more information. Is it 1/10, 1/100, 1/1000?
  3. I'm trying to figure this out by re-running a simple boot test continuously (like the one I listed in my original example, using pytest.mark.parametrize with a range)
  4. When the boot test fails in the strategy transition (because the target is hung), the test should fail.

However, the issue is that currently, if I'm using one of the strategies included in labgrid, every test after my first boot failure will also fail, because the strategy thinks the target is waiting at a u-boot prompt, when the target is in fact hung.

Bastian-Krause commented 1 year ago

I agree with everything you said, but it doesn't address my concern. I may not be explaining it clearly. My main point is:

Anytime a strategy throws an exception during a transition, it is by definition in an unknown state.

None of the sample strategies included in labgrid try to capture any of the exceptions that could be thrown. I think exceptions should be caught inside the strategy, the status set to unknown, and the exception re-raised.

My general understanding about exceptions is: if an exception is raised, you get to keep the pieces. :grin: There is no guarantee (and we don't make any promises) that the object's state is consistent afterwards.

That being said, I think it would be a good idea to implement resetting the state to "unknown" if an exception occurs, so the test suite can continue if the exception is caused by a flaky device under test. Note that infrastructure failures (e.g. disappearing USB devices like USB-SD-Muxes on the exporter) will also cause the state to be reset.

The test suites I maintain usually use state fixtures similar to this:

import traceback

import pytest

@pytest.fixture
def shell(strategy, target):
    try:
        strategy.transition("shell")
    except Exception as e:
        traceback.print_exc()
        pytest.exit(f"Transition into shell state failed: {e}", returncode=3)

    return target.get_driver("ShellDriver")

This basically means: All strategy errors are considered critical and non-recoverable. The devices under test are not flaky, though.

Your point about being stuck in a loop causing eMMC to be flashed seems beside the point. I agree that you should try to avoid wearing out things like SD or eMMC accidentally, but even in your own video tutorial on writing a custom strategy (https://www.youtube.com/watch?v=APIA8aUH98Q), this doesn't apply since you wisely use a separate boolean ('bootstrapped') to check if the target has been flashed or not.

Ah, you mean resetting the strategy to "unknown" wouldn't trigger the bootstrapping again, because the bootstrapped attribute shouldn't be reset. I guess that makes sense.

For the reasons you outlined above, it seems obvious to me that the target's "flashing state" and "running state" should be managed separately.

Agreed.
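
In other words, something like this (plain-Python sketch, not an actual labgrid strategy; the attribute names are just illustrative):

import enum

class Status(enum.Enum):
    unknown = 0
    off = 1
    shell = 2

class ExampleStrategy:
    def __init__(self):
        self.status = Status.unknown   # running state
        self.bootstrapped = False      # flashing state, tracked separately

    def bootstrap(self):
        if self.bootstrapped:
            return
        # ... flash the eMMC/SD card here, at most once ...
        self.bootstrapped = True

    def invalidate(self):
        # forget the running state only; this never re-triggers flashing
        self.status = Status.unknown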

I don't understand how an error counter helps. :/

Depending on the number of tests and the time it takes to boot the device, you still might want to limit the number of "resets to unknown" in the strategy. But I agree, wear-out concerns are a non-issue if we track the bootstrapped state with a separate attribute.

Consider the following simple (and not completely hypothetical) scenario...

  1. I have a target that is slightly flaky when booted a particular way, and will sometimes hang on boot.
  2. I'm not sure exactly how flaky it is, or when it fails, so I'm trying to get more information. Is it 1/10, 1/100, 1/1000?
  3. I'm trying to figure this out by re-running a simple boot test continuously (like the one I listed in my original example, using pytest.mark.parametrize with a range)
  4. When the boot test fails in the strategy transition (because the target is hung), the test should fail.

However, the issue is that currently, if I'm using one of the strategies included in labgrid, every test after my first boot failure will also fail, because the strategy thinks the target is waiting at a u-boot prompt, when the target is in fact hung.

Understood.

Bastian-Krause commented 1 year ago

We continued this discussion internally:

Changes that should be implemented in labgrid:

  • strategies need to handle expected failures correctly themselves, e.g. bootloader not starting on every power-on
  • generally resetting to the "unknown" state on exceptions in transition() is surprising and wrong
    • this could easily confuse tests, e.g. atomic update tests that call transition() multiple times to reboot the device: the system might be in an unexpected state due to recovery from the "unknown" state
  • a "broken" flag could be implemented in the base strategy; exceptions should be raised as StrategyTransitionErrors so they can be handled
  • the state machine could be moved from transition() to _transition(); transition() could then wrap _transition() and set the broken flag in the except clause
    • fixtures/tests/scripts could handle the StrategyTransitionError individually, e.g. by
      • recovering via invalidate()/reset to the "unknown" state
      • recovering as above, with threshold (as outlined in one of the posts above)
      • skipping all remaining tests
      • exiting: pytest.exit() / sys.exit()
    • if multiple functions/methods should handle StrategyTransitionError the same way, exception handling could be implemented by a decorator

@jpuderer Could you share your thoughts about this?

jpuderer commented 1 year ago

  • strategies need to handle expected failures correctly themselves, e.g. bootloader not starting on every power-on

Agreed.

  • generally resetting to the "unknown" state on exceptions in transition() is surprising and wrong
    • this could easily confuse tests, e.g. atomic update tests that call transition() multiple times to reboot the device: the system might be in an unexpected state due to recovery from the "unknown" state

Not sure I completely agree. To me, having the strategy reset to an "unknown" state seems reasonable, but if and only if the strategy also throws an exception (like StrategyTransitionError) that interrupts the test. If the test doesn't catch StrategyTransitionError (which you would assume is the default case), the test will fail and exit, and transition() will not be called again. If the test catches StrategyTransitionError, then it knows what it's doing, and can expect the target to be in an unknown state.

It's not surprising and wrong, because when each test starts you don't actually know what the state of the target is. It could be anything, including "unknown". Each individual test can't make any assumptions about the state of the target that the previous test left it in. If they do, they are depending on side effects from running the tests in a particular order, which is.... surprising and wrong. ;)

  • a "broken" flag could be implemented in the base strategy; exceptions should be raised as StrategyTransitionErrors so they can be handled

What would the "broken" flag do? What are you expecting should happen if the broken flag has been set?

  • the state machine could be moved from transition() to _transition(); transition() could then wrap _transition() and set the broken flag in the except clause
    • fixtures/tests/scripts could handle the StrategyTransitionError individually, e.g. by
      • recovering via invalidate()/reset to the "unknown" state
      • recovering as above, with threshold (as outlined in one of the posts above)
      • skipping all remaining tests
      • exiting: pytest.exit() / sys.exit()
    • if multiple functions/methods should handle StrategyTransitionError the same way, exception handling could be implemented by a decorator

I agree with this. I still think the strategy could reset the target to the "unknown" state, but that tests should (or need to) be able to catch StrategyTransitionError. This would be very useful for stress/endurance tests where what you're testing is the actual transition.
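
For the endurance case that would let me write something like this (sketch; today the except clause has to catch Exception, with the proposal it would catch the narrower StrategyTransitionError):

import pytest

@pytest.mark.parametrize("iteration", range(100))
def test_boot(strategy, iteration):
    try:
        strategy.transition("shell")
    except Exception:  # with the proposal: the narrower StrategyTransitionError
        # the strategy has reset itself to "unknown", so the next iteration
        # starts from a clean slate instead of inheriting a stale state
        pytest.fail(f"boot {iteration} did not reach the shell prompt")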

jluebbe commented 1 year ago

  • generally resetting to the "unknown" state on exceptions in transition() is surprising and wrong
    • this could easily confuse tests, e.g. atomic update tests that call transition() multiple times to reboot the device: the system might be in an unexpected state due to recovery from the "unknown" state

Not sure I completely agree. To me, having the strategy reset to an "unknown" state seems reasonable, but if and only if the strategy also throws an exception (like StrategyTransitionError) that interrupts the test. If the test doesn't catch StrategyTransitionError (which you would assume is the default case), the test will fail and exit, and transition() will not be called again. If the test catches StrategyTransitionError, then it knows what it's doing, and can expect the target to be in an unknown state.

It's not surprising and wrong, because when each test starts you don't actually know what the state of the target is. It could be anything, including "unknown". Each individual test can't make any assumptions about the state of the target that the previous test left it in. If they do, they are depending on side effects from running the tests in a particular order, which is.... surprising and wrong. ;)

What we were thinking about is a test-case similar to this:


def test_update(strategy):
    strategy.transition('linux')
    # ... install update ...
    strategy.transition('off')
    strategy.transition('linux')
    # ... check that the update was installed and activated correctly ...

You're right. As long as the test doesn't catch the StrategyTransitionError, this would detect issues correctly.

  • a "broken" flag could be implemented in the base strategy; exceptions should be raised as StrategyTransitionErrors so they can be handled

What would the "broken" flag do? What are you expecting should happen if the broken flag has been set?

The reasoning was: If the strategy detects that it encountered a persistent issue, it could set a "broken" flag and not attempt further transitions. The benefit would be to avoid useless repeated attempts to re-provision a target for each test function, which would waste time and perhaps cause flash wear-out.
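
Roughly like this (sketch only; neither StrategyTransitionError nor the broken flag exist in labgrid yet, and the class is deliberately minimal):

class StrategyTransitionError(Exception):
    pass

class BrokenFlagSketch:
    def __init__(self):
        self.broken = False

    def _transition(self, status):
        raise NotImplementedError  # the concrete strategy's state machine goes here

    def transition(self, status):
        if self.broken:
            # refuse further attempts instead of re-provisioning over and over
            raise StrategyTransitionError("strategy is broken, refusing to transition")
        try:
            self._transition(status)
        except Exception as e:
            self.broken = True
            raise StrategyTransitionError(f"transition to {status} failed") from e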

As a separate idea: The generic wrapping transition() could keep an error counter per state and set the broken flag when a limit is exceeded. The limit should be configurable (including disabling it). That would limit the possible "damage" a persistent error could do.
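
Building on the sketch above, the wrapping transition() could look roughly like this (max_transition_errors is a made-up name for the configurable limit):

from collections import Counter

class CountingSketch(BrokenFlagSketch):
    def __init__(self, max_transition_errors=3):
        super().__init__()
        self.max_transition_errors = max_transition_errors  # 0/None disables the limit
        self.errors = Counter()

    def transition(self, status):
        if self.broken:
            raise StrategyTransitionError("strategy is broken, refusing to transition")
        try:
            self._transition(status)
        except Exception as e:
            self.errors[status] += 1
            if self.max_transition_errors and self.errors[status] >= self.max_transition_errors:
                self.broken = True  # give up on this state after repeated failures
            raise StrategyTransitionError(f"transition to {status} failed") from e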