Handle failures of load generator

stellar-deprecated / stellar_core_commander

A system of creating isolated Stellar test networks into which you can play transactions and record results.

Apache License 2.0

19 stars 27 forks source link

Handle failures of load generator #135

Closed marta-lokhova closed 4 years ago

marta-lokhova commented 5 years ago

somehow i can't assign reviewers for this PR.. either way, @MonsieurNicolas we should probably land this soon, this accommodates new changes in loadgen https://github.com/stellar/stellar-core/pull/2220

MonsieurNicolas commented 5 years ago

do we really need this @martalo ? I mean current logic will just cause a timeout if it fails as loadgen never succeeds. Also, I don't see why we'd retry loadgen itself: if it fails once, I'd rather fail the test

marta-lokhova commented 5 years ago

mm, sure, breaking on failure just means we'd fail the test faster (as opposed to waiting for a potentially huge number of retries). So maybe it should just raise if it sees a failure, and not retry load generation?

MonsieurNicolas commented 5 years ago

yeah. So my question is do we even bother to "fast fail" as there is already a timeout?

marta-lokhova commented 5 years ago

oh i think our timeouts are just too long. If i'm trying to create 1000 accounts, the timeout is 10003, so 50 min. Maybe the trick is to just reduce the timeout, e.g. `(numAccountsOrPayments / (batchSize txRate)) * 2` (double the expected time just in case it takes longer to close ledgers under higher loads), what do you think?

MonsieurNicolas commented 5 years ago

Actually, maybe we can't merge this change as is: we use load generator on both old and new builds so the new logic that sets loadgen.run.failed won't exist at all.

So yeah I think we need to tune down num_retries to something quite a bit smaller like what you suggested.

marta-lokhova commented 5 years ago

@MonsieurNicolas the previous change is compatible with old builds, as the commander catches exceptions when accessing metrics. but i agree that we probably don't need it and can rely on retry timeout. either way, i pushed an update (i tripled the expected time instead of doubling to give it more time to recover under high loads)

marta-lokhova commented 4 years ago

This is stale now, closing