Closed marta-lokhova closed 4 years ago
Somehow I can't assign reviewers for this PR. Either way, @MonsieurNicolas, we should probably land this soon: it accommodates the new loadgen changes in https://github.com/stellar/stellar-core/pull/2220
Do we really need this, @martalo? The current logic will just cause a timeout if it fails, as loadgen never succeeds. Also, I don't see why we'd retry loadgen itself: if it fails once, I'd rather fail the test.
mm, sure, breaking on failure just means we'd fail the test faster (as opposed to waiting for a potentially huge number of retries). So maybe it should just raise if it sees a failure, and not retry load generation?
yeah. So my question is do we even bother to "fast fail" as there is already a timeout?
Oh, I think our timeouts are just too long. If I'm trying to create 1000 accounts, the timeout is 1000 * 3 seconds, so 50 min. Maybe the trick is to just reduce the timeout, e.g. `(numAccountsOrPayments / (batchSize * txRate)) * 2` (double the expected time just in case it takes longer to close ledgers under higher loads). What do you think?
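To make the arithmetic concrete, here is a minimal sketch of the proposed timeout in Python. Only the formula comes from the comment above; the function name, the `safety_factor` parameter, and the example batch size and rate are hypothetical, and the result is assumed to be in seconds.

```python
import math


def loadgen_timeout(num_accounts_or_payments, batch_size, tx_rate, safety_factor=2):
    """Estimate a load generation timeout in seconds (hypothetical helper).

    Expected run time is the number of transactions needed
    (num_accounts_or_payments / batch_size) divided by the rate at which
    they are submitted (tx_rate, assumed to be transactions per second).
    The safety factor leaves room for slower ledger closes under load.
    """
    if batch_size <= 0 or tx_rate <= 0:
        raise ValueError("batch_size and tx_rate must be positive")
    expected_seconds = num_accounts_or_payments / (batch_size * tx_rate)
    return math.ceil(expected_seconds * safety_factor)


# Example values are assumptions: 1000 accounts, 100 per batch, 1 tx/s.
print(loadgen_timeout(1000, 100, 1))
```

With these assumed inputs the estimate is far below the 50 minutes a fixed per-account timeout gives for 1000 accounts, which is the point of the suggestion.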
Actually, maybe we can't merge this change as is: we use load generator on both old and new builds, so the new logic that sets `loadgen.run.failed` won't exist at all.
So yeah, I think we need to tune down `num_retries` to something quite a bit smaller, like what you suggested.
@MonsieurNicolas the previous change is compatible with old builds, as the commander catches exceptions when accessing metrics. But I agree that we probably don't need it and can rely on the retry timeout. Either way, I pushed an update (I tripled the expected time instead of doubling it, to give it more time to recover under high loads).
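The compatibility argument can be illustrated with a small Python sketch: reading the failure metric defensively, so that an old build that never reports `loadgen.run.failed` is treated as "not failed" and the test falls back to the timeout. This is not the actual commander code; the function name and the shape of the metrics dictionary (a `count` field per metric) are assumptions made for illustration.

```python
def loadgen_run_failed(metrics):
    """Return True only if the failure metric is present and non-zero.

    `metrics` is assumed to be a dict keyed by metric name, each value a
    dict with a "count" field. Old builds that do not emit the metric
    raise KeyError here, which is caught and treated as "no failure seen".
    """
    try:
        return metrics["loadgen.run.failed"]["count"] > 0
    except (KeyError, TypeError):
        return False


# An old build without the metric, then a new build reporting one failure.
print(loadgen_run_failed({}))
print(loadgen_run_failed({"loadgen.run.failed": {"count": 1}}))
```

A caller could raise as soon as this returns True and otherwise keep polling until the timeout expires, which covers both old and new builds.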
This is stale now, closing
Related to https://github.com/stellar/stellar-core/pull/2220