Use Case
When provisioning systems, sometimes the request can fail because of a transient issue in the service providing the target system. In this case litmus should automatically retry the provisioning task using randomized exponential backoff and a low number of maximum tries to reduce environmental failures when running tests.
Describe the Solution You Would Like
In https://github.com/puppetlabs/puppet_litmus/blob/877ab3e341d83a35757f4d90388dc522f76e24c4/lib/puppet_litmus/rake_helper.rb#L137-L141, when the task returns an error indicating a transient server issue (kind: 'provision/transient_error'), instead of raising an error and aborting, the code should retry the call using the retries gem with max_tries=3; base_sleep_seconds=10; max_sleep_seconds=60. This should give the service enough time to recover between tries.
Once litmus has this capability, change the various provision tasks to return kind: 'provision/transient_error' in the appropriate circumstances, that is, in error situations where the target system can plausibly recover by waiting a short time. For example, network timeouts can be retried; as a negative example, a full disk might require human intervention.
Additional Context
Provisioning failures are now one of the leading failure cases of our scheduled module acceptance runs. See https://ui.honeycomb.io/puppet-modules/datasets/litmus-tests/result/2gyHZHCLPYq for tracking.
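
To make the requested behavior concrete, here is a rough sketch of the retry loop described under "Describe the Solution You Would Like". It uses only the Ruby standard library so that it runs on its own; in litmus the loop would be replaced by the retries gem's with_retries helper, configured with the same max_tries, base_sleep_seconds and max_sleep_seconds values. The names TransientProvisionError, transient_failure? and provision_with_retries are hypothetical, the result shape (an array of hashes carrying a Bolt-style '_error' hash with a 'kind' key) is an assumption about what the provision task returns, and the exact jitter formula is an illustrative choice rather than the gem's.

```ruby
# Hypothetical error class, used only inside this sketch to signal a retry.
class TransientProvisionError < StandardError; end

TRANSIENT_KIND = 'provision/transient_error'

# Assumed result shape (not confirmed by the issue):
#   [{ 'status' => 'failure',
#      'result' => { '_error' => { 'kind' => '...', 'msg' => '...' } } }]
def transient_failure?(bolt_result)
  first = bolt_result.first
  return false if first['status'] == 'success'

  first.dig('result', '_error', 'kind') == TRANSIENT_KIND
end

# Yields the attempt number to the block, which is expected to run the
# provision task and return its result. Transient failures are retried with
# randomized exponential backoff; any other result is returned unchanged so
# the caller can keep its existing error handling.
def provision_with_retries(max_tries: 3, base_sleep_seconds: 10,
                           max_sleep_seconds: 60,
                           sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    result = yield(attempt)
    raise TransientProvisionError, "attempt #{attempt} failed" if transient_failure?(result)

    result
  rescue TransientProvisionError
    raise if attempt >= max_tries

    # Double the ceiling on every attempt and cap it at max_sleep_seconds.
    ceiling = [base_sleep_seconds * (2**(attempt - 1)), max_sleep_seconds].min
    # Jitter: sleep between 50% and 100% of the ceiling (illustrative choice).
    sleeper.call(ceiling * (0.5 + (rand * 0.5)))
    retry
  end
end
```

A caller would wrap the existing task invocation in the block, for example provision_with_retries { |_attempt| run_provision_task }, where run_provision_task is a hypothetical stand-in for whatever call litmus makes today. Passing a no-op sleeper keeps unit tests fast. Non-transient failures such as a full disk are returned after the first attempt, so they still abort immediately.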