kinvolk / lokomotive

🪦 DISCONTINUED Further Lokomotive development has been discontinued. Lokomotive is a 100% open-source, easy to use and secure Kubernetes distribution from the volks at Kinvolk
https://kinvolk.io/lokomotive-kubernetes/
Apache License 2.0

Handle random cloud API errors #776

Open johananl opened 4 years ago

johananl commented 4 years ago

In some cases an operation against a cloud provider's API may fail at random, for example due to a network glitch, a rare race condition or an eventual consistency problem. Right now such cases cause cluster operations (apply or destroy) to fail, whereas re-running the same operation often works fine (though not always).

Here is a sample of such a failure:

module.aws-johannes-test.aws_subnet.public[1]: Destroying... [id=subnet-00b626c065bb75527]
module.aws-johannes-test.aws_subnet.public[0]: Destroying... [id=subnet-06c326b9074582e30]
module.aws-johannes-test.aws_subnet.public[2]: Destroying... [id=subnet-0bac9efa6d101cd5f]
module.aws-johannes-test.aws_subnet.public[1]: Destruction complete after 1s
module.aws-johannes-test.aws_subnet.public[0]: Destruction complete after 1s
module.aws-johannes-test.aws_subnet.public[2]: Destruction complete after 1s
module.aws-johannes-test.aws_internet_gateway.gateway: Still destroying... [id=igw-07265ad9648d4c3b4, 3m10s elapsed]
module.aws-johannes-test.aws_internet_gateway.gateway: Destruction complete after 3m12s

Error: RequestError: send request failed
caused by: Post https://ec2.eu-central-1.amazonaws.com/: read tcp 192.168.0.4:51710->54.239.55.102:443: read: connection reset by peer

FATA[0220] error destroying cluster: failed checking execution status: exit status 1  args="[]" command="lokoctl cluster destroy"

It would be nice if we could somehow compensate for such failures, for example by tweaking parameters such as max_retries. We should find a good balance: a single failure shouldn't fail the entire process, but the user also shouldn't be left hanging for a very long time when an API operation is consistently failing.
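
To make the trade-off concrete, here is a minimal sketch of a bounded retry, written in Python for brevity. It is not Lokomotive or Terraform code. The names `TransientError` and `retry`, and all default values, are assumptions chosen for illustration. The idea is to retry transient errors with capped exponential backoff, and to give up after either a maximum number of retries or an overall deadline, whichever comes first.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a connection reset)."""


def retry(operation, max_retries=5, base_delay=1.0, max_delay=30.0,
          deadline=300.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation(), retrying transient failures with capped backoff.

    Gives up after max_retries retries or once deadline seconds have
    elapsed, whichever comes first, and re-raises the last error.
    All defaults are illustrative assumptions, not tuned values.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            attempt += 1
            elapsed = clock() - start
            if attempt > max_retries or elapsed >= deadline:
                # Consistently failing: stop so the user isn't left hanging.
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, backoff)
            # Never sleep past the overall deadline.
            sleep(min(delay, deadline - elapsed))
```

With this shape, a single "connection reset by peer" costs one short delay, while a consistently failing call surfaces its error once the retry budget or the deadline is exhausted. Only errors classified as transient are retried; anything else propagates immediately.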

invidian commented 3 years ago

Duplicate of #25?