return an error when an AWS API fails so that peer discovery will be retried

stefanmoser commented 4 years ago

Proposed Changes

Discussion thread: https://groups.google.com/g/rabbitmq-users/c/FN1hHZS7BPs

Currently, when an AWS API call fails, the plugin swallows the error and acts as if there are no peers to cluster with. This results in nodes erroneously creating multiple clusters. This behaviour is contrary to the current documentation that states that "with the AWS mechanism, EC2 API requests are retried".

Returning an error here means that rabbitmq-server will retry with the configured values and brings the behaviour inline with what the documentation says should happen.

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

[ ] Bug fix (non-breaking change which fixes issue #NNNN)
[ ] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause an observable behavior change in existing systems)
[ ] Documentation improvements (corrections, new content, etc)
[ ] Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

[x] I have read the CONTRIBUTING.md document
[ ] I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
[x] All tests pass locally with my changes
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added necessary documentation (if appropriate)
[ ] Any dependent changes have been merged and published in related repositories

Further Comments

This seems like the smallest change I could make that would have the desired effect and I hope it stimulates a conversation.

pivotal-issuemaster commented 4 years ago

@stefanmoser Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

michaelklishin commented 4 years ago

This is on the right track, I'll have to check what value we should return to get more sensible logging.

stefanmoser commented 4 years ago

From my testing, there is still full logging within the AWS peer discovery plugin of what happened (why the API call failed), along with logging in core that retries will happen. Do other peer discovery plugins provide additional logging or error messages back to core?

michaelklishin commented 4 years ago

@stefanmoser the returned value is returned from rabbitmq_peer_discovery_aws:list_nodes/0, which is used by rabbit_mnesia:run_peer_discovery_with_retries/2, and that function expects errors to be reported as {error, term()}.

michaelklishin commented 4 years ago

@stefanmoser can you please try #40 out and see if it works as expected in your tests?

rabbitmq / rabbitmq-peer-discovery-aws