juju / charm-helpers

Apache License 2.0

distributed_wait function in hahelpers/cluster.py isn't optimal for non-sequential unit numbers #142

Open bbcmicrocomputer opened 6 years ago

bbcmicrocomputer commented 6 years ago

The distributed_wait function in contrib/hahelpers/cluster.py relies on the modulo_distribution function in core/host.py.

The modulo_distribution function assumes the unit numbers are close together, say units 0, 1, 2. However, if you add and then remove units, you can end up with, say, three units numbered 0, 3, 6, which all get the same wait time with a modulo of 3. For tasks like service restarts, some patterns of unit numbers will therefore all restart at the same time, which isn't the desired outcome!
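
To make the collision concrete, here is a minimal sketch. modulo_wait is a hypothetical, simplified stand-in for modulo_distribution: it assumes the wait is (unit number % modulo) * wait and takes the unit name as an argument rather than reading it from the hook environment. The default values of 3 and 30 are illustrative only.

```python
def modulo_wait(unit_name, modulo=3, wait=30):
    """Hypothetical stand-in: derive a wait time from the unit number.

    Assumes unit names look like 'app/N' and that the wait is
    (N % modulo) * wait seconds.
    """
    unit_number = int(unit_name.split('/')[1])
    return (unit_number % modulo) * wait


# Contiguous unit numbers spread out across the wait slots.
contiguous = {u: modulo_wait(u) for u in ('app/0', 'app/1', 'app/2')}

# After adding and removing units, the survivors can share one slot.
sparse = {u: modulo_wait(u) for u in ('app/0', 'app/3', 'app/6')}

print(contiguous)  # {'app/0': 0, 'app/1': 30, 'app/2': 60}
print(sparse)      # {'app/0': 0, 'app/3': 0, 'app/6': 0}
```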

A potential alternative solution would be for each charm to use a peer relation (existing or new) to work out its sequenced wait time based on the number of units.
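
One way this could look, as a rough sketch only: rank the units by number and use each unit's position in that ranking instead of its raw unit number. ranked_wait and unit_number are hypothetical names, and the peer list is passed in as a plain argument; in a charm it would presumably come from the peer relation.

```python
def unit_number(unit_name):
    """Extract the numeric suffix from a unit name such as 'app/3'."""
    return int(unit_name.split('/')[1])


def ranked_wait(local, peers, wait=30):
    """Hypothetical helper: wait time based on rank among known units.

    Sorts the local unit together with its peers by unit number and
    multiplies the local unit's position in that order by `wait`.
    """
    units = sorted(set(peers) | {local}, key=unit_number)
    return units.index(local) * wait


# Units 0, 3 and 6 now get distinct, evenly spaced waits.
print(ranked_wait('app/0', ['app/3', 'app/6']))  # 0
print(ranked_wait('app/3', ['app/0', 'app/6']))  # 30
print(ranked_wait('app/6', ['app/0', 'app/3']))  # 60
```

The ranking is only as good as each unit's view of its peers: if units see different peer lists while the relation is still settling, two units could briefly compute the same rank.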

afreiberger commented 4 years ago

Are there any charms using this helper today that lack a peer relation we could code against? We could try querying relation_ids("cluster") or relation_ids("peer") and, if that returns anything, sort the list of actual unit numbers and return this unit's offset within that list to the calling charm. The other option is to at least add a check: if that relation exists and any of its units produce the same modulo result, log a prominent warning in the unit logs that modulo-nodes may need tweaking.
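
The warning option could be sketched roughly as follows. modulo_collisions is a hypothetical helper, not an existing charm-helpers function; it takes the unit names as a plain list (assumed to be gathered from the peer relation) and reports which units land on the same modulo result. The caller would then log the result at warning level.

```python
from collections import defaultdict


def modulo_collisions(unit_names, modulo=3):
    """Hypothetical check: group units that share the same modulo result.

    Returns a dict mapping each colliding remainder to the sorted list
    of unit names that produce it; empty if every unit has its own slot.
    """
    buckets = defaultdict(list)
    for name in unit_names:
        remainder = int(name.split('/')[1]) % modulo
        buckets[remainder].append(name)
    return {r: sorted(names) for r, names in buckets.items() if len(names) > 1}


collisions = modulo_collisions(['app/0', 'app/3', 'app/6'])
if collisions:
    # In a charm this message would go to the unit log instead of stdout.
    print('WARNING: units share a wait slot, modulo-nodes may need '
          'tweaking: {}'.format(collisions))
```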