octodns / octodns-constellix

Constellix DNS provider for octoDNS
MIT License

dynamic: constellix provider records deleted during pool updates #7

Closed istr closed 2 years ago

istr commented 2 years ago

There is a severe problem in the Constellix provider that makes the dynamic feature unusable for me in its current state. During updates of records with `dynamic` set, the existing entries are deleted first, which leads to intermittent outages.

You can observe this behavior if you log in to the Constellix Web UI and watch your record while the octoDNS update is in progress.

This might go unnoticed with long TTLs but leads to failures with short TTLs, because the update process is comparatively slow (the Constellix provider needs about a minute where e.g. NS1 finishes in seconds for the same configuration).

Expected behavior: updates are done in a single transaction (as is possible in the Web UI), or at least existing answers are not deleted during pool updates. There is an API call to bulk-update multiple pools; maybe using it could improve the situation.

ross commented 2 years ago

Nice catch. Very few APIs support atomic operations; iirc Dyn was the only one that truly did. Route53 sort of supports it with batched changes, which in many cases gets you atomicity, but there's a limit to the change-set size there, and if your record is big/involved enough it can still get split up.

In general, providers try to add new things before deleting old ones so that, worst case, something is still returned to users, but again this is not always possible, as some APIs/providers throw consistency errors in such cases.
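The "add new before deleting old" ordering can be sketched as follows. This is a minimal illustration, not the actual octodns-constellix code: `FakeApi` and all its method names are hypothetical stand-ins.

```python
# Sketch of create-before-delete ordering for pool updates.
# FakeApi and its methods are illustrative only, not the real client.

class FakeApi:
    def __init__(self):
        self.calls = []  # record the order of API operations

    def create_pool(self, pool):
        self.calls.append(("create", pool))
        return pool

    def update_record(self, record, pools):
        self.calls.append(("update", record))

    def delete_pool(self, pool):
        self.calls.append(("delete", pool))


def apply_pool_update(api, record, new_pools, old_pools):
    # 1. Create replacements while the old pools still answer queries.
    created = [api.create_pool(p) for p in new_pools]
    # 2. Repoint the record at the new pools.
    api.update_record(record, pools=created)
    # 3. Only now remove what is no longer referenced.
    for p in old_pools:
        api.delete_pool(p)


api = FakeApi()
apply_pool_update(api, "www", ["pool-new"], ["pool-old"])
assert api.calls[0][0] == "create" and api.calls[-1][0] == "delete"
```

With this ordering a failure mid-update leaves stale-but-working pools behind rather than a record with no answers.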

One thing that some of the dynamic providers additionally do to minimize problems is compute "diffs" and only make the exact changes necessary, i.e. modify rules/pools that changed, add those that are new, and remove what's no longer needed (in that order when possible.)
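The diff idea can be shown with a small sketch, assuming pools are keyed by name and represented here as plain dicts (the real provider objects look different):

```python
# Compare desired vs. existing pools and only touch what changed.

def pool_diff(existing, desired):
    """Return (to_modify, to_add, to_remove); applying them in that
    order keeps existing answers available as long as possible."""
    to_modify = [n for n in desired
                 if n in existing and existing[n] != desired[n]]
    to_add = [n for n in desired if n not in existing]
    to_remove = [n for n in existing if n not in desired]
    return to_modify, to_add, to_remove


existing = {"us-east": ["1.1.1.1"], "eu-west": ["2.2.2.2"]}
desired = {"us-east": ["1.1.1.2"], "ap-south": ["3.3.3.3"]}
print(pool_diff(existing, desired))
# → (['us-east'], ['ap-south'], ['eu-west'])
```

Untouched pools ("eu-west" aside, anything unchanged) generate no API calls at all, which shrinks the window in which answers can go missing.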

The TTL only papers over the problem and lowers the probability of an issue; it doesn't solve it.

istr commented 2 years ago

I guess that the problem could be solved if all providers handled the global fallback separately from the pools. See https://github.com/octodns/octodns/issues/825.

The documentation says:

```yaml
# These values become a non-healthchecked default pool
values:
  - 5.5.5.5
  - 6.6.6.6
  - 7.7.7.7
```

The problem could be solved if that global fallback were always treated separately from the configured pools. At minimum, this entry could be served the whole time, and the update of these values could be made atomic in any case (at least for the providers I have tried so far). That would probably lead to some "suboptimal" answers during longer update phases (especially with throttling in place, which is why I observed it), but at least there would be some answer all the time.
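The suggested ordering can be sketched in a few lines. Again, `FakeApi` and its method names are hypothetical, not a real provider API:

```python
# Sketch: swap the non-healthchecked fallback values in one record-level
# call before touching any pools. All names here are illustrative.

class FakeApi:
    def __init__(self):
        self.calls = []

    def update_record_values(self, record, values):
        # One call; many provider APIs can replace a record's plain
        # values atomically.
        self.calls.append(("values", record, tuple(values)))

    def apply_pool_change(self, change):
        # Pool updates may be many slow, throttled calls.
        self.calls.append(("pool", change))


def safe_dynamic_update(api, record, fallback, pool_changes):
    # Fallback first: during the (possibly long) pool-update window,
    # clients may get a "suboptimal" fallback answer, but never nothing.
    api.update_record_values(record, fallback)
    for change in pool_changes:
        api.apply_pool_change(change)


api = FakeApi()
safe_dynamic_update(api, "www", ["5.5.5.5", "6.6.6.6"], ["modify-pool-a"])
assert api.calls[0][0] == "values"
```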

So maybe it would be a good idea to add some global test cases for the dynamic feature that assert, for each provider, that existing answers are never all removed during an update.

I am currently working on a patch to address https://github.com/octodns/octodns/issues/825, so that could probably solve this problem at least for constellix.

istr commented 2 years ago

@ross This issue should be transferred to https://github.com/octodns/octodns-constellix now.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.