Cloudflare TTL 'auto' differences

KRoss45 commented 9 months ago

Issue is an assumption of what's going on.

We've been using 0.9.11 forever and have started testing 1.x. Last round of tests was with 1.2.1.

What is seen in testing on a single domain is the TTL is flagged for update. eg:

*   Update
*     <ARecord A 300, activeactive.lavexproducts.com., ['192.168.55.22', '10.20.55.22']> ->
*     <ARecord A 1, activeactive.lavexproducts.com., ['192.168.55.22', '10.20.55.22']> (config)

In Cloudflare, the above A record is set to Auto, where 1 is what you set in the config.
Config:

activeactive:
  octodns:
    cloudflare:
      proxied: true
  ttl: 1
  type: A
  values:
    - 192.168.55.22
    - 10.20.55.22

The output snip from above is another dry run after applying the updates already; so octodns wants to re-apply the same change despite having just done it.

Running with octodns-cloudflare 0.0.3

ross commented 9 months ago

Interesting. Don't know of anything off-hand that would have changed between 0.9.x and 1.x around TTL handling, but the cloudflare auto stuff has always been finicky so not super surprised. Will see if I can reproduce it. Sounds like the record was created by octoDNS and then persistent changes are being seen after that. That correct?

KRoss45 commented 9 months ago

Correct. This is a recent domain add so it's one of the cleanest we have. With going to 1.x we've seen this issue across all of the zones where it's seeing '300' for the TTL when config says '1'.

I ran a dry run sync using our current in use and the new version. The debug is showing a different TTL back from (I'm assuming) the provider pulling the live config:

.9.11
[140632777743232] DEBUG Zone changes: zone=Zone<lavexproducts.com.>, n.c. record=<ARecord A 1, activeactive.lavexproducts.com.,

1.2
[140327634467712] DEBUG Zone changes: zone=Zone<lavexproducts.com.>, modified
    existing=<ARecord A 300, activeactive.lavexproducts.com.,

I re-confirmed, CF currently showing auto for the given A record.

KRoss45 commented 9 months ago

I walked the versions backward and I've narrowed it down to a change that happened in .9.14. I see it supports both the old and the new CF providers in the config. Using either of these providers:

<     class: octodns.provider.cloudflare.CloudflareProvider 
---
>     class: octodns_cloudflare.CloudflareProvider

I get the 'wrong' TTL info coming back from the live CF info where it wants to do an update.

.9.13 doesn't support the newer octodns_cloudflare.CloudflareProvider and the sync dry run comes back clean.

istr commented 9 months ago

@KRoss45 I think you should move this issue to the octodns-cloudflare provider/repo so we'll fix it there.

There is this diff 0.9.13 vs. 0.9.14 that looks suspicious to me and matches the behavior you describe:

@@ -166,10 +168,13 @@ class CloudflareProvider(BaseProvider):
                 else:
                     page = None

-            self._zones = {'{}.'.format(z['name']): z['id'] for z in zones}
+            self._zones = {f'{z["name"]}.': z['id'] for z in zones}

         return self._zones

+    def _ttl_data(self, ttl):
+        return 300 if ttl == 1 else ttl
+
     def _data_for_cdn(self, name, _type, records):
         self.log.info('CDN rewrite for %s', records[0]['name'])
         _type = "CNAME"
@@ -177,14 +182,14 @@ class CloudflareProvider(BaseProvider):
             _type = "ALIAS"

         return {
-            'ttl': records[0]['ttl'],
+            'ttl': self._ttl_data(records[0]['ttl']),
             'type': _type,
-            'value': '{}.cdn.cloudflare.net.'.format(records[0]['name']),
+            'value': f'{records[0]["name"]}.cdn.cloudflare.net.',
         }

     def _data_for_multiple(self, _type, records):
         return {
-            'ttl': records[0]['ttl'],
+            'ttl': self._ttl_data(records[0]['ttl']),
             'type': _type,
             'values': [r['content'] for r in records],
         }

KRoss45 commented 9 months ago

@istr I would, but despite trying multiple browsers I don't have the right hand sidebar and no move button to be seen. We have a private GitLab that I checked on that indeed has the button in the sidebar, so I'm not sure where the fault lies.

istr commented 9 months ago

@KRoss45 here you go. By "you should move" I meant "close it there, open it here". I think there is no other way for you to do this at the moment.

KRoss45 commented 9 months ago

@KRoss45 here you go. By "you should move" I meant "close it there, open it here". I think there is no other way for you to do this at the moment.

Ahh, sorry I was used to having the functionality and got wrapped around the axle trying to figure it out. TY for moving it, it saved me from copying over all the updated findings on my end.

istr commented 9 months ago

I think the fix would be simple. Revert the logic in

    def _ttl_data(self, ttl):
        return 300 if ttl == 1 else ttl

and clearly document the special meaning of 1 as a TTL in the README.md.

@ross However, I am not sure if we want to have that "fixed", since it is a rather uncommon feature that is not really in line with "support a common baseline for lots of providers and streamline the results".

ross commented 9 months ago

Tracked that change to where it was made back before things were extracted and found a conversation about it

https://github.com/octodns/octodns/pull/722/files#r675836640

Amazingly it looks like I had a PR with the comment about the details that was closed when things were moved out and the actual change seems to come from https://github.com/octodns/octodns/pull/210 which has the reasoning.

It looks like I made that change on purpose so that dumps from cloudflare would get the actual behavior and not have ttl=1 in them (which would behave differently when pushed to other providers.)

I think this is our best option for handling the automatic TTLs with the mindset of trying to get multiple providers to behavior the same way from a single config (octoDNS's mindset.)

Not saying that the behavior is right, the years since have erased any relevant thinking/details.

I think this kind of boiled down to best practice is don't use auto and put 300 ttls on those records since that's what Cloudflare actually says it does under the hood.

Feel free to discuss/argue against that, but thought the history/context was relevant.

KRoss45 commented 9 months ago

I'm confused how hard-coding a value to a single provider behavior to "make them all behave the same" is valid justification for creating the issue at hand here. In doing so, we now have a perpetual update happening to our zones because the config value does not match the live value since the live value has been adulterated.

There's got to be a better way of handling the difference without corrupting the live config info.

ETA: Ugh I suck at gitlabing. I din't mean to close this, I mis-clicked.

KRoss45 commented 9 months ago

Reopening because I'm clumsy.

ross commented 9 months ago

I'm confused how hard-coding a value to a single provider behavior to "make them all behave the same" is valid justification for creating the issue at hand here. In doing so, we now have a perpetual update happening to our zones because the config value does not match the live value since the live value has been adulterated.

octoDNS's primary purpose is to allow users to define their DNS config once and push it to multiple providers and get (as close to) the same behavior as possible. To easily be able to enable split authority DNS and avoid the SPOF of a DNS provider and/or more generally allow easy DNS portability by changing providers whenever desired.

To that end Cloudflare's auto is only a thing for CF and no other provider seems to have something similar. In practice the auto just means 300s (per their doc.) So the best option for getting multiple providers to behave the same is to put 300 on those auto records in the Yaml config as that's what they're going to be doing anyway. That way when the config is pushed to other providers they'll get the same behavior of 300s TTLs. If the config has 1s in it and was pushed to any other provider that would be a literal 1s, effectively no TTL.

istr commented 9 months ago

In doing so, we now have a perpetual update happening to our zones because the config value does not match the live value since the live value has been adulterated.

From my point of view, the live value of 1, which denotes a different TTL (namely auto -> 300), is the problem here, as it is simply a misrepresentation in the API. If you set your config to a reasonable value, this will be supported by most (or all) providers. While the standard allows for any 32-bit signed integer TTL value ("For example, SOA records are always distributed with a zero TTL to prohibit caching."), there is no real consensus on what a "reasonable" lower bound means. However, it seems to be somewhere between 30sec and 1min. It also seems that CF does not support real values lower than 30.

See also:

https://ns1.com/blog/what-is-the-lowest-ttl-i-can-get-away-with
https://developers.cloudflare.com/dns/manage-dns-records/reference/ttl/
https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/best-practices-dns.html
https://constellix.com/news/best-time-to-live-ttl-settings ("While you can set TTL as low as 0 for a DNS record in Constellix, some resolvers may not support values under 30 seconds. If you want a TTL value that is less than 30 seconds, be sure to test the record first.")

@ross imho we could even consider adding a sanity check (and warning or info message) for the lower bound of the TTL, so the user can see the minimum value supported by all providers in a given configuration.

KRoss45 commented 9 months ago

My criticism is in the effort of trying to make them all be "the same" you've only created the illusion of it and in turn introduced a false positive for "out of sync" with the config. They can't be the same for the simple fact that they are not.

To me, a cleaner approach would be to add functionality to the app end for an 'auto' flag and if the provider supports it, utilize the known provider value for auto and ignore the value of TTL altogether. If the provider does not support TTL, fall back to the TTL value. This enables the functionality, if it is not supported you have a valid value to use and most importantly, does not change the value of data coming back from the provider when querying the config.

istr commented 9 months ago

I would argue that mangling values and thus creating a false "out of sync" may not be the best approach. Therefore, I think it is still a valid consideration to revert the mangling.

However, the stated objective of octodns is to achieve a consistent result across a large number of vendors with a single configuration. This is where octodns is good and where it differs from other solutions such as terraform.

This inevitably leads to a "lowest common denominator" compromise. Features that are only offered by a few or individual vendors are obviously not considered in this approach.

If the goal is to support the full feature set of a single vendor (or a small set of vendors), there may be other software (such as terraform) that is structurally better suited to do this, because the configuration there responds individually to each vendor. On the other hand, the user must strive for consistency in the configuration (or struggle with it there).

ross commented 9 months ago

I would argue that mangling values and thus creating a false "out of sync" may not be the best approach. Therefore, I think it is still a valid consideration to revert the mangling.

The mangling is happening in CF's api where it's returning 1 to mean auto b/c there's no way to denote "auto" in an integer field. 1 isn't the "right" value it's just what CF stuck there. It's not actually a 1s TTL (it's 300s.)

It's important to take the 1 -> 300 bit in context of octoDNS (dump.) For example imagine someone is moving away from Cloudflare to Route53. They do an octodns-dump to get the current state of what's in CF. That creates a YAML file with all the record data. Assuming they have AUTO values there'll be a bunch of stuff with 1s in the dumped YAML which when you try and push to Route53 (or anywhere else) won't make any sense and won't match CF's behavior.

If some sort of explicit auto specific to CloudflareProvider was going to be implemented it'd be configured under octodns.cloudflare, maybe

something:
    octodns:
        cloudflare:
            auto-ttl: true
    type: ...
    ttl: <ignored-for-cf-used-by-everyone-else>
    value: ....

The 1 -> 300 swap would still happen when the YAML is dumped and the octodns.cloudflare.auto-ttl: true bit could be added as well. The provider could then ignore the configured TTL and force it to 1 when populating and thus not show any diffs (be ignoring the config's TTL.) It could also push new auto values as well.

ross commented 9 months ago

That'd also need to place nicely with octodns.cloudflare.proxied, I guess whenever it's set auto-ttl would be implied, but auto-ttl could be set on non-proxied records which would be 300s for all intents and purposes.

ross commented 8 months ago

Should be addressed as of https://github.com/octodns/octodns-cloudflare/pull/65

octodns / octodns-cloudflare

Cloudflare TTL 'auto' differences #62