octodns / octodns-ns1

Ns1Provider provider for octoDNS
MIT License

Pools / rules order conflict with enforce_order flag #35

Closed. istr closed this issue 1 year ago.

istr commented 1 year ago

Observed behavior: The ns1 provider does not return the pools and rules in strict order. As a consequence, with the enforce_order flag set, every run creates a change, even though nothing changed.

Expected behavior: The order of the pools and rules does not affect the detection of changes.

I am using octodns and the ns1 provider with the following versions:

pip list | grep -E "ns1|octo"
ns1-python          0.18.0
octodns             0.9.21
octodns-constellix  0.0.3
octodns-dnsmadeeasy 0.0.1
octodns-ns1         0.0.3

The config provider has enforced order set:

providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./config
    default_ttl: 3600
    enforce_order: true

The following entry will produce a change with every run:

cdn-dev:
  - dynamic:
      pools:
        af:
          values:
            - status: up
              value: 68.70.192.128
        apac:
          values:
            - status: up
              value: 89.187.160.25
            - status: up
              value: 89.187.160.31
        eu:
          values:
            - status: up
              value: 185.172.148.128
            - status: up
              value: 185.172.148.132
        eu_E:
          values:
            - status: up
              value: 137.74.0.47
            - status: up
              value: 146.59.70.106
        eu_SW:
          values:
            - status: up
              value: 82.102.17.19
        eu_fr:
          values:
            - status: up
              value: 62.210.90.237
            - status: up
              value: 62.210.90.238
        eu_nl:
          values:
            - status: up
              value: 185.172.149.128
            - status: up
              value: 185.172.149.132
        eu_uk:  # also used for af
          values:
            - status: up
              value: 68.70.192.128
        na:
          values:
            - status: up
              value: 138.199.57.66
            - status: up
              value: 138.199.57.67
        ocean:
          values:
            - status: up
              value: 103.231.91.78
        sa:
          values:
            - status: up
              value: 87.249.130.6
      rules:
        # 'AF': 14, # Continental Africa
        # 'AN': 17, # Continental Antarctica
        # 'AS': 15, # Continental Asia
        # 'EU': 13, # Continental Europe
        # 'NA': 11, # Continental North America
        # 'OC': 16, # Continental Australia/Oceania
        # 'SA': 12, # Continental South America
        - geos:
            - AF
          pool: af  # from UK, probably using 2Africa submarine cable
        - geos:
            - AS
          pool: apac
        - pool: eu
        - geos:
            - EU-UA
            - EU-PL
            - EU-RU
            - EU-LT
            - EU-LV
            - EU-EE
          pool: eu_E
        - geos:
            - EU-AD
            - EU-ES
            - EU-PT
          pool: eu_SW
        - geos:
            - EU-FR
            - EU-MC
          pool: eu_fr
        - geos:
            - EU-NL
          pool: eu_nl
        - geos:
            - EU-GB
          pool: eu_uk
        - geos:
            - NA
          pool: na
        - geos:
            - OC
          pool: ocean
        - geos:
            - SA
          pool: sa
    ttl: 60
    type: A
    values:
      - 185.172.148.128
      - 185.172.148.132

This leads reproducibly to the following diff:

* ns1 (Ns1Provider)
*   Update
*     <ARecord A 60, cdn-dev.ingo-struck.com., ['185.172.148.128', '185.172.148.132'], {'af': {'fallback': None, 'values': [{'value': '68.70.192.128', 'weight': 1, 'status': 'up'}]}, 'apac': {'fallback': None, 'values': [{'value': '89.187.160.25', 'weight': 1, 'status': 'up'}, {'value': '89.187.160.31', 'weight': 1, 'status': 'up'}]}, 'eu_E': {'fallback': None, 'values': [{'value': '137.74.0.47', 'weight': 1, 'status': 'up'}, {'value': '146.59.70.106', 'weight': 1, 'status': 'up'}]}, 'eu_SW': {'fallback': None, 'values': [{'value': '82.102.17.19', 'weight': 1, 'status': 'up'}]}, 'eu': {'fallback': None, 'values': [{'value': '185.172.148.128', 'weight': 1, 'status': 'up'}, {'value': '185.172.148.132', 'weight': 1, 'status': 'up'}]}, 'eu_fr': {'fallback': None, 'values': [{'value': '62.210.90.237', 'weight': 1, 'status': 'up'}, {'value': '62.210.90.238', 'weight': 1, 'status': 'up'}]}, 'eu_nl': {'fallback': None, 'values': [{'value': '185.172.149.128', 'weight': 1, 'status': 'up'}, {'value': '185.172.149.132', 'weight': 1, 'status': 'up'}]}, 'eu_uk': {'fallback': None, 'values': [{'value': '68.70.192.128', 'weight': 1, 'status': 'up'}]}, 'na': {'fallback': None, 'values': [{'value': '138.199.57.66', 'weight': 1, 'status': 'up'}, {'value': '138.199.57.67', 'weight': 1, 'status': 'up'}]}, 'ocean': {'fallback': None, 'values': [{'value': '103.231.91.78', 'weight': 1, 'status': 'up'}]}, 'sa': {'fallback': None, 'values': [{'value': '87.249.130.6', 'weight': 1, 'status': 'up'}]}}, [{'pool': 'af', 'geos': ['AF']}, {'pool': 'apac', 'geos': ['AS']}, {'pool': 'sa', 'geos': ['SA']}, {'pool': 'eu'}, {'pool': 'eu_E', 'geos': ['EU-EE', 'EU-LT', 'EU-LV', 'EU-PL', 'EU-RU', 'EU-UA']}, {'pool': 'eu_SW', 'geos': ['EU-AD', 'EU-ES', 'EU-PT']}, {'pool': 'eu_fr', 'geos': ['EU-FR', 'EU-MC']}, {'pool': 'eu_nl', 'geos': ['EU-NL']}, {'pool': 'eu_uk', 'geos': ['EU-GB']}, {'pool': 'na', 'geos': ['NA']}, {'pool': 'ocean', 'geos': ['OC']}]> ->
*     <ARecord A 60, cdn-dev.ingo-struck.com., ['185.172.148.128', '185.172.148.132'], {'af': {'fallback': None, 'values': [{'value': '68.70.192.128', 'weight': 1, 'status': 'up'}]}, 'apac': {'fallback': None, 'values': [{'value': '89.187.160.25', 'weight': 1, 'status': 'up'}, {'value': '89.187.160.31', 'weight': 1, 'status': 'up'}]}, 'eu': {'fallback': None, 'values': [{'value': '185.172.148.128', 'weight': 1, 'status': 'up'}, {'value': '185.172.148.132', 'weight': 1, 'status': 'up'}]}, 'eu_E': {'fallback': None, 'values': [{'value': '137.74.0.47', 'weight': 1, 'status': 'up'}, {'value': '146.59.70.106', 'weight': 1, 'status': 'up'}]}, 'eu_SW': {'fallback': None, 'values': [{'value': '82.102.17.19', 'weight': 1, 'status': 'up'}]}, 'eu_fr': {'fallback': None, 'values': [{'value': '62.210.90.237', 'weight': 1, 'status': 'up'}, {'value': '62.210.90.238', 'weight': 1, 'status': 'up'}]}, 'eu_nl': {'fallback': None, 'values': [{'value': '185.172.149.128', 'weight': 1, 'status': 'up'}, {'value': '185.172.149.132', 'weight': 1, 'status': 'up'}]}, 'eu_uk': {'fallback': None, 'values': [{'value': '68.70.192.128', 'weight': 1, 'status': 'up'}]}, 'na': {'fallback': None, 'values': [{'value': '138.199.57.66', 'weight': 1, 'status': 'up'}, {'value': '138.199.57.67', 'weight': 1, 'status': 'up'}]}, 'ocean': {'fallback': None, 'values': [{'value': '103.231.91.78', 'weight': 1, 'status': 'up'}]}, 'sa': {'fallback': None, 'values': [{'value': '87.249.130.6', 'weight': 1, 'status': 'up'}]}}, [{'pool': 'af', 'geos': ['AF']}, {'pool': 'apac', 'geos': ['AS']}, {'pool': 'eu'}, {'pool': 'eu_E', 'geos': ['EU-EE', 'EU-LT', 'EU-LV', 'EU-PL', 'EU-RU', 'EU-UA']}, {'pool': 'eu_SW', 'geos': ['EU-AD', 'EU-ES', 'EU-PT']}, {'pool': 'eu_fr', 'geos': ['EU-FR', 'EU-MC']}, {'pool': 'eu_nl', 'geos': ['EU-NL']}, {'pool': 'eu_uk', 'geos': ['EU-GB']}, {'pool': 'na', 'geos': ['NA']}, {'pool': 'ocean', 'geos': ['OC']}, {'pool': 'sa', 'geos': ['SA']}]> (config)
*   Summary: Creates=0, Updates=1, Deletes=0, Existing Records=44
ross commented 1 year ago

I can reproduce, but even stranger I'm seeing persistent changes on records that didn't have that problem the last time I used them. Still digging into it.

ross commented 1 year ago

After some initial digging, this seems to be related to the monitoring part of dynamic records. If I push things with status: up it'll get to a settled state and show no changes. When the records include health checks they seem to persistently show changes. My guess is that NS1 has changed some details about the health checks/API, but I haven't gotten far enough into things yet to know if that's true.

ross commented 1 year ago

Yeah, the update is coming from _extra_changes

2023-03-28T07:55:32  [4644720128] INFO  Ns1Provider[ns1] _extra_changes: monitor mis-match for a.exxampled.com - A - 1.1.1.1

Which is triggered when _monitor_is_match returns False:

https://github.com/octodns/octodns-ns1/blob/bc542be4ac73d3914adcfd5d1ce2f56ff484e119/octodns_ns1/__init__.py#L1196-L1203

Printing out the two (expected vs. have):

{'expected': {'active': True,
              'config': {'connect_timeout': 2000,
                         'host': '1.1.1.1',
                         'port': 443,
                         'response_timeout': 10000,
                         'send': 'GET /_dns HTTP/1.0\\r\\nHost: '
                                 'a.exxampled.com\\r\\nUser-agent: '
                                 'NS1\\r\\n\\r\\n',
                         'ssl': True},
              'frequency': 60,
              'job_type': 'tcp',
              'name': 'a.exxampled.com - A - 1.1.1.1',
              'notes': 'host:a.exxampled.com type:A',
              'policy': 'quorum',
              'rapid_recheck': False,
              'region_scope': 'fixed',
              'regions': ['lga'],
              'rules': [{'comparison': 'contains',
                         'key': 'output',
                         'value': '200 OK'}]},
 'have': {'active': True,
          'config': {'connect_timeout': 2000,
                     'host': '1.1.1.1',
                     'ipv6': False,
                     'port': 443,
                     'response_timeout': 10000,
                     'send': 'GET /_dns HTTP/1.0\\r\\nHost: '
                             'a.exxampled.com\\r\\nUser-agent: NS1\\r\\n\\r\\n',
                     'ssl': True,
                     'tls_add_verify': False},
          'frequency': 60,
          'id': '6422ff1c65f383007cace839',
          'job_type': 'tcp',
          'mute': False,
          'name': 'a.exxampled.com - A - 1.1.1.1',
          'notes': 'host:a.exxampled.com type:A',
          'notify_delay': 0,
          'notify_failback': True,
          'notify_list': '62f29339f6491b0095148771',
          'notify_regional': False,
          'notify_repeat': 0,
          'policy': 'quorum',
          'rapid_recheck': False,
          'region_scope': 'fixed',
          'regions': ['lga'],
          'rules': [{'comparison': 'contains',
                     'key': 'output',
                     'value': '200 OK'}],
          'status': {'global': {'fail_set': ['lga'],
                                'since': 1680015196,
                                'status': 'down'},
                     'lga': {'fail_set': ['Failure for Rule: output contains '
                                          '200 OK'],
                             'since': 1680015196,
                             'status': 'down'}}}}
{'have.get': {'connect_timeout': 2000,
              'host': '1.1.1.1',
              'ipv6': False,
              'port': 443,
              'response_timeout': 10000,
              'send': 'GET /_dns HTTP/1.0\\r\\nHost: '
                      'a.exxampled.com\\r\\nUser-agent: NS1\\r\\n\\r\\n',
              'ssl': True,
              'tls_add_verify': False},
 'k': 'config',
 'mismatch': True,
 'v': {'connect_timeout': 2000,
       'host': '1.1.1.1',
       'port': 443,
       'response_timeout': 10000,
       'send': 'GET /_dns HTTP/1.0\\r\\nHost: a.exxampled.com\\r\\nUser-agent: '
               'NS1\\r\\n\\r\\n',
       'ssl': True}}

Looks like they've added new keys to the config section, tls_add_verify and ipv6, and the current method of comparison sees those as a diff.
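
For illustration, a minimal sketch of the kind of one-sided comparison that would tolerate keys NS1 adds server-side; the helper name is hypothetical, not the actual fix in the PR:

def _monitor_config_matches(expected, have):
    # Only compare the keys octoDNS itself manages; keys NS1 adds
    # server-side (e.g. tls_add_verify, ipv6) are ignored so they
    # can't show up as a persistent diff.
    return all(have.get(k) == v for k, v in expected.items())


expected = {'host': '1.1.1.1', 'port': 443, 'ssl': True}
have = {'host': '1.1.1.1', 'port': 443, 'ssl': True,
        'ipv6': False, 'tls_add_verify': False}
print(_monitor_config_matches(expected, have))  # True, extra keys ignored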

Working on a PR/solution now.

ross commented 1 year ago

rule order

The order of the pools and rules does not affect the detection of changes.

(not directly related to the issue, but to clarify)

The order of rules does matter and should be preserved. They are intended to be applied in the order listed. That allows targeting sub-locations within a larger geo, e.g. rule 0 sends EU-ES to a pool in Spain and then rule 1 sends the rest of EU to a pool in Germany.
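
A toy first-match lookup (purely illustrative, not the provider's code) shows why swapping those two rules changes the answer:

def pick_pool(client_geos, rules):
    # Rules are evaluated top to bottom; the first rule whose geos
    # intersect the client's geos wins. A rule without geos acts as
    # the catch-all.
    for rule in rules:
        geos = rule.get('geos')
        if not geos or set(geos) & set(client_geos):
            return rule['pool']
    return None


rules = [
    {'geos': ['EU-ES'], 'pool': 'spain'},  # rule 0: Spain first
    {'geos': ['EU'], 'pool': 'germany'},   # rule 1: the rest of EU
]
print(pick_pool(['EU', 'EU-ES'], rules))   # spain
print(pick_pool(['EU', 'EU-DE'], rules))   # germany
# Reversed, the 'EU' rule would match first and Spain would never reach its pool.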

I originally ran through options to try and automagically sort the rules, but quickly realized that there could be conflicts that couldn't be resolved, e.g.

- geos: [NA-US, EU]
  pool: eu
- geos: [EU-ES, NA]
  pool: na

That config doesn't make sense, and EU-ES would not be sent to na as things currently work, but if automagically sorted, which one would come first, and thus which geo wouldn't get targeted?

The new validations in https://github.com/octodns/octodns/issues/989 warn users about this sort of problem, but I still think explicit ordering is preferable.

pool order

Pools are a dictionary, so their order doesn't matter; octoDNS's preferred order is sorted keys.

this issue

I've reproduced the problem, even when running on the branch in https://github.com/octodns/octodns-ns1/pull/39, but it seems to happen regardless of whether or not I set enforce_order.

I'm seeing a diff both in the pool order and rule order. They're actually coming from 2 different things.

pool ordering

This one is an issue, but I believe it should be solved in octoDNS core. Core should persistently order the pools internally so that variations in their order don't matter, the way it already deals with differences in case, idna, etc.
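
Purely as an illustration of the mechanism (not octoDNS core's actual code): two equal pool dicts can still produce different text when walked in insertion order, which is where the apparent diff comes from; imposing sorted keys removes that.

import json

# The same pools, listed in a different order by two sources.
have = {'eu_E': {'values': ['137.74.0.47']},
        'eu':   {'values': ['185.172.148.128']}}
want = {'eu':   {'values': ['185.172.148.128']},
        'eu_E': {'values': ['137.74.0.47']}}

# Dict equality already ignores insertion order...
assert have == want
# ...but anything that walks the dicts in insertion order (repr, list(),
# serialization) will differ, so a persistent order has to be imposed
# somewhere, e.g. sorted keys:
assert json.dumps(have, sort_keys=True) == json.dumps(want, sort_keys=True)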

rule ordering

This is happening b/c the catch-all rule isn't the last rule in the chain. https://github.com/octodns/octodns/issues/989 will warn about this in the near future, so there's nothing to fix here.
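
A sketch of the kind of check such a warning could be based on (hypothetical, not the actual validation in octodns/octodns#989):

def catch_all_misplaced(rules):
    # A rule without geos is the catch-all; it is expected to be the
    # last rule in the chain.
    return any('geos' not in rule for rule in rules[:-1])


rules = [{'geos': ['AF'], 'pool': 'af'},
         {'pool': 'eu'},
         {'geos': ['NA'], 'pool': 'na'}]
print(catch_all_misplaced(rules))  # True, the catch-all is not last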

The documentation added in https://github.com/octodns/octodns/pull/991 talks about best practices/requirements for dynamic rules. It also talks about "unexpected" behaviors from providers if lenient is enabled for such records, and this would be a case of that.

plan

I need to do a bit more digging and testing, but at this point I believe there will be a PR to octodns/octodns to fix the pool ordering issue. I don't believe there will be any changes around the rule ordering.

ross commented 1 year ago

I don't believe there will be any changes around the rule ordering.

Well that was not correct.

Turns out there's a completely unrelated issue with Ns1Provider rule ordering. We stuff a rule order into the notes field and pull it back out to sort the rules so they're in the expected order:

https://github.com/octodns/octodns-ns1/blob/bc542be4ac73d3914adcfd5d1ce2f56ff484e119/octodns_ns1/__init__.py#L740-L754

There's a super subtle bug in that code... The meta.note field where the order data is stored is a string. The parsing code leaves it as-is, so rule_order is a string. When there are > 10 rules (0-indexed), the 11th one has the order "10", which doesn't sort numerically, so we get 0, 1, 10, 2, 3, ...
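
A minimal reproduction of that pitfall, with the obvious fix being to cast to int before sorting:

# rule_order comes back out of the notes field as a string, so with
# 11+ rules a lexicographic sort scrambles them.
orders = [str(i) for i in range(12)]
print(sorted(orders))           # ['0', '1', '10', '11', '2', '3', ...]
print(sorted(orders, key=int))  # ['0', '1', '2', ..., '10', '11']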

Trivial fix/PR to that incoming.

ross commented 1 year ago

I need to do a bit more digging and testing, but at this point I believe there will be a PR to octodns/octodns to fix the pool ordering issue

And after fixing the rule-ordering int bug and looking in octoDNS core, it turns out the fix I thought might be needed there is already in place: https://github.com/octodns/octodns/blob/1f8d7ade33c1d84447e134629bace1ce9a0d4e37/octodns/record/dynamic.py#L296-L297.

So I believe this is actually just a bug in the rule-order handling.

istr commented 1 year ago

So I believe this is actually just a bug in the rule-order handling.

I can confirm this. #35 is fixed with #42.

istr commented 1 year ago

I originally ran through options to try and automagically sort the rules, but quickly realized that there could be conflicts that couldn't be resolved, e.g.

- geos: [NA-US, EU]
  pool: eu
- geos: [EU-ES, NA]
  pool: na

That config doesn't make sense, and EU-ES would not be sent to na as things currently work, but if automagically sorted, which one would come first, and thus which geo wouldn't get targeted?

I think that this problem could be solved using one of the following two rulesets (probably kind of what you implemented in https://github.com/octodns/octodns/pull/991)

ruleset A)

result: your example would be rejected

ruleset B)

result: your example would be resolved to:

- geos: [NA-US]
  pool: eu
- geos: [EU]
  pool: eu
- geos: [EU-ES]
  pool: na
- geos: [NA]
  pool: na

(first step)

- geos: [NA-US]
  pool: eu
- geos: [EU-ES]
  pool: na
- geos: [EU]
  pool: eu
- geos: [NA]
  pool: na

(third step)

This would be unambiguous and probably what was intended behavior.

Note that it would make no difference even if the sorting were unstable in this scenario.

This logic could also be moved back into octodns core, so the providers need only implement how to configure the pools and the rules. The semantics of pool / rule handling would then become completely consistent across providers.
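
A rough sketch of ruleset B's split-and-sort step (purely illustrative; as the next comment points out, pools can't actually be reused across rules, so this transformation isn't valid in practice):

def split_rules(rules):
    # Step 1: split each multi-geo rule into one rule per geo (the
    # catch-all, which has no geos, is assumed to be handled separately).
    split = [{'geos': [g], 'pool': r['pool']} for r in rules for g in r['geos']]
    # Then sort more specific geos (e.g. 'EU-ES') ahead of continental
    # ones (e.g. 'EU'); the sort is stable, so ties keep their order.
    return sorted(split, key=lambda r: -r['geos'][0].count('-'))


rules = [{'geos': ['NA-US', 'EU'], 'pool': 'eu'},
         {'geos': ['EU-ES', 'NA'], 'pool': 'na'}]
for rule in split_rules(rules):
    print(rule)
# {'geos': ['NA-US'], 'pool': 'eu'}
# {'geos': ['EU-ES'], 'pool': 'na'}
# {'geos': ['EU'], 'pool': 'eu'}
# {'geos': ['NA'], 'pool': 'na'}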

ross commented 1 year ago

ruleset A) ... ruleset B) ...

Unfortunately pools cannot be reused in rules (except when the 2nd use is as the catch-all). There's an existing validation in place to ensure this, and it has been there for a long time, if not as long as dynamic has been a thing.

This comes about b/c several providers' implementations don't allow re-using pools. In some cases it might be possible to work around that by creating fake pools behind the scenes, but without paging all the details of all the dynamic providers back into memory I don't know if that would be possible for all of them (probably not; guessing that's why the rule came into place).

istr commented 1 year ago

This comes about b/c several providers implementations don't allow re-using pools.

Ok, I see. Then the idea (B) is not feasible. 🙂

But I still think that such problematic "mixed" configurations are worth at least a warning, or a rejection (A) in some kind of strict mode.