sapcc / octavia-f5-provider-driver

Apache License 2.0

Big-IP config sync fails with "user-defined" entries in health monitor configuration #226

Open m-kratochvil opened 1 year ago

m-kratochvil commented 1 year ago

When a health monitor of type http is configured (in a Big-IP TMOS) with user-defined options related to TLS, it causes config sync failure when a standby Big-IP tries to load and parse the config from the active unit.

Example problematic configuration:

ltm monitor http /net_96409ead_4415_41ab_83ef_465773ee9487/lb_dd30d753-71fa-414e-bc2a-bfd5f1dbabea/hm_4618ffce-4633-4857-a348-6bc4d4cd03ce {
    adaptive disabled
    adaptive-limit 1000
    adaptive-sampling-timespan 180
    defaults-from /Common/http
    destination 10.180.240.29:50000
    interval 10
    ip-dscp 0
    recv "HTTP/1.(0|1) 200"
    recv-disable none
    send "GET /FormsProcessing/AdobeDocumentServices/AvailabilityCheck?j_username=ads_lb_user&j_password=JifnU4us HTTP/1.0\r\n\r\n"
    time-until-up 0
    timeout 31
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}

These are the offending lines:

    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
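
To find other affected monitors before they break a sync, a saved config dump could be scanned for such lines. This is a minimal sketch, assuming the monitor stanzas are flat (no nested braces) as in the example above; `find_user_defined` and the sample names are hypothetical and not part of the provider driver or of any F5 tooling.

```python
import re

# Opening line of a stanza such as "ltm monitor http /partition/name {".
STANZA_START = re.compile(r"^\s*ltm monitor (\S+) (\S+) \{\s*$")


def find_user_defined(config_text):
    """Return {monitor_path: [(key, value), ...]} for stanzas with user-defined lines."""
    findings = {}
    current = None
    for line in config_text.splitlines():
        start = STANZA_START.match(line)
        if start:
            current = start.group(2)  # full monitor path
            continue
        stripped = line.strip()
        if stripped == "}":
            current = None  # end of a (flat) stanza
        elif current and stripped.startswith("user-defined "):
            parts = stripped.split(None, 2)
            if len(parts) == 3:
                findings.setdefault(current, []).append((parts[1], parts[2]))
    return findings


# Shortened, made-up stanza in the same shape as the example above.
SAMPLE = """\
ltm monitor http /net_example/lb_example/hm_example {
    defaults-from /Common/http
    interval 10
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}
"""

if __name__ == "__main__":
    for path, entries in find_user_defined(SAMPLE).items():
        print(path, entries)
```

Any monitor path this reports is a candidate for the same sync failure on the standby unit.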

The above example was present in eu-de-2 on the active Big-IP eu-de-2-lb014a-02, loadbalancer dd30d753-71fa-414e-bc2a-bfd5f1dbabea.

We had a very similar issue back in 2020, which was troubleshot in F5 case #C3296266. It was found that these config lines appear when a health monitor of a certain type is created, deleted, and then created again with the same name but a different type, with no config sync taking place after the delete action. F5 filed a bug for it, ID 423304: https://cdn.f5.com/product/bugtracker/ID423304.html

I opened a new case #00409954 for the current issue and will try to reproduce it following the description in the above bug.

Open questions:

Follow up tasks:

m-kratochvil commented 1 year ago

The alert for failed sync does exist (defined HERE), but it is set to severity info, which is why it was missed. It was supposed to be raised to severity warning, but that was simply forgotten. I will raise the severity accordingly.

m-kratochvil commented 1 year ago

I could not reproduce the issue in qa-de-1, based on the behavior described in the mentioned old bug ID 423304.

Test steps I tried:

  1. configure HTTPS type monitor
  2. disable automatic config sync
  3. delete the monitor
  4. create new monitor, with the same name but type HTTP
  5. re-enable automatic config sync
  6. verify the monitor is present on the standby device, and check the configuration

Every attempt resulted in correct HM configuration on both devices.

However, I can’t simulate exactly what I saw in production, because neither the CLI nor the UI offers an option to configure the user-defined entries. This leads me to believe these entries were introduced through AS3, either automatically when some condition is matched, or explicitly by the provider driver.

Next I will investigate AS3, whether these user-defined options exist there.

m-kratochvil commented 1 year ago

I searched through the provider driver, octavia and helm charts repositories and did not find any reference to the user-defined entries mentioned above. I also searched the F5 TMOS and AS3 documentation and found nothing except an older bug that bears a distant similarity to the issue described here.

I will take this back to F5 stating these config entries were not pushed into the Big-IP configuration by us and must have been "injected" by the Big-IP itself. But before I do so @BenjaminLudwigSAP can you please check yourself to confirm my findings?

BenjaminLudwigSAP commented 1 year ago

how did the user-defined entries get into the Big-IP configuration?

I can't find any field by the name of user-defined in F5PD/charts/secrets either. However, since one of the offending lines contains the cc_serverssl_profile profile, I searched for that as well. You can find it in the charts here, where it's the value of the profile_healthmonitor_tls key, which is templated into the Octavia configuration here (definition on Octavia side is here) and used in Octavia F5PD here to set the monitor clientTLS parameter of the AS3 Monitor class. The AS3 reference does not contain user-defined anywhere either.

are these configured by the provider driver under some conditions or does it happen locally on the Big-IP?

Therefore I conclude that it is something on the BigIP (possibly the cc_serverssl_profile profile) that introduces that weird user-defined thing. When looking on a BigIP (via GUI) I can see the monitor cc_https_monitor, which indeed has SSL Profile set to cc_serverssl_profile. I suppose that setting is reflected via the user-defined parameter in the config? Maybe only under certain circumstances? Maybe only for newer/specific versions or something?

notandy commented 1 year ago
  • how did the user-defined entries get into the Big-IP configuration?

Right, we had a similar issue in 2020. In fact, we didn't delete the monitor but changed it (because, for example, a user changed the type of the monitor).

Existing attributes that were no longer supported went into these user-defined options, so it was nothing we explicitly did in AS3; AS3 did it. I assume something like this is happening again. Should be easy to reproduce.

The driver itself doesn't know or use user-defined (also, AS3 has no option for that).

  • are these configured by the provider driver under some conditions or does it happen locally on the Big-IP?

locally on Big-IP

  • is there a way for a user to configure these settings (indirectly)

Nope.

m-kratochvil commented 1 year ago

Thank you both for taking the time to give me feedback on this. With your comments, I think it's becoming clear that the bug we saw back in 2020 has resurfaced. The http type monitor in my example above was indeed originally created as an https type monitor and later changed to http type by the customer. Hence the SSL settings (not applicable at all to monitor type http).

I will try once more to reproduce the issue. I also think it should be easy to do so, unless there are more specific conditions than just changing the monitor type while keeping the name.

In any case, I will take this back to F5 and see if we can get a permanent fix.

m-kratochvil commented 1 year ago

I finally found out how to reproduce the issue. It seems to happen only when a custom monitor per pool member is used. I could not reproduce it with a single pool-based monitor.

Here is how to reproduce:

  1. Create HTTPS type monitor for a pool
  2. Edit a specific pool member and add a custom monitor IP/port (or both, doesn't matter)
  3. Delete the pool based health monitor (because you want to change its type)
  4. Create a new HTTP type monitor for the pool (do not touch the poolmember-based monitor)

The new, pool based health monitor has correct configuration, but the custom, poolmember-based one, ends up with the user-defined config lines.
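
The four steps could be scripted roughly as follows. This is a sketch only, assuming the reproduction is driven through the OpenStack CLI (python-octaviaclient); the pool, member and monitor names, the monitor port and the delay/timeout/retry values are placeholders rather than values from the actual test. The function only builds the command lines. When executing them, the load balancer likely has to return to ACTIVE between steps.

```python
import shlex


def repro_commands(pool, member, hm_name="hm-repro", monitor_port=8080):
    """Build the OpenStack CLI calls for the four reproduction steps, in order."""

    def create(hm_type):
        # Placeholder timing values; any valid combination should do.
        return [
            "openstack", "loadbalancer", "healthmonitor", "create",
            "--name", hm_name, "--type", hm_type,
            "--delay", "5", "--timeout", "3", "--max-retries", "3",
            pool,
        ]

    return [
        # 1. HTTPS type monitor for the pool
        create("HTTPS"),
        # 2. custom monitor port on one pool member
        ["openstack", "loadbalancer", "member", "set",
         "--monitor-port", str(monitor_port), pool, member],
        # 3. delete the pool-based monitor
        ["openstack", "loadbalancer", "healthmonitor", "delete", hm_name],
        # 4. new HTTP type monitor with the same name
        create("HTTP"),
    ]


if __name__ == "__main__":
    for argv in repro_commands("example-pool", "example-member"):
        print(shlex.join(argv))
```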

The scenario exactly matches the customer setup where it happened in prod: https://dashboard.eu-de-2.cloud.sap/cis/clmam-eu-de-2-vlab/lbaas2/?r=/loadbalancers/dd30d753-71fa-414e-bc2a-bfd5f1dbabea/show?pool=569b5c2e-a6f2-4880-8693-5cef5d9241c9

As we saw back in 2020, this affects all types of monitors, not just HTTP/HTTPS.

The spot-fix is to re-create (delete/create) the custom monitor from the pool member, which removes the affected monitor object from the Big-IP and creates a new one with the correct configuration.

Assumption / question: In our implementation of custom, poolmember-based monitors, there is likely some sort of "inheritance" or "sync" of properties from the pool-based monitor. In cases where the pool-based monitor changes its type, the poolmember-based monitor changes its type as well, and that's where the user-defined properties are introduced. I need help here again from @BenjaminLudwigSAP and/or @notandy to understand how this inheritance/update mechanism works and whether there's something to change on our side.

The related TMSH outputs

  1. The original pool based monitor:

    ltm monitor https /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-pool-based> {
        adaptive disabled
        adaptive-limit 1000
        adaptive-sampling-timespan 180
        cipherlist DEFAULT
        defaults-from https
        destination *:*
        interval 5
        ip-dscp 0
        partition net_370b0255_eb12_41c4_bd39_587fa7012219
        recv "HTTP/1.(0|1) 200"
        recv-disable none
        send "GET / HTTP/1.0\r\n\r\n"
        ssl-profile /Common/cc_serverssl_profile
        time-until-up 0
        timeout 16
    }
  2. Original poolmember based monitor:

    ltm monitor https /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-poolmember-based> {
        adaptive disabled
        adaptive-limit 1000
        adaptive-sampling-timespan 180
        cipherlist DEFAULT
        defaults-from https
        destination 10.180.88.186%2011:irdmi
        interval 5
        ip-dscp 0
        partition net_370b0255_eb12_41c4_bd39_587fa7012219
        recv "HTTP/1.(0|1) 200"
        recv-disable none
        send "GET / HTTP/1.0\r\n\r\n"
        ssl-profile /Common/cc_serverssl_profile
        time-until-up 0
        timeout 16
    }
  3. The newly created pool based monitor:

    ltm monitor http /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-pool-based> {
        adaptive disabled
        adaptive-limit 1000
        adaptive-sampling-timespan 180
        defaults-from http
        destination *:*
        interval 5
        ip-dscp 0
        partition net_370b0255_eb12_41c4_bd39_587fa7012219
        recv "HTTP/1.(0|1) 200"
        recv-disable none
        send "GET / HTTP/1.0\r\n\r\n"
        time-until-up 0
        timeout 16
    }
  4. The newly created (automatically) poolmember-based monitor:

    ltm monitor http /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-poolmember-based> {
        adaptive disabled
        adaptive-limit 1000
        adaptive-sampling-timespan 180
        defaults-from http
        destination 10.180.88.186%2011:irdmi
        interval 5
        ip-dscp 0
        partition net_370b0255_eb12_41c4_bd39_587fa7012219
        recv "HTTP/1.(0|1) 200"
        recv-disable none
        send "GET / HTTP/1.0\r\n\r\n"
        time-until-up 0
        timeout 16
        user-defined CIPHERLIST DEFAULT
        user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
    }
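
Comparing outputs 2 and 4 shows that the two HTTPS-only attributes disappear as regular settings and come back as user-defined entries with the same values. A small sketch of that comparison, assuming flat stanzas as shown; `parse_stanza` and `leftover_report` are hypothetical helper names, and the stanzas are shortened stand-ins for the outputs above.

```python
def parse_stanza(text):
    """Split a flat tmsh monitor stanza into plain attributes and user-defined entries."""
    attrs, user_defined = {}, {}
    for line in text.strip().splitlines()[1:]:  # skip the "ltm monitor ... {" line
        line = line.strip()
        if not line or line == "}":
            continue
        key, _, value = line.partition(" ")
        if key == "user-defined":
            name, _, value = value.partition(" ")
            user_defined[name] = value
        else:
            attrs[key] = value
    return attrs, user_defined


def leftover_report(before, after):
    """Return (attributes dropped by the type change, user-defined entries afterwards)."""
    old_attrs, _ = parse_stanza(before)
    new_attrs, new_user = parse_stanza(after)
    dropped = {k: v for k, v in old_attrs.items() if k not in new_attrs}
    return dropped, new_user


BEFORE = """\
ltm monitor https /net_example/member_monitor {
    cipherlist DEFAULT
    defaults-from https
    interval 5
    ssl-profile /Common/cc_serverssl_profile
    timeout 16
}
"""

AFTER = """\
ltm monitor http /net_example/member_monitor {
    defaults-from http
    interval 5
    timeout 16
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}
"""

if __name__ == "__main__":
    dropped, user_defined = leftover_report(BEFORE, AFTER)
    print("dropped:", dropped)
    print("user-defined:", user_defined)
```

The dropped values and the user-defined values are identical, which fits the explanation that attributes no longer valid for the new monitor type are carried over instead of being removed.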

m-kratochvil commented 1 year ago

I tried extensively but could not reproduce the issue by using plain tmsh commands on the Big-IP. Next I will try with AS3 but outside of Openstack.

m-kratochvil commented 1 year ago

I have found an existing issue in the AS3 repo (issue 658) about this behavior, so I updated it with my findings.