Open m-kratochvil opened 1 year ago
The alert for failed sync does exist (defined HERE), but it is set to severity info, which is why it was missed. It was supposed to be raised to severity warning, but that was simply forgotten.
I will raise the severity accordingly.
I could not reproduce the issue in qa-de-1, based on the behavior described in the mentioned old bug ID 423304.
Test steps I tried:
Every attempt resulted in correct HM configuration on both devices.
However, I can’t simulate exactly what I saw in production, because neither the CLI nor the UI offers an option to configure the user-defined entries. This leads me to believe these entries must have been introduced inside AS3, either automatically, perhaps matching some condition, or explicitly, set by the provider driver.
Next I will investigate whether these user-defined options exist in AS3.
I searched through the provider driver, octavia and helm charts repositories and did not find any reference to the user-defined entries mentioned above. I also searched the F5 TMOS and AS3 documentation and found nothing except an older bug that bears a distant similarity to the issue described here.
I will take this back to F5 stating these config entries were not pushed into the Big-IP configuration by us and must have been "injected" by the Big-IP itself. But before I do so @BenjaminLudwigSAP can you please check yourself to confirm my findings?
how did the user-defined entries get into the Big-IP configuration?
I can't find any field by the name of user-defined in F5PD/charts/secrets either.
However, since one of the offending lines contains the cc_serverssl_profile profile, I searched for that as well. You can find it in the charts here, where it's the value of the profile_healthmonitor_tls key, which is templated into the Octavia configuration here (definition on Octavia side is here) and used in Octavia F5PD here to set the monitor clientTLS parameter of the AS3 Monitor class.
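To make that chain concrete, here is a minimal sketch of how a configured profile name could end up as the clientTLS property of an AS3 Monitor. It is not the F5PD code: the function name, the constant, and the rule that only https monitors receive the profile are assumptions for illustration. The class, monitorType and clientTLS keys and the "bigip" pointer form follow the AS3 schema as I understand it.

```python
import json
from typing import Optional

# Hypothetical stand-in for the value of the profile_healthmonitor_tls key.
PROFILE_HEALTHMONITOR_TLS = "cc_serverssl_profile"


def build_monitor(monitor_type: str, tls_profile: Optional[str] = None) -> dict:
    """Build a minimal AS3 Monitor fragment (illustrative, not the driver's code)."""
    monitor = {
        "class": "Monitor",
        "monitorType": monitor_type,
        "interval": 5,
        "timeout": 16,
    }
    # Assumption: the TLS profile is only attached to https monitors.
    # {"bigip": path} references an object that already exists on the device.
    if monitor_type == "https" and tls_profile:
        monitor["clientTLS"] = {"bigip": tls_profile}
    return monitor


https_monitor = build_monitor("https", f"/Common/{PROFILE_HEALTHMONITOR_TLS}")
http_monitor = build_monitor("http", f"/Common/{PROFILE_HEALTHMONITOR_TLS}")
print(json.dumps(https_monitor, indent=2))
```

Nothing in such a declaration names a user-defined field, which is consistent with the search result above.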
The AS3 reference does not contain user-defined anywhere either.
are these configured by the provider driver under some conditions or does it happen locally on the Big-IP?
Therefore I conclude that it is something on the BigIP (possibly the cc_serverssl_profile profile) that introduces that weird user-defined thing. When looking at a BigIP (via GUI) I can see the monitor cc_https_monitor, which indeed has SSL Profile set to cc_serverssl_profile. I suppose that setting is reflected via the user-defined parameter in the config? Maybe only under certain circumstances? Maybe only for newer/specific versions or something?
- how did the user-defined entries get into the Big-IP configuration?
Right, we had a similar issue in 2020. In fact, we didn't delete the monitor but changed it (because, for example, a user changed the type of the monitor).
Existing attributes that were no longer supported went into these user-defined options, so it was nothing we explicitly did in AS3, but something AS3 did itself. I assume something like this is happening again. Should be easy to reproduce.
The driver itself doesn't know about or use user-defined (also, AS3 has no option for that).
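The behavior described here can be illustrated with a toy model. This is only a sketch of the observed effect, not AS3 or TMOS code: the attribute sets are a small assumed subset, the function name is hypothetical, and the CIPHERLIST and SSL_PROFILE_NAME names are taken from the affected configuration reported in this issue.

```python
# Attributes each monitor type accepts natively (assumed subset, for illustration).
SUPPORTED = {
    "https": {"cipherlist", "ssl-profile", "recv", "send", "interval", "timeout"},
    "http": {"recv", "send", "interval", "timeout"},
}

# Names seen in the affected configuration for the leftover https attributes.
USER_DEFINED_NAMES = {"cipherlist": "CIPHERLIST", "ssl-profile": "SSL_PROFILE_NAME"}


def change_type_in_place(attrs: dict, new_type: str) -> dict:
    """Toy model: keep attributes the new type supports, park the rest as user-defined."""
    supported = SUPPORTED[new_type]
    kept = {k: v for k, v in attrs.items() if k in supported}
    leftover = {
        USER_DEFINED_NAMES.get(k, k.upper()): v
        for k, v in attrs.items()
        if k not in supported
    }
    return {"type": new_type, "attrs": kept, "user-defined": leftover}


old_https = {
    "cipherlist": "DEFAULT",
    "ssl-profile": "/Common/cc_serverssl_profile",
    "interval": "5",
    "timeout": "16",
}
result = change_type_in_place(old_https, "http")
for name, value in result["user-defined"].items():
    print(f"user-defined {name} {value}")
```

In this model, deleting the monitor and creating it fresh as http would start from an empty attribute set, so nothing would be left over to park.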
- are these configured by the provider driver under some conditions or does it happen locally on the Big-IP?
locally on Big-IP
- is there a way for a user to configure these settings (indirectly)
Nope.
Thank you both for taking the time to give me feedback on this. With your comments, I think it's becoming clear that the bug we saw back in 2020 has resurfaced. The http type monitor in my example above was indeed originally created as an https type monitor and later changed to http type by the customer. Hence the SSL settings (which are not applicable at all to monitor type http).
I will give it another try to reproduce the issue. I also think it should be easy to do so, unless there are more specific conditions than just changing the monitor type while keeping the name.
In any case, I will take this back to F5 and see if we can get a permanent fix.
I finally found how to reproduce the issue. It seems to happen only when a custom monitor per pool member is used. I could not reproduce it with a single, pool-based monitor.
Here is how to reproduce:
The new, pool-based health monitor has the correct configuration, but the custom, poolmember-based one ends up with the user-defined config lines.
The scenario exactly matches the customer setup where it happened in prod: https://dashboard.eu-de-2.cloud.sap/cis/clmam-eu-de-2-vlab/lbaas2/?r=/loadbalancers/dd30d753-71fa-414e-bc2a-bfd5f1dbabea/show?pool=569b5c2e-a6f2-4880-8693-5cef5d9241c9
As we saw back in 2020, this affects all types of monitors, not just HTTP/HTTPS.
The spot-fix is to re-create (delete/create) the custom monitor from the pool member, which removes the affected monitor object from the Big-IP and creates a new one with the correct configuration.
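To find monitors that may need this spot-fix, a saved monitor listing could be scanned for user-defined lines. The sketch below is a minimal illustration, assuming the listing is available as plain text in the tmsh stanza format; the function name and the sample monitor names are hypothetical, and nested braces inside a stanza are not handled.

```python
import re
from typing import Dict, List, Optional

# Matches the opening line of a monitor stanza: "ltm monitor <type> <name> {"
STANZA_START = re.compile(r"^ltm monitor (\S+) (\S+) \{\s*$")


def find_user_defined(config_text: str) -> Dict[str, List[str]]:
    """Map monitor name to its user-defined lines, for monitors that have any."""
    hits: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in config_text.splitlines():
        line = raw.strip()
        match = STANZA_START.match(line)
        if match:
            current = match.group(2)
        elif line == "}":
            current = None
        elif current and line.startswith("user-defined "):
            hits.setdefault(current, []).append(line)
    return hits


# Hypothetical, shortened sample in the same format as the tmsh output.
SAMPLE = """\
ltm monitor http /net_example/pool-based {
    defaults-from http
    timeout 16
}
ltm monitor http /net_example/poolmember-based {
    defaults-from http
    timeout 16
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}
"""

affected = find_user_defined(SAMPLE)
for name, lines in affected.items():
    print(name, lines)
```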
Assumption - question:
In our implementation of custom, poolmember-based monitors, there is likely some sort of "inheritance" or "sync" of properties from the pool-based monitor. In cases where the pool-based monitor changes its type, the poolmember-based monitor changes its type as well, and that's where the user-defined properties are introduced. I need help here again from @BenjaminLudwigSAP and/or @notandy to understand how this inheritance/update mechanism works and whether there's something on our side to change.
The related TMSH outputs
The original pool based monitor:
ltm monitor https /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-pool-based> {
    adaptive disabled
    adaptive-limit 1000
    adaptive-sampling-timespan 180
    cipherlist DEFAULT
    defaults-from https
    destination *:*
    interval 5
    ip-dscp 0
    partition net_370b0255_eb12_41c4_bd39_587fa7012219
    recv "HTTP/1.(0|1) 200"
    recv-disable none
    send "GET / HTTP/1.0\r\n\r\n"
    ssl-profile /Common/cc_serverssl_profile
    time-until-up 0
    timeout 16
}
Original poolmember based monitor:
ltm monitor https /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-poolmember-based> {
    adaptive disabled
    adaptive-limit 1000
    adaptive-sampling-timespan 180
    cipherlist DEFAULT
    defaults-from https
    destination 10.180.88.186%2011:irdmi
    interval 5
    ip-dscp 0
    partition net_370b0255_eb12_41c4_bd39_587fa7012219
    recv "HTTP/1.(0|1) 200"
    recv-disable none
    send "GET / HTTP/1.0\r\n\r\n"
    ssl-profile /Common/cc_serverssl_profile
    time-until-up 0
    timeout 16
}
The newly created pool based monitor:
ltm monitor http /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-pool-based> {
    adaptive disabled
    adaptive-limit 1000
    adaptive-sampling-timespan 180
    defaults-from http
    destination *:*
    interval 5
    ip-dscp 0
    partition net_370b0255_eb12_41c4_bd39_587fa7012219
    recv "HTTP/1.(0|1) 200"
    recv-disable none
    send "GET / HTTP/1.0\r\n\r\n"
    time-until-up 0
    timeout 16
}
The newly created (automatically) poolmember-based monitor:
ltm monitor http /net_370b0255_eb12_41c4_bd39_587fa7012219/<example-poolmember-based> {
    adaptive disabled
    adaptive-limit 1000
    adaptive-sampling-timespan 180
    defaults-from http
    destination 10.180.88.186%2011:irdmi
    interval 5
    ip-dscp 0
    partition net_370b0255_eb12_41c4_bd39_587fa7012219
    recv "HTTP/1.(0|1) 200"
    recv-disable none
    send "GET / HTTP/1.0\r\n\r\n"
    time-until-up 0
    timeout 16
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}
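The difference between the two newly created monitors is easier to see as a property diff. The following is a rough sketch, assuming each stanza is available as text; the helper names are hypothetical and the sample stanzas are shortened versions of the outputs above, with placeholder monitor names.

```python
def parse_stanza(text: str) -> dict:
    """Parse one tmsh monitor stanza into {property: value}."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the opening/closing lines of the stanza.
        if not line or line.endswith("{") or line == "}":
            continue
        key, _, rest = line.partition(" ")
        if key == "user-defined":
            # These lines carry their own name: "user-defined NAME VALUE".
            name, _, value = rest.partition(" ")
            props[f"user-defined {name}"] = value
        else:
            props[key] = rest
    return props


def diff_props(a: dict, b: dict) -> dict:
    """Return {property: (value_in_a, value_in_b)} for properties that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


POOL_BASED = """\
ltm monitor http /net_example/pool-based {
    defaults-from http
    destination *:*
    timeout 16
}
"""

POOLMEMBER_BASED = """\
ltm monitor http /net_example/poolmember-based {
    defaults-from http
    destination 10.180.88.186%2011:irdmi
    timeout 16
    user-defined CIPHERLIST DEFAULT
    user-defined SSL_PROFILE_NAME /Common/cc_serverssl_profile
}
"""

changes = diff_props(parse_stanza(POOL_BASED), parse_stanza(POOLMEMBER_BASED))
for prop, (pool_value, member_value) in changes.items():
    print(f"{prop}: {pool_value!r} -> {member_value!r}")
```

Apart from the expected destination difference, only the two user-defined properties show up in the diff.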
I tried extensively but could not reproduce the issue by using plain tmsh commands on the Big-IP. Next I will try with AS3 but outside of OpenStack.
I have found an existing issue in the AS3 repo (issue 658) about this behavior, so I updated it with my findings.
When a health monitor of type http is configured (in a Big-IP TMOS) with user-defined options related to TLS, it causes a config sync failure when a standby Big-IP tries to load and parse the config from the active unit.
Example problematic configuration:
These are the offending lines:
The above example was present in eu-de-2 on the active Big-IP eu-de-2-lb014a-02, loadbalancer dd30d753-71fa-414e-bc2a-bfd5f1dbabea.
We had a very similar issue back in 2020, which was troubleshot in F5 case #C3296266. It was found that these config lines appear when a health monitor of a certain type is created, then deleted, and then created again with the same name but a different type, and if there is no config sync after the delete action. F5 filed a bug ID https://cdn.f5.com/product/bugtracker/ID423304.html
I opened a new case #00409954 for the current issue and will try to reproduce it following the description in the above bug.
Open questions:
- how did the user-defined entries get into the Big-IP configuration?
Follow up tasks: