redhat-cop / keepalived-operator

An operator to manage VIPs backed by keepalived
Apache License 2.0
117 stars 36 forks source link

Enable interface autodiscovery based on a reachable IP #61

Closed tommasopozzetti closed 3 years ago

tommasopozzetti commented 3 years ago

Keepalived-operator assumes that all nodes in a given KeepalivedGroup will have the same networking configurations and the same interface names. This is not always true and it makes for a harder setup in non-homogeneous environments.

This PR introduces an extra optional field on the CRD called interfaceFromIP. If this field is specified, the interface field will be ignored and the interface name will instead be autodiscovered on each node based on which interface would be used to reach the specified IP.

This PR requires the keepalived-operator image to contain the ip, sed and awk binaries on the path.

This PR should address and fix #11

tommasopozzetti commented 3 years ago

@raffaelespazzoli Thank you for the review! I edited the docs to include this new feature. Regarding your questions: 1) The initContainer is not strictly needed, but it helps avoiding a race condition that can lead to the keepalived container restarting a few times. The issue is that, with this new design, upon startup the config file does not exist in the keepalived container and it will not exist until the config-reloader does its first cycle to overwrite the interfaces and copy the config file. Even with tweaking the notify.sh script to perform an update cycle right at the start, I wasn't able to avoid keepalived panicking due to the missing configuration file a couple of times until it eventually gets copied and keepalived is able to start. The issue does eventually solve itself in a few seconds, but I thought it would be more elegant to avoid the restarts to begin with. Therefore, the initContainer basically takes care of the first setup cycle before any other container runs, to ensure that keepalived will start with its configurations available. 2) The setup env variable might be a bit misleading in its name but it basically simply controls this behavior I just described, not really the interface overriding behavior. In the initContainer, the script needs to run the same cycle as the usual config-reloader to setup the configs and possibly override the interface name, but it only needs to do it once (no while true loop) and it does not need to send a signal at the end of the cycle since no other process is listening yet. Therefore, the setup env variable controls whether the script is launched in "setup" mode (one-shot without signal sending) or in the usual infinite loop mode (hence why it is true in the initContainer and false in the sidecar). 3) In case the IP in the CR changes, the template will need to be reevaluated and the pod spec will differ since the env variable reachip will change. Therefore, the keepalived pods will be updated with the new IP and, upon restart, they will have the correct interfaces based on the new IP.

tommasopozzetti commented 3 years ago

@raffaelespazzoli I should have included all the requested changes. Let me know if it looks good now Regarding the set -o nounset question, I don't believe it should interfere, since the function is only called after the set directive, it should still benefit from checking for undefined variables (and my tests seemed to confirm this). However, I am also a bit rusty on bash so I'm not 100% sure.

raffaelespazzoli commented 3 years ago

hang on, this is going to be tested in the next few days.

David-Igou commented 3 years ago

Sorry for the delay on a review @tommasopozzetti

I am testing your branch now and it appears all pods are hanging in init :

keepalivedgroup-test-5dfr6          0/3     Init:0/1    0          12m
keepalivedgroup-test-7mg8w          0/3     Init:0/1    0          12m
keepalivedgroup-test-master-4nbvp   0/3     Init:0/1    0          12m
keepalivedgroup-test-master-986gt   0/3     Init:0/1    0          12m
keepalivedgroup-test-master-lbxpc   0/3     Init:0/1    0          12m

I am attempting to test this feature and am unable to, here is my keepalivedgroup:

apiVersion: redhatcop.redhat.io/v1alpha1
kind: KeepalivedGroup
metadata:
  name: keepalivedgroup-test
spec:
  interface: foo
  interfaceFromIP: 10.0.140.4
  nodeSelector:
    node-role.kubernetes.io/worker: ""

In addition, I have another keepalivedgroup scheduled for another set of nodes, this group using a standard configuration:

apiVersion: redhatcop.redhat.io/v1alpha1
kind: KeepalivedGroup
metadata:
  name: keepalivedgroup-test-master
spec:
  interface: ens5
  nodeSelector:
    node-role.kubernetes.io/master: ""

I am unable to pull any logs from the initContainer. This is on Openshift 4.6, do you know what might be causing this behavior?

tommasopozzetti commented 3 years ago

@David-Igou thanks for testing this. I have only tested on Kubernetes, since I don't have access to an OpenShift cluster. Are you able to describe the pods to understand if the init container is at least starting or what the error is if not? If it is starting you should be able to pull logs from it by specifying the initContainer name

kubectl get logs -c config-setup <pod-name>

If the initContainer is not CrashLooping though I'm going to guess it's not even starting so getting the events should help understanding why that is

kubectl describe <pod-name>
David-Igou commented 3 years ago

@tommasopozzetti disregard, that was an oversight on my part when deploying the operator via OLM.

David-Igou commented 3 years ago

@tommasopozzetti Please add the changes needed to the Dockerfile for this PR to work

tommasopozzetti commented 3 years ago

@David-Igou I am not sure what binaries are already available in the base image used in this Dockerfile. I added all needed packages, let me know if any of them were already available and we can remove them. I have also pushed a fix to use grep rather than awk to parse the output of the ip command since I realized that if that the specified IP is not directly reachable the output is different and it would have broken the logic. Now it should work with any IP.

tommasopozzetti commented 3 years ago

@David-Igou disregard my previous comment. I was able to check the packages available in the base image and correct the Dockerfile to add the right ones.

David-Igou commented 3 years ago

LGTM

flybyray commented 3 years ago

@raffaelespazzoli I should have included all the requested changes. Let me know if it looks good now Regarding the set -o nounset question, I don't believe it should interfere, since the function is only called after the set directive, it should still benefit from checking for undefined variables (and my tests seemed to confirm this). However, I am also a bit rusty on bash so I'm not 100% sure.

I receive

/usr/local/bin/notify.sh: line 20: create_config_only: unbound variable

found on openshift in logs of container config-reloader

tommasopozzetti commented 3 years ago

@flybyray could you confirm if your keepalived pods' spec contain the environment variable create_config_only both in the initContainer and in the sidecar container (as seen here and here)?

flybyray commented 3 years ago

no those env variables were not available. in the end, we just deleted the keepalived operator and reinstalled the new version than everything worked. just the upgrade was not possible.