Closed tommasopozzetti closed 3 years ago
@raffaelespazzoli Thank you for the review!
I edited the docs to include this new feature.
Regarding your questions:
1) The initContainer is not strictly needed, but it helps avoiding a race condition that can lead to the keepalived container restarting a few times. The issue is that, with this new design, upon startup the config file does not exist in the keepalived container and it will not exist until the config-reloader does its first cycle to overwrite the interfaces and copy the config file. Even with tweaking the notify.sh script to perform an update cycle right at the start, I wasn't able to avoid keepalived panicking due to the missing configuration file a couple of times until it eventually gets copied and keepalived is able to start. The issue does eventually solve itself in a few seconds, but I thought it would be more elegant to avoid the restarts to begin with. Therefore, the initContainer basically takes care of the first setup cycle before any other container runs, to ensure that keepalived will start with its configurations available.
2) The setup
env variable might be a bit misleading in its name but it basically simply controls this behavior I just described, not really the interface overriding behavior. In the initContainer, the script needs to run the same cycle as the usual config-reloader to setup the configs and possibly override the interface name, but it only needs to do it once (no while true
loop) and it does not need to send a signal at the end of the cycle since no other process is listening yet. Therefore, the setup
env variable controls whether the script is launched in "setup" mode (one-shot without signal sending) or in the usual infinite loop mode (hence why it is true in the initContainer and false in the sidecar).
3) In case the IP in the CR changes, the template will need to be reevaluated and the pod spec will differ since the env variable reachip
will change. Therefore, the keepalived pods will be updated with the new IP and, upon restart, they will have the correct interfaces based on the new IP.
@raffaelespazzoli I should have included all the requested changes. Let me know if it looks good now
Regarding the set -o nounset
question, I don't believe it should interfere, since the function is only called after the set
directive, it should still benefit from checking for undefined variables (and my tests seemed to confirm this). However, I am also a bit rusty on bash so I'm not 100% sure.
hang on, this is going to be tested in the next few days.
Sorry for the delay on a review @tommasopozzetti
I am testing your branch now and it appears all pods are hanging in init
:
keepalivedgroup-test-5dfr6 0/3 Init:0/1 0 12m
keepalivedgroup-test-7mg8w 0/3 Init:0/1 0 12m
keepalivedgroup-test-master-4nbvp 0/3 Init:0/1 0 12m
keepalivedgroup-test-master-986gt 0/3 Init:0/1 0 12m
keepalivedgroup-test-master-lbxpc 0/3 Init:0/1 0 12m
I am attempting to test this feature and am unable to, here is my keepalivedgroup:
apiVersion: redhatcop.redhat.io/v1alpha1
kind: KeepalivedGroup
metadata:
name: keepalivedgroup-test
spec:
interface: foo
interfaceFromIP: 10.0.140.4
nodeSelector:
node-role.kubernetes.io/worker: ""
In addition, I have another keepalivedgroup scheduled for another set of nodes, this group using a standard configuration:
apiVersion: redhatcop.redhat.io/v1alpha1
kind: KeepalivedGroup
metadata:
name: keepalivedgroup-test-master
spec:
interface: ens5
nodeSelector:
node-role.kubernetes.io/master: ""
I am unable to pull any logs from the initContainer. This is on Openshift 4.6, do you know what might be causing this behavior?
@David-Igou thanks for testing this. I have only tested on Kubernetes, since I don't have access to an OpenShift cluster. Are you able to describe the pods to understand if the init container is at least starting or what the error is if not? If it is starting you should be able to pull logs from it by specifying the initContainer name
kubectl get logs -c config-setup <pod-name>
If the initContainer is not CrashLooping though I'm going to guess it's not even starting so getting the events should help understanding why that is
kubectl describe <pod-name>
@tommasopozzetti disregard, that was an oversight on my part when deploying the operator via OLM.
@tommasopozzetti Please add the changes needed to the Dockerfile for this PR to work
@David-Igou I am not sure what binaries are already available in the base image used in this Dockerfile. I added all needed packages, let me know if any of them were already available and we can remove them. I have also pushed a fix to use grep rather than awk to parse the output of the ip command since I realized that if that the specified IP is not directly reachable the output is different and it would have broken the logic. Now it should work with any IP.
@David-Igou disregard my previous comment. I was able to check the packages available in the base image and correct the Dockerfile to add the right ones.
LGTM
@raffaelespazzoli I should have included all the requested changes. Let me know if it looks good now Regarding the
set -o nounset
question, I don't believe it should interfere, since the function is only called after theset
directive, it should still benefit from checking for undefined variables (and my tests seemed to confirm this). However, I am also a bit rusty on bash so I'm not 100% sure.
I receive
/usr/local/bin/notify.sh: line 20: create_config_only: unbound variable
found on openshift in logs of container config-reloader
no those env variables were not available. in the end, we just deleted the keepalived operator and reinstalled the new version than everything worked. just the upgrade was not possible.
Keepalived-operator assumes that all nodes in a given KeepalivedGroup will have the same networking configurations and the same interface names. This is not always true and it makes for a harder setup in non-homogeneous environments.
This PR introduces an extra optional field on the CRD called
interfaceFromIP
. If this field is specified, theinterface
field will be ignored and the interface name will instead be autodiscovered on each node based on which interface would be used to reach the specified IP.This PR requires the keepalived-operator image to contain the
ip
,sed
andawk
binaries on the path.This PR should address and fix #11