Handle env section in deployment yaml

Changes made:

Added two new types:
- ConfigObject: contains the k8s Object, a flag for whether this entry is only "single-field" env entries, and a list of referenced keys (map of key-name to a ConfigField)
- ConfigField: contains metadata about that entry. At the moment this is only a flag for whether that field is optional.
When searching for "children"
- Searching for children now looks in the env section, returning two new maps (one for ConfigMaps and one for Secrets), of maps (the second key being the key-name in the ConfigMap/Secret being accessed). This code also handles the optional field (see below for more fun on optional).
- Read in the data from the ConfigMaps/Secrets referenced in the env section. Note that this only looks at single-field entries if we don't already pull in the entire ConfigMap/Secret. Given that requesting the ConfigMap/Secret is multi-threaded, I have passed in the metadata about being a single-field, and then passed it back in the "getResult" object. I was also a bit paranoid in the algorithm for combining results so that also handles both single-field and entire-object results, merging such that the entire-object wins. This did mean building up a map of results, then converting to an array when done.
- The "children" unit tests have been extended to catch as many new cases as I can think of.
- I didn't change handler.go as the general flow hasn't changed, but I added some extra tests. Note while adding tests, I think I found a bug. There are several tests that check that the hash has changed, but the test code does this by checking that it does not find the original hash in an annotation, and I think it currently looks in the wrong annotation (and as such, the test would never fail). For example, https://github.com/pusher/wave/blob/master/pkg/core/handler_test.go#L229 calls m.Eventually(deployment, timeout).ShouldNot(utils.WithAnnotations(HaveKeyWithValue(ConfigHashAnnotation, originalHash))), but I think it should call m.Eventually(deployment, timeout).ShouldNot(utils.WithPodTemplateAnnotations(HaveKeyWithValue(ConfigHashAnnotation, originalHash))). Please correct me if I've missed something!
- The hashing code includes the new single-field data and includes only those fields in the hash it builds up. Lots more tests added for the new behavour.

Testing done:

New unit tests added and run
Docker image built and tested in a minikube environment.
- With an explicitly requested ConfigMap entry in the env section, only that pod is redeployed when only that value changes.
- With an explicitly requested Secret entry in the env section, only that pod is redeployed when only that value changes.
- With an optional entry in an existing ConfigMap, when that entry is referenced in the env section, only that pod is redeployed when that value is first added or when it changes.
- With an optional entry in an existing Secret, when that entry is referenced in the env section, only that pod is redeployed when that value is first added or when it changes.
- Existing behaviour for references using volume and envFrom is maintained: any change to referenced ConfigMaps/Secrets triggers a redeploy of only those pods that reference them.
- An env section that contains two references to the same field in a ConfigMap/Secret but with different names successfully redeploys when that value changes (yep, we've got this in our system - two different squads with different names for the same configuration parameter!)
I've not done any soak-testing or any heavy load testing so I hopefully haven't introduced any memory leaks.

What is not in this PR:

I have not added anything that handles the optional flag on envFrom. In that case, you can say whether the entire ConfigMap is allowed to be missing or not, but technically I was only working on the env section :)
There is an edge case that I have not solved that relates to the optional flag on entries in the env section, partly because I'm not sure how. Image that the following are all true:
- I have an optional reference in an env section
- It is the only reference to a particular ConfigMap/Secret in my deployment YAML
- That entire ConfigMap/Secret does not exist
In this case, when I deploy, we attempt to look up the field and get an error (e.g. E0114 17:54:50.356820 1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="error fetching current children: error(s) encountered when geting children: ConfigMap \"config-optional\" not found" "Controller"="deployment-controller" "Request"={"Namespace":"default","Name":"charlie"}. But k8s thinks this is a valid deployment file, so honours the deployment and the pod starts up anyway, just without the missing config being defined.

If I then create that ConfigMap, Wave takes some time to notice. The problem is that it is relying on k8s backoff to "re-kick" the Wave handler after some timeout, at which point it successfully discovers the ConfigMap/Secret, and generates the hash. Ideally, it would notice instantly, as with any other config change, but since the ConfigMap/Secret did not exist we never got to register "this" deployment as an owner. Overal, it seems to work and "sort itself out in the end", but I'm just nervous about errors in the log but an apparently working system.

I've made the specific code changes suggested and committed them. For the discussion on the overall architecture of the change, I've tried to respond with my thoughts to each point below.

It's worth saying that for this PR, I tried to maintain the existing architecture that discovery and collection of ConfigMaps/Secrets happens at the point at which we scan the deployment (we "get" the whole ConfigMap/Secret and keep it in memory), and processing that collected data is done separately at a later stage (i.e. where the code calculates the hash value). This is why I record keys (and whether they're optional) in a new type, and then change the hashing code to handle the extra metadata. The fact that both children.go and hash.go have changed is due to trying to maintain that behaviour, and not necessarily because of handling the optional field.

My overriding thought when reading this is that the work done here on optional fields doesn't add any value and rather over-complicates the code. If a field is optional, we report an error when it doesn't exist, but actually, do we care? If the field isn't there, and we just set it to an empty string, the hash would change once the field exists again, therefore having the desired effect? Do you agree or have I missed something?

It's worth saying that I don't think the code reports an error when an optional field does not exist, in all cases. I think it only reports an error when an optional field does not exist because the entire ConfigMap/Secret does not exist.

The benefit I see to knowing if fields are optional is that we wouldn't throw an error if a configmap/secret doesn't exist but all the fields we want from within it are optional, which we are not currently handling (and I'm not suggesting we do within this PR, I'll create an issue to follow up on that)

I suspect that might be trivial to add; Just make getConfigMapWithKey()/getSecretWithKey() return an empty getResult when the object does not exist and the requested key is optional. I think getCurrentChildren() would then just ignore the empty getResult object. Any required keys would still result in an error.

If we remove the requirement to know if fields are optional or not, this whole PR could be massively simplified. getChildNamesByType could be rewritten to return two map[string]map[string]struct{} where the first map is the name of the configmap/secret and the second is the the fields from within the configmap/secret that are being requested.

That is close to what I originally had before adding the handling of optional values (with the caveat that the hashing code was also changed to handle both whole objects and specified keys within objects).

Then this set of fields can be passed into getObjectMap[WithKeys]? and the filtering of the fields could be done there instead, returning the object with a possible subset of the expected data fields. (this also means we would only perform a Get on each object once).

Then no changes are required to be made to hashing algorithm and I think the extra types you've added would no longer be needed either.

If the hash code is going to maintain a simple interface and only accept an array of one type of thing, I think that hashing code has to change either way: the current code expects to find an Object, so to maintain that, at the point I'm scanning the Deployment and pulling in the ConfigMaps/Secrets, for anything that's specifying an individual key (optional or not), would I have to create a fake Object so I can pass it in to the hashing code? Alternatively, we could process the ConfigMaps/Secrets when discovering them and stringify them all then, but would we then change the hashing code to accept the stringified data rather than an array of Objects/configObjects?

I'm thinking that either the hashing code changes to handle key-specific entries, or the ConfigMaps/Secrets discovery code changes to create "spoof" Objects. I chose the first as it seemed more explicit on what it was doing.

Coming back to this after a couple of days, I suspect/agree that you're right that the optional handling can be cleaner. I could, at the point of discovering ConfigMaps/Secrets (i.e. in getChildNamesByType()) also inspect the ConfigMap/Secret. If the field is missing, I decide whether to error or continue based on whether that field is optional. For required fields, I can error (as I think I should - you required a specific field and it isn't there), and for optional fields I silently do nothing (as we surmise we should - it's missing for whatever reason, but it's optional).

That then still gives us an array of configObjects to handle, where that list has been pruned such that there are no duplicates at the ConfigMap/Secret scope (if it's there it's either for the whole object or for a list of keys in that object, but never both).

The hash code would still need to change to accommodate specific keys.

Does my proposed simplification make sense? I think it's the same as yours but with the hash code changing slightly (is it a key, get the key and stringify it, if not stringify the whole thing).

I've thought about this a bit more over night and spotted a problem with the proposal to simplify, and to only record an optional field when it is present. Consider the following case:

The deployment has a reference to a key in a ConfigMap/Secret in the env section
That is the only reference to that specific ConfigMap/Secret in my deployment
That reference is optional
The value does not exist

When I call updateOwnerReferences() I must include the referenced ConfigMap/Secret so that I can attach as an owner and trigger recalculating the hash should the key then be added.

Given that updateOwnerReferences() must be passed a list of all ConfigMaps/Secrets that exist and that I am referencing, irrespective of whether specific fields exist (because they may exist in the future), I cannot filter out an entry when I encounter an optional and missing field.

I think that then goes back to your original idea that: if we record the data at the point of scanning, and for missing optional fields we set the data to an empty string then we can handle that case. But I still think that route involves changing the hash function, and either modifying or build-a-fake-ing an Object (and I'm not sure I'd understand what updateOwnerReference() does with that "fake" Object well enough to not get it wrong).

I keep re-reading this and thinking about the best approach and I think I'm coming around to where we have got with this already, or at least some middle ground between our two initial suggestions.

Do you think this would work:

If getChildNamesByType returned two map[string]configObject but where configObject was:
```
type configObject {
required bool
allKeys bool
keys map[string]struct{}
}
```

When parsing the deployment, getChildNamesByType could populate the keys map for any single fields referenced, if no subkeys are referenced it would mark allKeys true and if there is a non-optional field it would mark required true.

getConfigMap/getSecret could then use the required field to determine whether or not to throw an error if the referenced ConfigMap or Secret does not exist.
calculateConfigHash could then use the allKeys field to determine whether to filter the keys it adds to its hashSource and would just set any key referenced which doesn't exist to "" (since changing the value will trigger a rollout anyway, this should be fine). Importantly I don't think calculaterConfigHash needs to know about optional fields, but we do need to know whether to ignore a ConfigMap or Secret if it only contains optional fields which should be sorted by 2.

In this idea, we solve the problem of optional config objects that don't exist; all config objects that exist and are referenced (optionally or not) will get owner refs and the owner ref code won't be changed.

For required fields, I can error (as I think I should - you required a specific field and it isn't there), and for optional fields I silently do nothing (as we surmise we should - it's missing for whatever reason, but it's optional).

I suggest we don't throw an error if required fields don't exist, in fact, my suggestion doesn't track which fields are optional or not at all. The reason for this being that, if a field is required or not does not affect us calculating a hash, if the field becomes present at a later time then the hash will change and Wave's responsibility is still fulfilled. It is not Wave's responsibility to track whether fields are optional or not, it is the deployment controller's. Equally, if a field is non-optional, and doesn't exist, then the deployment will go into a bad state anyway, so whether Wave is adding a hash or not doesn't particularly matter.

I guess to summarise my point here is, required fields not being present won't prevent the hash changing when the configmaps or secrets change, so why does Wave care?

I suggest we don't throw an error if required fields don't exist, in fact, my suggestion doesn't track which fields are optional or not at all. ... It is not Wave's responsibility to track whether fields are optional or not, it is the deployment controller's.

Yes, this started to dawn on me halfway through reading point 3 :)

From an initial read through, I think your plan above works. The one thing I'm not clear of is the function interfaces at all points in the chain, as there is a description above for the interface for getChildNamesByType() but not for getCurrentChildren().

I think that I still need the list returned by getCurrentChildren() to contain the ConfigMap/Secret under k8sObject for each entry, so that updateOwnerReferences() / updateOwnerReference() can just update the owner and push it back at k8s core, and so that the hashing can just use the data we've already collected.

I was trying to think how to have both getCurrentChildren() and getChildNamesByType() use configObject, like maybe one builds the list from the deployment, and the other fills in the k8sObject but rapidly started worrying about concurrent map access.

Would something like this be okay?

In types.go:

// This is the main interface for what comes out of children.go, 
// such that owner_references.go and hash.go act on a list of these
type configObject struct {
  k8sObject Object
  required bool
  allKeys bool
  keys map[string]struct{}
}

In children.go:

// Interface for results from getConfigMap()/getSecret()
type getResult struct {
  err error
  obj Object
  required bool
  allKeys bool
  keys map[string]struct{}
}

// Interface for results from getChildNamesByType()
type configMetadata struct {
  required bool
  allKeys bool
  keys map[string]struct{}
}

func (h *Handler) getCurrentChildren(obj *appsv1.Deployment) ([]configObject, error) {
  // getChildNamesByType() to get two maps

  // loop through each map, calling getConfigMap()/getSecret() on a thread

  // Wait for each "get" call and add result to a list of configObject

  // return list of configObject
}

func getChildNamesByType(obj *appsv1.Deployment) (map[string]configMetadata, map[string]configMetadata) {
  // Walk through deployment creating two consolidated maps (keyed on ConfigMap/Secret name)
  // Each map lists the metadata for that ConfigMap/Secret in an instance of configMetadata.
}

// There is only one of these, so no "WithKeys" version any more.
func (h *Handler) getConfigMap(namespace, name string, required allKeys bool, keys map[string]struct{}) getResult {
  ...
}

// There is only one of these, so no "WithKeys" version any more.
func (h *Handler) getSecret(namespace, name string, required allKeys bool, keys map[string]struct{}) getResult {
  ...
}

owner_references.go stays as is in this PR.

hash.go still has the interface func calculateConfigHash(children []configObject) (string, error) but is a bit simpler, and does not return errors when handling optional fields.

I'd like to give this a day in the back of my brain, in case I spot anything else, but if not then are you okay if I start coding to the interfaces in this comment?

Yep, the above looks pretty good to me, just a couple of comments on it:

I think that I still need the list returned by getCurrentChildren() to contain the ConfigMap/Secret under k8sObject for each entry, so that updateOwnerReferences() / updateOwnerReference() can just update the owner and push it back at k8s core, and so that the hashing can just use the data we've already collected.

Yep agreed, the passing around will be similar to how it is currently implemented.

Any reason not to pass the configMetadata directly into getConfigMap and getSecret?

 // There is only one of these, so no "WithKeys" version any more.
func (h *Handler) getConfigMap(namespace, name string, meta configMetadata) getResult {
  ...
}

I really dislike the k8sObject field inside configObject, can we just call it object instead? (I've seen this done in other Kubernetes objects and interfaces like the unstructured.Unstructured for instance)

Thanks for working with me on this design btw, I appreciate your efforts

No worries; if the result is better code then I'm all for it!

Rolling in your latest comment, I'll go with this:

In types.go:

type configObject struct {
  object Object
  required bool
  allKeys bool
  keys map[string]struct{}
}

In children.go:

type getResult struct {
  err error
  obj Object
  required bool
  allKeys bool
  keys map[string]struct{}
}

// Interface for results from getChildNamesByType()
type configMetadata struct {
  required bool
  allKeys bool
  keys map[string]struct{}
}

func (h *Handler) getCurrentChildren(obj *appsv1.Deployment) ([]configObject, error) {
  // getChildNamesByType() to get two maps
  // loop through each map, calling getConfigMap()/getSecret() on a thread
  // Wait for each "get" call and add result to a list of configObject
  // return list of configObject
}

func getChildNamesByType(obj *appsv1.Deployment) (map[string]configMetadata, map[string]configMetadata) {
  // Walk through deployment creating two consolidated maps (keyed on ConfigMap/Secret name)
  // Each map lists the metadata for that ConfigMap/Secret in an instance of configMetadata.
}

func (h *Handler) getConfigMap(namespace, name string, metadata configMetadata) getResult {
  ...
}

func (h *Handler) getSecret(namespace, name string, metadata configMetadata) getResult {
  ...
}

owner_references.go stays as is in this PR.

hash.go still has the interface func calculateConfigHash(children []configObject) (string, error) but is a bit simpler, and does not return errors when handling optional fields.

Sounds great, thanks 👍

All done, I think.

I had to use pointers to configMetadata in the maps that getChildNamesByType() returns as I needed to update the required field as I scanned the deployment, and golang doesn't allow editing the contents of a struct in a map value.

I've built the image locally and run it through a few scenarios as well, testing in minikube, and it seems to work. I've not done any load testing though.

Code updated in line with review. I've done my usual manual testing in minikube and the one case that doesn't work is where a field in a ConfigMap is optional, and the whole map does not exist (so we cannot attach an owner). If I then create that ConfigMap, Wave does not trigger calculation of the hash, since the new ConfigMap has no owner. This feels like an edge case, and I'm not sure how to "fix" it without a complete rearchitecture of how Wave works...

wave-k8s / wave

Handle env section in deployment yaml #34