project-akri / akri

A Kubernetes Resource Interface for the Edge
https://docs.akri.sh/
Apache License 2.0

Consider implementing a validator for Configurations #180

Closed DazWilkin closed 3 years ago

DazWilkin commented 3 years ago

Is your feature request related to a problem? Please describe.

kubectl validates (some?) specs when these are applied.

Configurations (the Akri CRD) are not validated, which allows user errors and other mistakes to slip through.

Is your feature request related to a way you would like Akri extended? Please describe.

During development of Zeroconf, I inadvertently broke the YAML spec for the Broker:

apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: zeroconf
spec:
  protocol:
    zeroconf:
      kind: "_rust._tcp"
      port: 8888
      txtRecords:
        project: akri
        protocol: zeroconf
        component: avahi-publish
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:993e5b8d....
    resources: <----------------------------- INCORRECTLY INDENTED AFTER EDIT
              limits: 
                "{{PLACEHOLDER}}": "1"

The result was that, while Brokers were deployed, environment variables were not mounted correctly in the Pod.

It was only after lengthy debugging that we uncovered the error:

kubectl get pod/akri-zeroconf-9e1938-pod --output=yaml

spec:
  containers:
  - image: ghcr.io/dazwilkin/zeroconf-broker@sha256:993e5b8d...
    imagePullPolicy: IfNotPresent
    name: zeroconf-broker
    resources: {}

This might (!?) have been mitigated had the Configuration been validated upon apply.

Describe the solution you'd like

kubectl apply --filename=./some-akri-config.yaml

Would yield e.g.:

error: error parsing ./some-akri-config.yaml: line XX: something wrong described

Describe alternatives you've considered

Being less stupid ;-)

This is the convention with Kubernetes and would be a useful addition.

Additional context

See: https://github.com/deislabs/akri/issues/177

DazWilkin commented 3 years ago

I've since learned that CRDs should be validated by dint of the OpenAPI spec...

Hmmm...In my example, I'd inadvertently added .spec.brokerPodSpec.resources instead of .spec.brokerPodSpec.containers[0].resources so it's unclear why this passed muster.

bfjelds commented 3 years ago

I've since learned that CRDs should be validated by dint of the OpenAPI spec...

Hmmm...In my example, I'd inadvertently added .spec.brokerPodSpec.resources instead of .spec.brokerPodSpec.containers[0].resources so it's unclear why this passed muster.

Our Configuration CRD has some validation bits in it .. but no validation for the pod or service specs. Hopefully, someday Kubernetes will make it easier to embed (with validation) existing resources.

In the meantime, maybe we need to do something to help.
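
For context, CRD validation lives inline in the CRD manifest as an OpenAPI v3 schema. A hypothetical, trimmed sketch (not Akri's actual CRD) shows why an embedded PodSpec goes unchecked: the schema typically leaves that subtree open, so misplaced keys pass.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: configurations.akri.sh
spec:
  group: akri.sh
  names:
    kind: Configuration
    plural: configurations
  scope: Namespaced
  versions:
    - name: v0
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                capacity:
                  type: integer
                # No published schema for an embedded PodSpec, so the
                # subtree is left open; nested mistakes pass validation.
                brokerPodSpec:
                  type: object
                  x-kubernetes-preserve-unknown-fields: true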

DazWilkin commented 3 years ago

Aha!

I assumed the references were validated recursively.

I'm not particularly familiar with CRD development.

DazWilkin commented 3 years ago

Had a play (!) with this today:

https://github.com/DazWilkin/kubectl-akri

The concept is a kubectl plugin that would permit e.g. kubectl akri apply --filename=./something.yaml and surface errors.

But, I think it's not going to be appropriate as it's turtles all the way down; I'd need to implement the entire schema tree below e.g. PodSpec.

DazWilkin commented 3 years ago

As I drove to the dog camp, I realized that it may be sufficient after all.

I can simply validate the core Container spec and assume that e.g. PodSpec is valid. Perhaps an invalid PodSpec would be caught?

In my case, I'd inadvertently moved resources from ... container into the Configuration root, and this type of error (incorrect Configuration entries) would be caught.

bfjelds commented 3 years ago

An approach that would help (but would be tied to an API version) would be to add validation to the CRD for the pod and service specs. Prometheus worked on a similar issue: https://github.com/kubernetes/kubernetes/issues/54579#issuecomment-370372942. Perhaps we could follow their lead until Kubernetes simplifies this.

Note: I also see someone suggesting using a ValidatingAdmissionWebhook ... we could add this to the controller maybe? https://stackoverflow.com/questions/56044319/validating-a-crd-with-an-embedded-core-v1-podspec

DazWilkin commented 3 years ago

The Webhook appears (!) trivial.

Do you support my giving that a whirl?

bfjelds commented 3 years ago

The Webhook appears (!) trivial.

Do you support my giving that a whirl?

Definitely! Validation would add a lot of value!!

DazWilkin commented 3 years ago

I spent some time learning more about Kubernetes webhooks today and the work is non-trivial.

I'll give it a little more time today, have another commitment tomorrow and will report back Wednesday.

DazWilkin commented 3 years ago

Oh boy! :smiley:

K8s webhooks are poorly documented and involved but I've made progress.

I can publish a ValidatingWebhook and it gets invoked. There's a bunch of infrastructural work to get this far.

I'm slightly perplexed: I was expecting to be able to filter the webhook to e.g. apiVersion: akri.sh/v0 mutations but, when kubectl apply...'ing an Akri Configuration, I'm having to use a super-wide filter (basically *) to get it to work. Will spend more time on it later this week.

Once I can scope the webhook down to these, I must then write the validator for an Akri Configuration (but that should be the simplest part).

It's here: https://github.com/DazWilkin/akri-webhook
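
For reference, the filter lives in the webhook's registration object; scoping it to Akri Configurations looks roughly like the rules stanza below (webhook name, service name, and path are hypothetical).

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: akri-validation
webhooks:
  - name: validation.akri.sh
    clientConfig:
      service:
        name: akri-webhook
        namespace: default
        path: /validate
      caBundle: "<base64-encoded CA certificate>"
    rules:
      - apiGroups: ["akri.sh"]
        apiVersions: ["v0"]
        operations: ["CREATE", "UPDATE"]
        resources: ["configurations"]
    admissionReviewVersions: ["v1"]
    sideEffects: None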

DazWilkin commented 3 years ago

OK, I got it working.... boy-oh-boy.... ValidatingWebhooks need better documentation and a blog post.

more ./zeroconf.yaml 
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: zeroconf
spec:
  protocol:
    zeroconf:
      kind: "_rust._tcp"
      port: 8888
      txtRecords:
        project: akri
        protocol: zeroconf
        component: avahi-publish
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
    resources: <-------------------- MISPLACED
          limits:
            "{{PLACEHOLDER}}": "1"

kubectl apply --filename=./zeroconf.yaml 
Error from server: error when creating "./zeroconf.yaml": admission webhook "wednesday.akri.sh" denied the request: Configuration does not include `resources.limits`

NOTE The Error is being thrown by the webhook because the Configuration is invalid.

NOTE The invalid part of the Configuration is in the PodSpec (!)

One oversight|consideration is that the incorrect (!) Configuration above is valid YAML and so, even after writing the webhook and getting it to work, the invalid Configuration passed.

What I've had to do is add manual checking for e.g. the existence of .spec.brokerPodSpec.containers[*].resources.limits

I think an alternative would be to write a custom parser that invalidates additional (!) entries rather than permitting them.
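
In Rust (where the webhook is later rewritten), serde offers this rejection out of the box via deny_unknown_fields; a minimal sketch, with Resources as a hypothetical stand-in for one branch of the Configuration tree:

use serde::Deserialize;
use std::collections::HashMap;

// Reject keys the struct doesn't declare instead of silently dropping them.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct Resources {
    limits: HashMap<String, String>,
}

fn main() {
    // `extra` is not a declared field, so parsing fails outright.
    let input = r#"{ "limits": { "{{PLACEHOLDER}}": "1" }, "extra": 1 }"#;
    let parsed: Result<Resources, _> = serde_json::from_str(input);
    assert!(parsed.is_err());
    println!("{:?}", parsed.err());
}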

bfjelds commented 3 years ago

Awesome...regarding the validation, I was mulling that over when I first read about the webhook. I wondered if we couldn't deserialize and serialize the input yaml, then use the original yaml to construct some kind of jsonpath-like verification queries. As I type it though, that sounds complex.

DazWilkin commented 3 years ago

That's not far from what I have.

I'm currently unmarshaling it into a struct and have code to validate, but using an existing schema would be better.

I wrote the handler in Golang but it should be straightforward to rewrite in Rust. I'm more familiar with Go, the examples that exist are in Go and all the types are, by definition, available in Go.

bfjelds commented 3 years ago

I would love to take a look, but I got a 404 error when trying out the link to your repo. Is it public?

DazWilkin commented 3 years ago

Apologies! It's now public

DazWilkin commented 3 years ago

OK. It works. See testing

It currently only checks for {.spec.brokerPodSpec.containers[*].resources.limits} and this should possibly be {.spec.brokerPodSpec.containers[*].resources.limits.\{\{PLACEHOLDER\}\}} and then the result cast to an int.

More importantly, there should be other tests for validity but I'm not entirely sure what they should all be.

I'd like to make the Webhook dynamic on a set of (flagged) templates. See: https://github.com/DazWilkin/akri-webhook/issues/4

One outstanding challenge is that JSONPath supports filters, but it's unclear to me how to turn these into boolean functions.

In the above examples, I'd like: {IsInt(.spec.brokerPodSpec.containers[*].resources.limits.\{\{PLACEHOLDER\}\})}

Then, each filter could be a Boolean function and the Webhook could find the intersection of a set of such functions to determine whether the filters succeed or fail: test-1 && test-2 && .... && test-n
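
A minimal sketch of that boolean-function composition, using serde_json's JSON Pointer support in place of JSONPath (no wildcards, so only the first container is checked; names are illustrative):

use serde_json::Value;

// One boolean function per rule; the webhook ANDs them together.
fn limits_is_set(config: &Value) -> bool {
    config
        .pointer("/spec/brokerPodSpec/containers/0/resources/limits")
        .map_or(false, |v| v.is_object())
}

fn main() {
    let config: Value = serde_json::from_str(
        r#"{"spec":{"brokerPodSpec":{"containers":[{"resources":{"limits":{"{{PLACEHOLDER}}":"1"}}}]}}}"#,
    )
    .unwrap();

    // test-1 && test-2 && ... && test-n
    let checks = [("limits-is-set", limits_is_set(&config))];
    let valid = checks.iter().all(|&(_, ok)| ok);
    println!("valid: {} ({:?})", valid, checks);
}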

bfjelds commented 3 years ago

that is super! rather than deciding what should be set and shouldn't (though that is a good idea!!), i was thinking something more dynamic ... just check what the user is trying to add. i tried a real general approach here: https://play.golang.org/p/10OniB-QR5u. what do you think about that?

also, is there a way to host this in the existing controller?

DazWilkin commented 3 years ago

I explained myself poorly; set-not-set was a poor attempt to explain what you've achieved in your example.

In my code, the additional extra: 1 (as JSON) would (I think) parse because the JSON parser accepts the valid JSON and drops the superfluous extra.

Somehow (I don't yet understand your code) the back-and-forth addresses this, which is clever!

I'll add your approach.

Would you like to merge this into the controller code? Or do you mean combine with the controller Deployment?

I'm going to try to fix what I think is another issue with the WebThings Rust SDK today and then blog about Kubernetes webhooks, because the documentation is woeful and it will be good for me to explain it (to myself if no-one else).

...But I'll be back on this tomorrow.

bfjelds commented 3 years ago

cool! the back and forth should allow extra: 1 to be ignored on the deserialization; serializing the resulting struct should apply any valid settings. then, using the original yaml as a "map" to walk the deserialized-then-serialized tree should allow us to pick up any missing settings (extra: 1), mistyped (if we want) values, and unexpectedly set values.

my ideal would be to have this be rust code that existed in the controller binary ... we could add the service definition into the helm deployment code for the controller.

bfjelds commented 3 years ago

I truly love today's languages and their playgrounds ... here's a quick and dirty rust version: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f11cc741f40e72b6a40f9fa29a319783
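
In case the playground link rots, the round-trip ("twice-baked") idea is roughly the following: parse the raw JSON both as an untyped Value and as the typed struct, re-serialize the struct, then walk the original tree checking that every key survived. The Config type here is a hypothetical stand-in, not Akri's real struct.

use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Debug, Deserialize, Serialize)]
struct Config {
    capacity: i32,
}

// Walk the user's original JSON; keys the typed struct silently
// dropped are missing from `reparsed` and show up as mismatches.
fn check(input: &Value, reparsed: &Value) -> Result<(), String> {
    match input {
        Value::Object(map) => {
            for (key, value) in map {
                check(value, &reparsed[key.as_str()])
                    .map_err(|e| format!("input key ({}) not parsed correctly: {}", key, e))?;
            }
            Ok(())
        }
        Value::Array(items) => {
            for (i, value) in items.iter().enumerate() {
                check(value, &reparsed[i])?;
            }
            Ok(())
        }
        v if v == reparsed => Ok(()),
        v => Err(format!("input ({:?}) != parsed ({:?})", v, reparsed)),
    }
}

fn main() {
    let raw = r#"{ "capacity": 1, "extra": 1 }"#;
    let original: Value = serde_json::from_str(raw).unwrap();
    let typed: Config = serde_json::from_str(raw).unwrap(); // drops "extra"
    let reparsed = serde_json::to_value(&typed).unwrap();
    println!("{:?}", check(&original, &reparsed)); // Err: "extra" went missing
}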

DazWilkin commented 3 years ago

I agree; they're awesome tools! Talking of which, I assume you're all using the alternative rust-analyzer.

My plan today is to incorporate the YAML parser into the existing Golang handler.

Last night I was Googling around k8s_openapi. IIUC the AdmissionReview and related types aren't currently exposed through it, so I need to get my head around rust-swagger and the swagger files referenced in that issue to generate the structs to wrap your code (thank you very much; that saved me a bunch of time) into a Rust-based handler.

Yes, to merging this into the controller binary and incorporating into the Helm chart.

DazWilkin commented 3 years ago

OK, a much improved (thank you!) version of the Golang webhook now uses the @bfjelds "twice-baked" check

I'm continuing to use JSON rather than YAML.

Now, my bad YAML example results in:

Error from server: error when creating "./zeroconf.webthings.yaml":
admission webhook "wednesday.yirella.svc" denied the request: Input index (spec) is not parsed correctly:
Input index (brokerPodSpec) is not parsed correctly:
Input index (resources) is not parsed correctly:
Input (map[limits:map[{{PLACEHOLDER}}:1]]) is not consistent with expected value

I'm missing something obvious though. Having recreated (!) the akri.sh/v0 Configuration schema :smiley: in Golang, I have:

package main

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Configuration struct {
    APIVersion string             `json:"apiVersion"`
    Kind       string             `json:"kind"`
    Metadata   *metav1.ObjectMeta `json:"metadata"`
    Spec       Spec               `json:"spec"`
}
type Spec struct {
    BrokerPodSpec            *v1.PodSpec            `json:"brokerPodSpec,omitempty"`
    Capacity                 int                    `json:"capacity"`
    ConfigurationServiceSpec *v1.ServiceSpec        `json:"configurationServiceSpec,omitempty"`
    InstanceServiceSpec      *v1.ServiceSpec        `json:"instanceServiceSpec,omitempty"`
    Properties               map[string]string      `json:"properties,omitempty"`
    Protocol                 map[string]interface{} `json:"protocol"`
    Units                    string                 `json:"units"`
}

I used the Rust Configuration as the guide.

Q: Should these types be defined by a Swagger spec?

I see that there are additional constraints on the ServiceSpecs of at most 1 per...; these constraints aren't currently being enforced through the YAML.

DazWilkin commented 3 years ago

Making good progress with the Rust variant:

https://github.com/DazWilkin/akri-webhook-rust

It's still a webhook but it's almost in Rust :smile:

Will file an issue tomorrow about externalizing types to make them more easily accessible from other projects.

DazWilkin commented 3 years ago

Such a doofus; I spent an hour banging my head against the wall trying to understand why serde was unwilling to unmarshal my test configuration.yaml into an Akri Configuration.

Eventually, I realized that I need to use KubeAkriConfig :smile:

So, it's working. Building a container image and will test it.

DazWilkin commented 3 years ago

Frustrating day... I broke the Golang implementation, I think by upgrading from v1beta1 to v1.

The Rust implementation balks on creationTimestamp and resources and I'm unsure why.

If I exclude them, the tests pass:

serde_json::Value::Object(o) => {
    for (key, value) in o {
        println!("[check] key: {}", key);
        if key == "creationTimestamp" {
            println!("[check] creationTimestamp deserialized: {:?}", deserialized);
            return Ok(());
        }
        if key == "resources" {
            println!("[check] resources deserialized: {:?}", deserialized);
            return Ok(());
        }
        // Recurse into the corresponding key of the round-tripped value.
        if let Err(e) = check(&value, &deserialized[key]) {
            return Err(format!(
                "input key ({:?}) not equal to parsed: ({:?})",
                key, e
            ).into());
        }
    }
    Ok(())
}

In both cases, the deserialized object does not contain the entities:

The missing creationTimestamp:

Object({
        "annotations": 
            Object({
                    "kubectl.kubernetes.io/last-applied-configuration": 
                        String("{ ... }"  )
            }), 
        "generation": Number(1.0),
        "name": String("webthings"),
        "namespace": String("default"),
        "uid": String("b61d7604-97f0-46a3-adb2-adc0dedfb0c9")
})

The missing resources:

Object({
    "containers": Array([
        Object({
            "image": String("ghcr.io/..."),
            "name": String("zeroconf-broker")
        })
    ]),
    "imagePullSecrets": Array([
        Object({
            "name": String("ghcr")
        })
    ])
})

I'm unclear as to why.

Perhaps @bfjelds you have an inkling into the issue?

bfjelds commented 3 years ago

I've seen missing properties in what the Rust kube crate exposes... I think I stumbled across missing creation timestamps as well.

I wouldn't think that it would be a problem for this case though...seems like an odd property for someone to include in a Configuration (i would expect that to be a read only generated property).

Not sure about resources...

bfjelds commented 3 years ago

PodSpec resources should be exposed ... akri is accessing and modifying them here. Curious that it is not making it through the parse/unparse/reparse. I am curious at what point it is lost (is resources present after the initial typed parse? Is it in the deserialized string?)

DazWilkin commented 3 years ago

I was having difficulties parsing the debug output; the entire tree(s) appear to be lazily evaluated by the debugger and I was unsure how to force its hand.

I'll spend more time with it this morning.

DazWilkin commented 3 years ago

User (my) error on resources. I was testing valid and invalid JSON at the same time and the resources error was correctly arising from the misplaced resources section in the (intentionally) incorrect JSON.

However, there remain issues with creationTimestamp and managedFields (see below) and I have a hunch it's because akri-shared uses a (really) old version of kube (0.23.0 December 2019 ;-)) which corresponds to this commit when kube appeared to contain its own definition of Kubernetes' ObjectMeta:

https://github.com/clux/kube-rs/blob/023c785c0c451c7c8b1a3f37e37d3f937ec44467/src/api/metadata.rs#L38

In more recent versions, this uses k8s_openapi implementation:

https://github.com/clux/kube-rs/blob/54e74fb0b2321e41e9e0d8f9165639d07519379a/kube/src/api/metadata.rs#L19

I suspect this is the cause of both issues although I'm unable to explain the *Timestamp problem.

bfjelds commented 3 years ago

I'm not really familiar with ManagedFields (though from this documentation it sounds like something we can document as unvalidated for now) ... i think that creationTimestamp is something that people should not be setting in their Configuration. So maybe, for the sake of better-but-not-perfect, we can say that those properties are both either errors or unvalidated.

in the meantime, it seems like a worthwhile issue to point out how old kube is.

DazWilkin commented 3 years ago

I am unfamiliar with ManagedFields but the section appears to be generated by the CRD|Operator process.

Edit: I just realized from your comment that ManagedFields are created by the Webhook process.

I'm not setting creationTimestamp in the Configuration spec; it appears that this becomes fully 'materialized' before it's shipped to the validator.

Here's my input:

dazwilkin@akri:~$ more ./zeroconf.webthings.yaml 
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: webthings
spec:
  protocol:
    zeroconf:
      kind: "_webthing._tcp"
      port: 8888
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
        resources:
          limits:
            "{{PLACEHOLDER}}": "1"

And here's the AdmissionReview that the Webhook receives. The Akri Configuration is .request.object:

kind: AdmissionReview
apiVersion: admission.k8s.io/v1
request:
  uid: 2b752327-a529-4ffd-b2e2-478455e80a0d
  kind:
    group: akri.sh
    version: v0
    kind: Configuration
  resource:
    group: akri.sh
    version: v0
    resource: configurations
  requestKind:
    group: akri.sh
    version: v0
    kind: Configuration
  requestResource:
    group: akri.sh
    version: v0
    resource: configurations
  name: webthings
  namespace: default
  operation: CREATE
  userInfo:
    username: admin
    uid: admin
    groups:
      - system:masters
      - system:authenticated
  object:
    apiVersion: akri.sh/v0
    kind: Configuration
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"akri.sh/v0",...'
      creationTimestamp: "2020-12-30T17:47:26Z"
      generation: 1
      managedFields:
        - apiVersion: akri.sh/v0
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                ".": {}
                f:kubectl.kubernetes.io/last-applied-configuration: {}
            f:spec:
              ".": {}
              f:brokerPodSpec:
                ".": {}
                f:containers: {}
                f:imagePullSecrets: {}
              f:capacity: {}
              f:protocol:
                ".": {}
                f:zeroconf:
                  ".": {}
                  f:kind: {}
                  f:port: {}
          manager: kubectl
          operation: Update
          time: "2020-12-30T17:47:26Z"
      name: webthings
      namespace: default
      uid: b61d7604-97f0-46a3-adb2-adc0dedfb0c9
    spec:
      brokerPodSpec:
        containers:
          - image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
            name: zeroconf-broker
            resources:
              limits:
                "{{PLACEHOLDER}}": "1"
        imagePullSecrets:
          - name: ghcr
      capacity: 1
      protocol:
        zeroconf:
          kind: _webthing._tcp
          port: 8888
  oldObject: null
  dryRun: false
  options:
    kind: CreateOptions
    apiVersion: meta.k8s.io/v1
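
For completeness, the webhook answers with the same AdmissionReview kind, echoing request.uid; a denial corresponding to the earlier error would look roughly like:

apiVersion: admission.k8s.io/v1
kind: AdmissionReview
response:
  uid: 2b752327-a529-4ffd-b2e2-478455e80a0d
  allowed: false
  status:
    message: "Configuration does not include `resources.limits`"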

DazWilkin commented 3 years ago

my ideal would be to have this be rust code that existed in the controller binary ... we could add the service definition into the helm deployment code for the controller.

I mentioned to @kate-goldenring that I upgraded to K8s 1.20 and broke the certificate process in the Webhook. I decided this morning to try cert-manager (it is excellent). I have that working as a self-signing CA generating certs for the Webhook service; it would be trivial for others to swap out self-signing and swap in e.g. Let's Encrypt. Using cert-manager makes the deployment process for the Webhook more streamlined (mostly Kubernetes specs) and should be straightforward to add to the Helm chart for Akri.
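
A sketch of the cert-manager pieces just described (names and namespace are hypothetical): a self-signing Issuer plus a Certificate whose Secret the webhook mounts for TLS.

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akri-webhook
spec:
  secretName: akri-webhook-tls
  dnsNames:
    - akri-webhook.default.svc
    - akri-webhook.default.svc.cluster.local
  issuerRef:
    name: selfsigned
    kind: Issuer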

IIUC the Controller is not an HTTP daemon. Obviously, the Webhook is, and it requires TLS. The only shared dependency with Akri is the shared member (for KubeAkriConfig etc.). I propose (respectfully and acknowledging your ideal) that the Webhook remain separate:

Kubernetes Webhooks are distinct processes and out-of-band; they're registered as callbacks in the API server. Their development is distinct too; the resources they use aren't part of the K8s core and may evolve out-of-sync. I'm still taking a dependency for the Rust solution on a self-generated 'crate' from an OpenAPI spec published in an issue. The requirement for TLS adds complexity and it may be easier to keep it all out of the existing "core".

Ultimately, it's your project and I'll do as you wish.

bfjelds commented 3 years ago

I'm totally open to separating this from controller if it makes sense. No need to force a square peg into a round whatever :)

bfjelds commented 3 years ago

I appreciate your thought and energy ... am really looking forward to this!!

DazWilkin commented 3 years ago

The Helm Chart is working: https://github.com/DazWilkin/akri-webhook-helm

Tomorrow, I'll begin merging the Rust-based Webhook and the Helm chart into Akri proper.

bfjelds commented 3 years ago

The Helm Chart is working: https://github.com/DazWilkin/akri-webhook-helm

Tomorrow, I'll begin merging the Rust-based Webhook and the Helm chart into Akri proper.

I wish i could give more than one thumbs up :)