sbordeyne / kanidm-operator

Kubernetes Operator for KanIDM

Project not usable? #11

Open toastedcrumpets opened 4 months ago

toastedcrumpets commented 4 months ago

First, I want to say the idea of a kanidm operator is a fantastic one!

I tried to use it as follows, first clone the repo, then

$ kubectl kustomize manifests/crds > crds.yaml
$ kubectl kustomize manifests/operator > operator.yaml
$ kubectl apply -f crds.yaml
namespace/kanidm-system created
customresourcedefinition.apiextensions.k8s.io/accounts.kanidm.github.io created
customresourcedefinition.apiextensions.k8s.io/groups.kanidm.github.io created
customresourcedefinition.apiextensions.k8s.io/kanidms.kanidm.github.io created
customresourcedefinition.apiextensions.k8s.io/oauth2-clients.kanidm.github.io created
customresourcedefinition.apiextensions.k8s.io/password-badlists.kanidm.github.io created
customresourcedefinition.apiextensions.k8s.io/service-accounts.kanidm.github.io created
$ kubectl apply -f operator.yaml
namespace/kanidm-system configured
serviceaccount/kanidm-operator-account created
role.rbac.authorization.k8s.io/kanidm-operator-role-namespaced created
clusterrole.rbac.authorization.k8s.io/kanidm-operator-role-cluster created
rolebinding.rbac.authorization.k8s.io/kanidm-operator-rolebinding-namespaced created
clusterrolebinding.rbac.authorization.k8s.io/kanidm-operator-rolebinding-cluster created
deployment.apps/kanidm-operator created

Then I try running the example

$ kubectl apply -f manifests/example/kanidm.yaml
error: resource mapping not found for name: "kanidm" namespace: "kanidm" from "manifests/example/kanidm.yaml": no matches for kind "Kanidm" in version "v1alpha1"
ensure CRDs are installed first

Which doesn't make sense to me.

I also noticed that the operator pod is having issues terminating with the log

exec /root/.local/bin/poetry: exec format error

I tried forking, rebuilding the docker image, then deploying but my pod fails a liveness probe and seems to be stuck.

It seems development has halted, but I'd like to help out if I can! I'm a competent Python developer, but I've never written a k8s operator before.

toastedcrumpets commented 4 months ago

OK I've debugged the first issue, the example manifest needs apiVersion to have the group name from the CRD:

apiVersion: kanidm.github.io/v1alpha1
kind: Kanidm
metadata:
  name: kanidm
  namespace: kanidm
spec:
  version: 1.2.0
  database:
    fsType: other
    storageClass: longhorn
    storageSize: 1Gi
  domain: idm.example.com
  certificate:
    issuer: cluster-issuer
  logLevel: info
  backup:
    enabled: true
    schedule: "0 9 * * *"
    versions: 7
    storageClass: longhorn
    storageSize: 7Gi
  webPort: 8443
  ldapPort: 3890
  highAvailability:
    enabled: false
    replicas: 1

I'll start playing around with the docker image to see why my build isn't responding to liveness probes...

sbordeyne commented 4 months ago

Hello, thanks for this issue.

I've transitioned away from kanidm as my IdP to go back to something more robust (namely keycloak), so I've halted development on that side project. The initial goal was to provide a declarative way to instantiate kanidm in a cluster, which should probably just be an OpenTF provider (but I digress).

If you're interested in picking the project back up, I'll review and merge any PRs that you open, I might make a bit of CI/CD to help as well in my spare time.

toastedcrumpets commented 4 months ago

Hey, thanks for getting back to me. I've been playing around in my fork (https://github.com/toastedcrumpets/kanidm-operator) and I've managed to get it all booting. It's not deploying because it seems to be double-creating the certificate somehow. I'll keep playing around with it to learn more about kopf, as it's an interesting library!

I'd be interested in how you deploy keycloak. Ultimately I want a purely gitops solution to IdP, so if they have a good operator I'd probably transition to that as well. Kind regards, Marcus

sbordeyne commented 4 months ago

I don't have a good operator, I'm deploying a custom keycloak image (as documented in the official docs). I've seen some terraform providers for keycloak though, and I know there's an official operator (that didn't work for me but maybe it'll be useful to you)

toastedcrumpets commented 4 months ago

I've now managed to get the operator so it works on my 1.27 microk8s cluster. I had to fix quite a few issues with the template specs.

I notice that some of the templates are broken, like the ingress.yaml, and so you need to add ingress: false to the Kanidm spec when using it. Do you have any other commits not pushed that address some of these issues, or a fully working example yaml that has these? I've managed to figure out everything regarding the deployment of kanidm, but I'm now about to try the other functionality (groups/users/etc). Can you comment on what is working and what is not? Thanks!

sbordeyne commented 4 months ago

I need to pick this project back up; so far I haven't committed anything from my local repo because I got distracted with other projects. I'll see what I can work on this weekend to get it to a usable state.

toastedcrumpets commented 4 months ago

If you have anything in a separate repository, please just push it to GitHub master ASAP. It doesn't matter if it's broken, as the GitHub master branch is broken on current k8s versions anyway. For my own work I need to get something fully working in the next two days or so, so merging in your offline changes (if they're extensive) will get harder for me the longer you wait. Now that I have the operator deploying, I'm going to set up a GitHub workflow to build the image, push it to Docker Hub and run some basic deployment tests.

sbordeyne commented 4 months ago

Don't worry about it, I'll handle the merge conflicts when they arise. It should not be too difficult anyways. Don't hesitate to open a pull request with your changes.

I'll probably review the CI workflow carefully, but other than that, your contributions will be merged quite rapidly.

toastedcrumpets commented 4 months ago

OK, I'll probably try to make a pull request back to your repo once I've got everything working. For now I'll keep working in my repo. I've just implemented a GitHub Action that spins up a k8s cluster, runs the operator, adds a kanidm deployment, then shuts everything down while checking for errors. We can build on this as I check out other functionality. More to come!

toastedcrumpets commented 4 months ago

I thought I'd give you an update. The deployment now works fully and the secrets for the admin passwords are deployed. I'm testing adding a user, but it seems the Python kanidm library has bitrotted.

Adding a user gives this response

[2024-05-25 07:17:11,429] kopf.objects         [ERROR   ] [kanidm/marcus-user] Handler 'on_create_account' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
             ^^^^^^^^^^^^^^^^^^
  File "/app/kanidm_operator/deploy/account.py", line 19, in on_create_account
    await client.person_account_create(
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kanidm/__init__.py", line 747, in person_account_create
    return await self.call_post(Endpoints.PERSON, json=payload)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kanidm/__init__.py", line 257, in call_post
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/app/kanidm_operator/auth.py", line 99, in _call
    token_valid = await self.check_token_valid()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/kanidm_operator/auth.py", line 69, in check_token_valid
    valid = await super().check_token_valid(token)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kanidm/__init__.py", line 146, in check_token_valid
    result = await self.call_get(endpoint, headers=headers)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kanidm/__init__.py", line 246, in call_get
    return await self._call("GET", path, headers, timeout, params=params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/kanidm_operator/auth.py", line 98, in _call
    return await super()._call(method, path, headers, timeout, json, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/kanidm/__init__.py", line 223, in _call
    response = ClientResponse.model_validate(response_input)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/kanidm-operator-9TtSrW0h-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 551, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for ClientResponse
data
  Input should be a valid dictionary [type=dict_type, input_value='notauthenticated', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/dict_type
[2024-05-25 07:17:11,524] kopf.objects         [WARNING ] [kanidm/marcus-user] Patching failed with inconsistencies: (('remove', ('status',), {'kopf': {'progress': {'on_create_account': {'started': '2024-05-24T10:06:59.059987+00:00', 'stopped': None, 'delayed': '2024-05-25T07:18:11.431032+00:00', 'purpose': 'create', 'retries': 971, 'success': False, 'failure': False, 'message': "1 validation error for ClientResponse\ndata\n  Input should be a valid dictionary [type=dict_type, input_value='notauthenticated', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.7/v/dict_type", 'subrefs': None}}}}, None),)

Meanwhile the kanidm server logs these errors

a18a33ff-b333-4c73-96af-043102114420 INFO     request [ 169µs | 54.79% / 100.00% ] method: GET | uri: /v1/auth/valid | version: HTTP/1.1
a18a33ff-b333-4c73-96af-043102114420 INFO     ┝━ handle_auth_valid [ 76.2µs | 38.32% / 45.21% ]
a18a33ff-b333-4c73-96af-043102114420 INFO     │  ┝━ validate_client_auth_info_to_ident [ 11.6µs | 6.89% ]
a18a33ff-b333-4c73-96af-043102114420 ERROR    │  │  ┝━ 🚨 [error]: algorithm not available on this provider | provider_uuid: 00000000-0000-0000-0000-ffffff000025 | unsupported_alg: HS256
a18a33ff-b333-4c73-96af-043102114420 INFO     │  │  ┕━ i [info]: Unable to verify token | event_tag_id: 10 | err: KP0019KeyProviderUnsupportedAlgorithm
a18a33ff-b333-4c73-96af-043102114420 ERROR    │  ┕━ 🚨 [error]: Invalid identity: NotAuthenticated | event_tag_id: 1
a18a33ff-b333-4c73-96af-043102114420 WARN     ┕━ 🚧 [warn]:  | latency: 184.367µs | status_code: 401 | kopid: "a18a33ff-b333-4c73-96af-043102114420" | msg: "client error"

I'm not sure I want to try to debug this second project too; my suggestion is instead to spin up a sidecar container in the kanidm deployment using https://hub.docker.com/r/kanidm/tools/tags. Then I can use pod exec to run the kanidm CLI for my operations. What do you think?
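
Roughly what I'm thinking of, just as a sketch (image tags are placeholders and the existing server container is abbreviated):

# extra container added to the kanidm Deployment's pod spec
containers:
  - name: kanidm
    image: kanidm/server:latest   # the existing server container, unchanged
    # ...
  - name: kanidm-tools
    image: kanidm/tools:latest    # CLI-only image
    command: ["sleep", "infinity"]  # idle until we `kubectl exec` in to run kanidm commands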

sbordeyne commented 4 months ago

This second project is part of kanidm and it's technically officially maintained (but not much maintenance is done there, tbh). It would be worth debugging anyway. I'll try to have a look today and see if the fix is obvious or not. Bootstrapping a sidecar container should be a last-resort option: an operator should not take a significant amount of resources to perform its operations, which is why I don't think it's such a good idea.

sbordeyne commented 4 months ago

The issue seems to be with your authentication token anyway.

toastedcrumpets commented 4 months ago

Thanks for the rebuttal, it's great to talk this out.

Yes, the Python library is "official", but their FAQ has this to say (https://kanidm.github.io/kanidm/master/developers/faq.html#is-the-library-in-rust): "If it's not in Rust, it's not eligible for inclusion. There is a single exception today (rlm python) but it's very likely this will also be removed in the future." So while it's still around, it's effectively deprecated.

Regarding the token issue, that's entirely within the kanidm python library I think? You have a thin wrapper around their client class from what I see, so I'm not sure where the change on this side would be.

Regarding the sidecar, I'm prototyping something now. It's just an extra container on the kanidm deployment, simply to give me access to the kanidm CLI tool. It stays running with a "sleep infinity" command and only uses additional resources when I exec into the pod to run commands. A better option would perhaps be to create k8s Jobs to carry out the changes, so the container is deleted once each action is complete? I might go that way once I've proven the CLI approach works. The main blocker is that the CLI doesn't use tokens, only interactive (tty) password input. I can fake this out (the Python k8s library has an example), but it's ugly.

If you can help me fix the token issue I'll happily carry on using the python client, but I'm not taking on that project too, especially as the original developers want it to go away eventually!

sbordeyne commented 4 months ago

I think eventually using k8s Jobs would be way cleaner; the only thing would be to make sure the Jobs are cleaned up properly once the changes are made.
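
Something along these lines is what I have in mind; ttlSecondsAfterFinished covers the cleanup concern (names and the image tag are placeholders, and the CLI would still need its server URL and credentials wired in somehow):

apiVersion: batch/v1
kind: Job
metadata:
  name: kanidm-person-create-marcus   # placeholder name
  namespace: kanidm
spec:
  ttlSecondsAfterFinished: 300   # the Job and its pod are garbage-collected shortly after completion
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kanidm-cli
          image: kanidm/tools:latest
          # illustrative only — check the exact CLI syntax for the kanidm version in use,
          # and mount/configure the server URL and credentials for the CLI
          command: ["kanidm", "person", "create", "marcus", "Marcus"]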

Another option I can see is to install the kanidm CLI in the Docker image. The binary is quite small, but it would force the operator to be kept in sync with the kanidm release cycle. Maybe some kind of CI/CD to rebuild the image with the latest CLI would be alright.

toastedcrumpets commented 4 months ago

Just an update: my branch now has working users, groups, and oauth2 clients (oauth2 not fully tested). I dropped using Jobs, as I'd need to build a custom container to hold the required logic/exception handling around the CLI tool in Python. So instead I changed the operator container to include the CLI tool. This "pins" the operator to versions of kanidm that are compatible with that version of the CLI tool, which is not a big disadvantage.

On the todo list are service accounts, actual E2E testing of oauth/groups/users, and better documentation. I'll keep you posted!

toastedcrumpets commented 4 months ago

I'm now at a point where you might want to take a look at my branch and we can move towards a pull request? Just take a look and let me know what changes you'd need to see.

I've implemented a CI/CD pipeline in GitHub Actions that builds the operator container and pushes it to Docker Hub. The setup of the CI pipeline was intricate, as I needed to spin up a k8s cluster, with a certificate issuer, all without actually having external access.

The CI does not do much detailed verification of the code; it just deploys the operator and examples of all the CRDs I've (re)implemented and checks that these deploy without errors. I'm using this as a way of testing not only the operator but also the example CRDs. The main thing not covered by the CI pipeline is the Docker image, as I run the tests before it is built/deployed.
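
The deploy-test part of the workflow is roughly this shape (not the exact file from my branch; action versions, manifest URLs and paths are illustrative):

name: e2e
on: [push, pull_request]
jobs:
  deploy-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Spin up a throwaway cluster
        uses: helm/kind-action@v1   # or any other way to create a disposable cluster
      - name: Install cert-manager and a self-signed issuer
        run: |
          kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.5/cert-manager.yaml
          kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s
          kubectl apply -f ci/selfsigned-issuer.yaml   # hypothetical path
      - name: Deploy CRDs, operator and examples, and check they come up
        run: |
          kubectl apply -k manifests/crds
          kubectl apply -k manifests/operator
          kubectl -n kanidm-system rollout status deploy/kanidm-operator --timeout=300s
          kubectl apply -f manifests/example/kanidm.yaml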

I've written some documentation for the Readme.md file, which should be enough for others to get started. My intent is to have the example CRDs explain the options. One key point: what licence do you want to apply to your code (and thus my contributions)? I'm happy with any open source licence.

I'm now deploying this operator in my homelab and will work out the remaining bugs while I integrate it with my apps like forgejo and nginx-ingress. I'll keep pushing to my branch while this happens, but I don't foresee any major changes, just extensions (e.g. for service accounts, which are still TBD) and bug fixes.

sbordeyne commented 4 months ago

> I'm now at a point where you might want to take a look at my branch and we can move towards a pull request? Just take a look and let me know what changes you'd need to see.

I'll take a look later today and see what kind of changes need to be done. I had a look earlier, and would need to have the readme point to the "official" docker image and the official repo (instead of your fork). One change that needs to stick though is that I changed the Docker image to use a multi-stage build (dropping its size from 800MB to barely 8MB). Operators should be very lightweight after all; there's no need for the whole Python shebang to be included in the image. Some certificate issues might arise though, as I haven't tested the change fully.

> I've implemented a CI/CD pipeline in GitHub Actions that builds the operator container and pushes it to Docker Hub. The setup of the CI pipeline was intricate, as I needed to spin up a k8s cluster, with a certificate issuer, all without actually having external access.

Since I originally made this to work in my homelab with traefik as an ingress controller and cert-manager as a cert issuer, I'd consider at least cert-manager as a required dependency of that operator. I'd spin up a cluster and install the cert-manager operator with a self-signed issuer to test.
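
For testing, the issuer can be as simple as the standard cert-manager self-signed ClusterIssuer; the name just has to match what the Kanidm manifest references:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-issuer   # matches spec.certificate.issuer in the example manifest
spec:
  selfSigned: {}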

> The CI does not do much detailed verification of the code; it just deploys the operator and examples of all the CRDs I've (re)implemented and checks that these deploy without errors. I'm using this as a way of testing not only the operator but also the example CRDs. The main thing not covered by the CI pipeline is the Docker image, as I run the tests before it is built/deployed.

Fine by me. After all, the provided examples should "just work".

> I've written some documentation for the Readme.md file, which should be enough for others to get started. My intent is to have the example CRDs explain the options. One key point: what licence do you want to apply to your code (and thus my contributions)? I'm happy with any open source licence.

Disagree. I'll write some "easy" examples that do not exploit every feature of the operator, but just work for simple use cases (namely, I have a homelab and want SSO with kanidm with a couple users and groups)

> I'm now deploying this operator in my homelab and will work out the remaining bugs while I integrate it with my apps like forgejo and nginx-ingress. I'll keep pushing to my branch while this happens, but I don't foresee any major changes, just extensions (e.g. for service accounts, which are still TBD) and bug fixes.

I still need to figure out a good way to integrate major k8s ingress controllers into the operator. I personally am using traefik because it's honestly the best self-hosted ingress controller out there (well, the most advanced one). I know that nginx-proxy-manager is another popular solution, nginx-ingress/ingress-nginx are used sometimes, and then there's caddy and the cloud-provided ingress controllers (GKE, AKS, etc.).

Probably going to focus on traefik and nginx-proxy-manager as those are by far the most popular solutions for ingress control. Caddy would go third in the list of priorities.

> Just an update: my branch now has working users, groups, and oauth2 clients (oauth2 not fully tested). I dropped using Jobs, as I'd need to build a custom container to hold the required logic/exception handling around the CLI tool in Python. So instead I changed the operator container to include the CLI tool. This "pins" the operator to versions of kanidm that are compatible with that version of the CLI tool, which is not a big disadvantage.

As mentioned above, I had thought of that solution, and I find it acceptable, the binary being quite small after all. I'll have a deeper look at your PR since I want the code to be readable, and would need to see an actual interface to abstract away the CLI calls from the main business logic of the code (I want the main code to be readable Python, and not mix in some CLI error handling and so on). Since I have not yet taken a look, I'll let you know whether changes need to be made.

sbordeyne commented 4 months ago

One note I've noticed while perusing your code: it would be good to set the crds option in the example's kustomization.yaml file. Indeed, that would allow kustomize to find the CRDs and inject commonLabels, namespace and such without having to define these values in every custom resource.
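
Something along these lines (paths and labels here are only illustrative):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kanidm
commonLabels:
  app.kubernetes.io/part-of: kanidm
crds:
  - ../crds/kanidm-crd.yaml   # point at the actual CRD definition files so kustomize understands the custom kinds
resources:
  - kanidm.yaml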

toastedcrumpets commented 4 months ago

> I'll take a look later today and see what kind of changes need to be done. I had a look earlier, and would need to have the readme point to the "official" docker image and the official repo (instead of your fork).

Sure, you can point it at whatever container repository you'd like.

> One change that needs to stick though is that I changed the Docker image to use a multi-stage build (dropping its size from 800MB to barely 8MB). Operators should be very lightweight after all; there's no need for the whole Python shebang to be included in the image. Some certificate issues might arise though, as I haven't tested the change fully.

I'll merge this in later today.

> Since I originally made this to work in my homelab with traefik as an ingress controller and cert-manager as a cert issuer, I'd consider at least cert-manager as a required dependency of that operator. I'd spin up a cluster and install the cert-manager operator with a self-signed issuer to test.

I've documented this dependency in the Readme, and what you're describing is exactly what the GitHub Action does for the CI tests.

> One key point: what licence do you want to apply to your code (and thus my contributions)? I'm happy with any open source licence.

This is still a key issue; I'm nervous that I've extended the work without a proper licence in place!

> Disagree. I'll write some "easy" examples that do not exploit every feature of the operator, but just work for simple use cases (namely, I have a homelab and want SSO with kanidm with a couple users and groups)

Sure, we could perhaps have some more example folders in there with different approaches. I'd still like them to be covered by the unit tests/CI though: in your current repo the examples are incomplete/broken, which is a frustrating user experience, and I'd like CI to catch that. That's also why I think examples in the Readme.md should be avoided, as they're hard to test.

> I still need to figure out a good way to integrate major k8s ingress controllers into the operator.

I think we can just leave a fully configurable ingress block. I moved away from specialising around a particular ingress controller and instead put everything in annotations. There's probably more to do in generalising parts of the ingress spec, but for now you can make kanidm's ingress work with any ingress controller, I think?
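
Roughly the shape I mean; the exact field names in the CRD may differ, so treat this as illustrative:

spec:
  ingress:
    enabled: true
    className: traefik    # or nginx, etc.
    annotations:          # passed through verbatim to the generated Ingress
      cert-manager.io/cluster-issuer: cluster-issuer
      traefik.ingress.kubernetes.io/router.entrypoints: websecure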

> As mentioned above, I had thought of that solution, and I find it acceptable, the binary being quite small after all. I'll have a deeper look at your PR since I want the code to be readable, and would need to see an actual interface to abstract away the CLI calls from the main business logic of the code (I want the main code to be readable Python, and not mix in some CLI error handling and so on). Since I have not yet taken a look, I'll let you know whether changes need to be made.

Sure, I noticed you like your abstractions! I'll look forward to hearing back.

> One note I've noticed while perusing your code: it would be good to set the crds option in the example's kustomization.yaml file. Indeed, that would allow kustomize to find the CRDs and inject commonLabels, namespace and such without having to define these values in every custom resource.

That's fine. At the moment I'm deploying these one by one in the CI tests, which is why everything is specified fully without a kustomization.yaml, so I might avoid it in the current examples, but we could create a kustomization example folder as well as this "full features testing" example folder?

toastedcrumpets commented 4 months ago

I followed your latest commit on the Docker image and made some changes to the Dockerfile to separate the build/deploy images. I had to make one significant change: I didn't use alpine for the final image, as it places the libudev.so library differently than slim (and is missing the required version of libgcc_s.so), so I'm using slim for the final stage for now. The current total image size is 190MB. The 3.11-alpine base is 18.97MB while 3.11-slim is 46.68MB, so I can't blame all of the image size on the base! I'll look for any unneeded deps.