stackitcloud / yawol

yawol is a Load Balancer solution for OpenStack, based on the Kubernetes controller pattern.
Apache License 2.0

Installation not working #96

Open mauhau opened 1 year ago

mauhau commented 1 year ago

I tried to follow the installation as described in the README and failed at several points.

Image Build

First I failed at the terraform apply for the build infrastructure and had to adapt the name of the floating IP network:

diff --git a/hack/packer-infrastructure/variables.tf b/hack/packer-infrastructure/variables.tf
index 1ea4f4c..c366e21 100644
--- a/hack/packer-infrastructure/variables.tf
+++ b/hack/packer-infrastructure/variables.tf
@@ -1,4 +1,4 @@
 variable "floating_ip_network_name" {
-  default = "floating-net"
+  default = "extnet"
   description = "Name of the network your floating IPs are hosted in"
 }

And set a DNS server for the created network:

diff --git a/hack/packer-infrastructure/network.tf b/hack/packer-infrastructure/network.tf
index 3fa8859..b90df5e 100644
--- a/hack/packer-infrastructure/network.tf
+++ b/hack/packer-infrastructure/network.tf
@@ -8,6 +8,7 @@ resource "openstack_networking_subnet_v2" "packer" {
   network_id = openstack_networking_network_v2.packer.id
   cidr       = "192.168.48.0/24"
   ip_version = 4
+  dns_nameservers = [ "8.8.8.8" ]
 }

The next failure was at the earthly build. To make that work I had to adapt the flavor type and volume type:

diff --git a/image/alpine-yawol.pkr.hcl b/image/alpine-yawol.pkr.hcl
index 13531da..b231088 100644
--- a/image/alpine-yawol.pkr.hcl
+++ b/image/alpine-yawol.pkr.hcl
@@ -35,7 +35,7 @@ variable "security_group_id" {

 variable "machine_flavor" {
   type        = string
-  default     = "c1.2"
+  default     = "my_tiny_flavor"
   description = "The ID, name, or full URL for the desired flavor for the server to be created."
 }

@@ -65,7 +65,7 @@ source "openstack" "yawollet" {
   ssh_username            = "alpine"
   use_blockstorage_volume = true
   volume_size             = 1
-  volume_type             = "storage_premium_perf6"
+  volume_type             = "__DEFAULT__"
   ssh_timeout             = "10m"
   image_tags              = var.image_tags
 }

After that I got the yawol-alpine-v0.11.0-1 image.

Cluster Installation

For the cluster installation it would be nice to have a link on how to install VerticalPodAutoscaler (which is not that hard to find, but would make life easier 😸).

I had to create the Secret in namespace kube-system otherwise the controller did not find it.

In the Secret there is a tenant-name, which is no longer used in OpenStack, so it's not clear what needs to be provided here.

The images referenced in the helm chart do not exist. I adapted the values.yaml to make that work:

diff --git a/charts/yawol-controller/values.yaml b/charts/yawol-controller/values.yaml
index acc96a9..a423ff8 100644
--- a/charts/yawol-controller/values.yaml
+++ b/charts/yawol-controller/values.yaml
@@ -17,13 +17,13 @@ yawolCloudController:
   clusterRoleEnabled: true
   image:
     repository: ghcr.io/stackitcloud/yawol/yawol-cloud-controller
-    tag: latest
+    tag: v0.11.0

 yawolController:
   gardenerMonitoringEnabled: false
   image:
     repository: ghcr.io/stackitcloud/yawol/yawol-controller
-    tag: latest
+    tag: v0.11.0

The provided example service does not work out of the box because it sets yawol.stackit.cloud/className: "test". After commenting this out, the service gets handled by the yawol controller.

Finally I had to give up at this error:

1.6742993640043612e+09  ERROR   Reconciler error    {"controller": "loadbalancer", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancer", "loadBalancer": {"name":"yawol-test--lb-test","namespace":"kube-system"}, "namespace": "kube-system", "name": "yawol-test--lb-test", "reconcileID": "bd3a2b22-c3ff-48f4-bc0f-a7d897195d18", "error": "You must provide exactly one of DomainID or DomainName to authenticate by Username"}

Any hints are welcome as I would like to see it running 👾

dergeberl commented 1 year ago

Hi @mauhau,

thank you for sharing your experience and showing the weak points of the documentation.

Image build

I will add some more docs there to show which settings depend on the OpenStack installation.

Cluster Installation

You are right, with the default helm values for yawol a VerticalPodAutoscaler is needed. You can disable VPA in the Helm values with vpa.enabled: false. I think we should disable it by default or somehow check whether VPA is available in the k8s cluster (what do you think, @breuerfelix @einfachnuralex?).

You are right, tenant-name is not used in OpenStack. I looked it up, and currently it needs to be set to the project name (so for now please set tenant-name to the name of your OpenStack project). Also, the project-name shown in the docs is not used. I will "fix" that and also add more info about it to the docs.

@breuerfelix should we also build latest, or pin a fixed image in the helm values? I think pinned images make more sense, because latest would result in an update on the cluster without the CRDs getting updated.

The example service is used for the development environment (docs). This setting is needed to be able to run multiple yawol-cloud-controllers. By default, every k8s service of type LoadBalancer without it is handled by yawol.

Your last error looks like an error with the login to OpenStack. I think the domain-name is missing in your secret.

I hope this helps you to get it running 😃

mauhau commented 1 year ago

Great, thank you.

For the documentation of the secret I would propose something like this:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-provider-config
type: Opaque
stringData:
  cloudprovider.conf: |-
    [Global]
    auth-url="<OS_AUTH_URL>"
    domain-name="<OS_USER_DOMAIN_NAME>"
    tenant-name="<OS_PROJECT_NAME>"
    project-name="<OS_PROJECT_NAME>"
    username="<OS_USERNAME>"
    password="<OS_PASSWORD>"
    region="<OS_REGION_NAME>"

After setting the secret correctly yawol creates SecGroup, FloatingIP and Port, but no Server:

1.6744837509860816e+09  INFO    controller.LoadBalancer Reconcile Openstack     {"lb": "yawol-test--lb-test3"}
1.6744837509861178e+09  INFO    controller.LoadBalancer Reconcile SecGroup      {"lb": "yawol-test--lb-test3"}
1.6744837514294589e+09  INFO    controller.LoadBalancer Create SecGroup {"lb": "yawol-test--lb-test3"}
1.6744837517297585e+09  INFO    controller.LoadBalancer Reconcile SecGroupRules {"lb": "yawol-test--lb-test3"}
1.6744837517299604e+09  DEBUG   events  Warning {"object": {"kind":"LoadBalancer","namespace":"kube-system","name":"yawol-test--lb-test3","uid":"d3f44e80-0747-4eef-98dd-d0f66e23fcd5","apiVersion":"yawol.stackit.cloud/v1beta1","resourceVersion":"1016080"}, "reason": "Warning", "message": "DebugSettings are enabled, Port 22 is open to all IP ranges."}
1.6744837533923779e+09  INFO    controller.LoadBalancer Reconcile FloatingIP    {"lb": "yawol-test--lb-test3"}
1.6744837534356053e+09  INFO    controller.LoadBalancer Create FloatingIP       {"lb": "yawol-test--lb-test3"}
1.6744837538124757e+09  INFO    controller.LoadBalancer Update ExternalIP       {"lb": "yawol-test--lb-test3"}
1.6744837538338878e+09  INFO    controller.LoadBalancer Reconcile Port  {"lb": "yawol-test--lb-test3"}
1.6744837539900713e+09  INFO    controller.LoadBalancer Create Port     {"lb": "yawol-test--lb-test3"}
1.6744837544405751e+09  INFO    controller.LoadBalancer successfully created port       {"id": "5a673b2a-6225-453e-bd73-c798f1f36d88", "lb": "kube-system/yawol-test--lb-test3"}
1.6744837549955077e+09  INFO    controller.LoadBalancer Reconcile FIPAssociate  {"lb": "yawol-test--lb-test3"}
1.674483755012449e+09   INFO    controller.LoadBalancer Bind FloatingIP to Port {"lb": "yawol-test--lb-test3"}
1.6744837557283204e+09  INFO    controller.LoadBalancer Reconcile Openstack     {"lb": "yawol-test--lb-test3"}
1.6744837557283442e+09  INFO    controller.LoadBalancer Reconcile SecGroup      {"lb": "yawol-test--lb-test3"}
1.6744837559698987e+09  INFO    controller.LoadBalancer Reconcile SecGroupRules {"lb": "yawol-test--lb-test3"}
1.6744837559700172e+09  DEBUG   events  Warning {"object": {"kind":"LoadBalancer","namespace":"kube-system","name":"yawol-test--lb-test3","uid":"d3f44e80-0747-4eef-98dd-d0f66e23fcd5","apiVersion":"yawol.stackit.cloud/v1beta1","resourceVersion":"1016099"}, "reason": "Warning", "message": "DebugSettings are enabled, Port 22 is open to all IP ranges."}
1.6744837564755807e+09  INFO    controller.LoadBalancer Reconcile FloatingIP    {"lb": "yawol-test--lb-test3"}
1.6744837564999213e+09  INFO    controller.LoadBalancer Reconcile Port  {"lb": "yawol-test--lb-test3"}
1.674483756517218e+09   INFO    controller.LoadBalancer Reconcile FIPAssociate  {"lb": "yawol-test--lb-test3"}
1.6744839279968724e+09  INFO    controller.LoadBalancer LoadBalancer not found  {"lb": "kube-system/yawol-test--lb-test"}
1.6744840405738192e+09  INFO    controller.LoadBalancer LoadBalancer not found  {"lb": "kube-system/yawol-test--lb-test2"}

Any hints how to debug/fix this?
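
When reading controller logs like the ones above, it can help to turn the leading field into a readable time. That field appears to be a Unix epoch timestamp in seconds: a later log line in this thread carries both an ISO timestamp and the epoch value, and the two agree. The following is a minimal Python sketch, not part of yawol; the function name parse_log_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def parse_log_timestamp(line: str) -> datetime:
    """Convert the leading epoch-seconds field of a log line to a UTC datetime."""
    # The first whitespace-separated field looks like 1.6744837509860816e+09.
    first_field = line.split(maxsplit=1)[0]
    return datetime.fromtimestamp(float(first_field), tz=timezone.utc)


if __name__ == "__main__":
    sample = "1.6744837509860816e+09  INFO    controller.LoadBalancer Reconcile Openstack"
    print(parse_log_timestamp(sample).isoformat())
```

This makes it easier to line up controller events with timestamps shown elsewhere, for example the creationTimestamp of Kubernetes objects.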

mauhau commented 1 year ago

In the log of the yawol-controller-loadbalancermachine I can find this error:

1.6744890245824962e+09  INFO    controller.LoadBalancerMachine  Check SA        {"loadBalancerMachineName": "default--lb-test-zizxkwex25rbtquw-cdbe6"}
1.674489024582533e+09   INFO    controller.LoadBalancerMachine  Check role      {"loadBalancerMachineName": "default--lb-test-zizxkwex25rbtquw-cdbe6"}
1.6744890245825515e+09  INFO    controller.LoadBalancerMachine  Check rolebinding       {"loadBalancerMachineName": "default--lb-test-zizxkwex25rbtquw-cdbe6"}
1.6744890250379155e+09  ERROR   Reconciler error        {"controller": "loadbalancermachine", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancerMachine", "loadBalancerMachine": {"name":"default--lb-test-zizxkwex25rbtquw-cdbe6","namespace":"kube-system"}, "namespace": "kube-system", "name": "default--lb-test-zizxkwex25rbtquw-cdbe6", "reconcileID": "5d85e00a-3f04-4ee5-ba88-504a1b268c0c", "error": "secret not found for serviceAccount default--lb-test-zizxkwex25rbtquw-cdbe6"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234

Yes, I know the service name changed 😄

mauhau commented 1 year ago

I think we found the problem.

We are using Kubernetes 1.25.5. After manually creating the secret for the service account reconciled by yawol-controller (following this documentation) and adding the secret to the service account, we finally got a working loadbalancer machine 👾

kubectl -n kube-system apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: default--lb-test-secret
  annotations:
    kubernetes.io/service-account.name: default--loadbalancer-rfeqiyrg7dcknhf3-1b890
type: kubernetes.io/service-account-token
EOF

root@garden-base-k8s-master-nf-1:~# kubectl -n kube-system get serviceaccounts default--loadbalancer-rfeqiyrg7dcknhf3-1b890 -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2023-01-23T16:50:12Z"
  name: default--loadbalancer-rfeqiyrg7dcknhf3-1b890
  namespace: kube-system
  resourceVersion: "1041342"
  uid: 64f00b42-32e4-42ae-885b-380a8edab571
secrets:
- name: default--lb-test-secret

So maybe you should mention which kubernetes versions are currently supported by yawol.
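
For context, Kubernetes 1.24 and later no longer create a token Secret automatically for each ServiceAccount, which matches the "secret not found for serviceAccount" error above. The workaround manifest can also be generated instead of typed by hand. The following is a minimal Python sketch under that assumption; the helper name service_account_token_secret is hypothetical and the names passed in are the ones from this comment.

```python
import json


def service_account_token_secret(sa_name: str, secret_name: str,
                                 namespace: str = "kube-system") -> dict:
    """Build a Secret manifest that asks Kubernetes to issue a token for sa_name."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": secret_name,
            "namespace": namespace,
            # This annotation links the Secret to the ServiceAccount.
            "annotations": {"kubernetes.io/service-account.name": sa_name},
        },
        "type": "kubernetes.io/service-account-token",
    }


if __name__ == "__main__":
    manifest = service_account_token_secret(
        sa_name="default--loadbalancer-rfeqiyrg7dcknhf3-1b890",
        secret_name="default--lb-test-secret",
    )
    # kubectl accepts JSON as well as YAML on stdin.
    print(json.dumps(manifest, indent=2))
```

The output can be piped into kubectl apply -f -. As described above, the Secret still had to be added to the secrets list of the ServiceAccount afterwards.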

dergeberl commented 1 year ago

Yes, that is correct. Currently yawol only works with Kubernetes 1.23 or lower. We will add support for newer versions soon :)

I created some issues to track these topics: #98 #99 #100 #101

dergeberl commented 1 year ago

Hi @mauhau,

thanks again for your feedback. We solved all your points in the version v0.12.0:

We would be happy if you could look over the changes and give us feedback again.

mauhau commented 1 year ago

Hey @dergeberl,

thank you for the update! I will check that soon and will update the issue. Sounds great!

Another question not related to this issue: If we want to use the yawol loadbalancer for Gardener Shoot clusters, is there any documentation on how to do that?

nschad commented 1 year ago

> Hey @dergeberl,
>
> thank you for the update! I will check that soon and will update the issue. Sounds great!
>
> Another question not related to this issue: If we want to use the yawol loadbalancer for Gardener Shoot clusters, is there any documentation on how to do that?

If you use Gardener too, you would most likely write an extension for yawol. At least that is what we did; unfortunately, that extension is not public yet.

mauhau commented 1 year ago

I just tested the updates. It works almost perfectly now.

The only remaining problem is the Secret cloud-provider-config: the namespace kube-system is missing, and it is not allowed to set both domain-name and domain-id:

ERROR   Reconciler error    {"controller": "loadbalancer", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancer", "loadBalancer": {"name":"default--nginx","namespace":"kube-system"}, "namespace": "kube-system", "name": "default--nginx", "reconcileID": "d78086e0-e213-483d-92d1-cb2d11c6c936", "error": "You must provide exactly one of DomainID or DomainName to authenticate by Username"}

Hence I would suggest it in the documentation like this:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-provider-config
  namespace: kube-system
type: Opaque
stringData:
  cloudprovider.conf: |-
    [Global]
    auth-url="<OS_AUTH_URL>"
    domain-name="<OS_USER_DOMAIN_NAME>"
    # specify either domain-name or domain-id (mutually exclusive)
    # domain-id="<OS_DOMAIN_ID>"
    # Deprecated (tenant-name): Please use project-name, only used if project-name is not set.
    # tenant-name="<OS_PROJECT_NAME>"
    project-name="<OS_PROJECT_NAME>"
    project-id="<OS_PROJECT_ID>"
    username="<OS_USERNAME>"
    password="<OS_PASSWORD>"
    region="<OS_REGION_NAME>"
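
The mutual-exclusion rule can be checked before applying the Secret. The following is a minimal Python sketch, not part of yawol: it parses the [Global] section with the standard configparser module and reports conflicting keys. The function name check_cloud_conf is hypothetical, and the rules encoded are only those visible in this thread: the controller error demanding exactly one of domain-id or domain-name, and the docs note allowing at most one of project-id or project-name.

```python
import configparser
import textwrap


def check_cloud_conf(text: str) -> list:
    """Return a list of problems found in the [Global] section of cloudprovider.conf."""
    # interpolation=None keeps '%' characters in passwords from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(textwrap.dedent(text))
    section = parser["Global"]

    problems = []
    domain_keys = [k for k in ("domain-id", "domain-name") if k in section]
    if len(domain_keys) != 1:
        problems.append(
            "exactly one of domain-id or domain-name must be set, found: %s" % domain_keys
        )
    project_keys = [k for k in ("project-id", "project-name") if k in section]
    if len(project_keys) > 1:
        problems.append(
            "at most one of project-id or project-name may be set, found: %s" % project_keys
        )
    return problems


if __name__ == "__main__":
    conf = """
    [Global]
    auth-url="<OS_AUTH_URL>"
    domain-name="<OS_USER_DOMAIN_NAME>"
    # domain-id="<OS_DOMAIN_ID>"
    project-name="<OS_PROJECT_NAME>"
    username="<OS_USERNAME>"
    """
    print(check_cloud_conf(conf))
```

Commented-out keys are ignored, so a config that keeps domain-id behind a # passes the domain check. A config that sets both project-name and project-id would be flagged by the second rule.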

Everything else is working. Great job 👍

You could improve the build documentation with a hint that earthly +build-packer-environment... will show the required variables for Build yawol OpenStack Image.

Installation and use of yawol could also be easier if the helm charts were available as a tgz artifact in the release, along with a ready-made OpenStack image. But you probably know that already.

mauhau commented 1 year ago

And just found another one.

The CRD does not accept the values for serverGroupPolicy as documented:

2023-02-03T16:35:33.600761470Z 1.6754421336006477e+09   ERROR   Reconciler error    {"controller": "loadbalancer", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancer", "loadBalancer": {"name":"default--nginx","namespace":"kube-system"}, "namespace": "kube-system", "name": "default--nginx", "reconcileID": "f5e3dac9-c35b-4137-b979-e182837ca9f0", "error": "Bad request with: [POST https://***:32443/v2.1/os-server-groups], error message: {\"badRequest\": {\"code\": 400, \"message\": \"Invalid input for field/attribute 0. Value: soft-anti-affinity. 'soft-anti-affinity' is not one of ['anti-affinity', 'affinity']\"}}"

nschad commented 1 year ago

> I just tested the updates. It works almost perfectly now.
>
> The only remaining problem is the Secret cloud-provider-config: the namespace kube-system is missing, and it is not allowed to set both domain-name and domain-id:
>
> ERROR Reconciler error    {"controller": "loadbalancer", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancer", "loadBalancer": {"name":"default--nginx","namespace":"kube-system"}, "namespace": "kube-system", "name": "default--nginx", "reconcileID": "d78086e0-e213-483d-92d1-cb2d11c6c936", "error": "You must provide exactly one of DomainID or DomainName to authenticate by Username"}
>
> Hence I would suggest it in the documentation like this:
>
> apiVersion: v1
> kind: Secret
> metadata:
>   name: cloud-provider-config
>   namespace: kube-system
> type: Opaque
> stringData:
>   cloudprovider.conf: |-
>     [Global]
>     auth-url="<OS_AUTH_URL>"
>     domain-name="<OS_USER_DOMAIN_NAME>"
>     # specify either domain-name or domain-id (mutually exclusive)
>     # domain-id="<OS_DOMAIN_ID>"
>     # Deprecated (tenant-name): Please use project-name, only used if project-name is not set.
>     # tenant-name="<OS_PROJECT_NAME>"
>     project-name="<OS_PROJECT_NAME>"
>     project-id="<OS_PROJECT_ID>"
>     username="<OS_USERNAME>"
>     password="<OS_PASSWORD>"
>     region="<OS_REGION_NAME>"
>
> Everything else is working. Great job 👍

This is already documented. Just above the example :)

Note: At most one of domain-id or domain-name and project-id or project-name must be provided.

> And just found another one.
>
> The CRD does not accept the values for serverGroupPolicy as documented:
>
> 2023-02-03T16:35:33.600761470Z 1.6754421336006477e+09 ERROR   Reconciler error    {"controller": "loadbalancer", "controllerGroup": "yawol.stackit.cloud", "controllerKind": "LoadBalancer", "loadBalancer": {"name":"default--nginx","namespace":"kube-system"}, "namespace": "kube-system", "name": "default--nginx", "reconcileID": "f5e3dac9-c35b-4137-b979-e182837ca9f0", "error": "Bad request with: [POST https://***:32443/v2.1/os-server-groups], error message: {\"badRequest\": {\"code\": 400, \"message\": \"Invalid input for field/attribute 0. Value: soft-anti-affinity. 'soft-anti-affinity' is not one of ['anti-affinity', 'affinity']\"}}"

The error is coming from your OpenStack API. It seems like your environment doesn't support it.

mauhau commented 1 year ago

Ah. Now I see the Note above the example 😄

The other point is weird. My OpenStack is fine with soft-anti-affinity:

(openstack) $ openstack server group list
+--------------------------------------+-------------------+--------------------+
| ID                                   | Name              | Policy             |
+--------------------------------------+-------------------+--------------------+
| fa8aa365-0070-4e1e-a9dd-501933653b63 | k8s-node-srvgrp   | soft-anti-affinity |
| d9177be6-6718-4516-8b7c-a65f0d5e03e8 | k8s-master-srvgrp | soft-anti-affinity |
+--------------------------------------+-------------------+--------------------+

But I also see in the log it was returned by OpenStack API.

nschad commented 1 year ago

> Ah. Now I see the Note above the example 😄
>
> The other point is weird. My OpenStack is fine with soft-anti-affinity:
>
> (openstack) $ openstack server group list
> +--------------------------------------+-------------------+--------------------+
> | ID                                   | Name              | Policy             |
> +--------------------------------------+-------------------+--------------------+
> | fa8aa365-0070-4e1e-a9dd-501933653b63 | k8s-node-srvgrp   | soft-anti-affinity |
> | d9177be6-6718-4516-8b7c-a65f0d5e03e8 | k8s-master-srvgrp | soft-anti-affinity |
> +--------------------------------------+-------------------+--------------------+
>
> But I also see in the log it was returned by OpenStack API.

If I had to guess, I would say that you are hitting an old API. Sometimes there are multiple API versions for different OpenStack identity providers. But I'm not an OpenStack expert 🤷, sorry.

mauhau commented 1 year ago

Good point. The Nova API of my OpenStack Yoga returns version 2.90, and according to the OpenStack API documentation this should be fine. It stays weird 😕
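
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the Nova API the soft-affinity and soft-anti-affinity policies were introduced with compute microversion 2.15, and a request that does not ask for a microversion is handled as 2.1, whatever maximum version the cloud advertises. The request in the log goes to /v2.1/os-server-groups, and this thread does not show whether the yawol controller sends a microversion header. The following minimal Python sketch illustrates the rule; the function names are hypothetical.

```python
from typing import List, Tuple

# Compute API microversion that introduced the soft-* server group policies.
SOFT_POLICIES_MIN = (2, 15)


def parse_microversion(version: str) -> Tuple[int, int]:
    """Split a microversion such as '2.90' into comparable integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def allowed_policies(requested_microversion: str) -> List[str]:
    """Policies Nova would accept for a request made at the given microversion."""
    policies = ["anti-affinity", "affinity"]
    if parse_microversion(requested_microversion) >= SOFT_POLICIES_MIN:
        policies += ["soft-anti-affinity", "soft-affinity"]
    return policies


if __name__ == "__main__":
    # A request without a microversion header is treated as 2.1.
    print(allowed_policies("2.1"))
    # The maximum advertised by the cloud in this thread.
    print(allowed_policies("2.90"))
```

Versions are compared as integer pairs because 2.9 is older than 2.15, which a plain string or float comparison would get wrong. If this is the cause, the client would need to request at least 2.15, for example through the OpenStack-API-Version: compute 2.15 header.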