
[Bug] Load Balancer Deployment Fails in Both Guest and Harvester Cluster Scenarios #5033

dimepues opened this issue 7 months ago

dimepues commented 7 months ago

Describe the bug

Load balancer deployments fail both in guest clusters and directly within the Harvester cluster. Despite creating appropriate IP pools, the load balancers fail to get an IP assigned or encounter timeout errors.

To Reproduce

Expected behavior

Load balancers should successfully deploy, acquire an IP from the pool, and operate without errors.

Troubleshooting Steps Taken

Errors Encountered

Support bundle

supportbundle_d8c90104-75c7-4adb-bfef-14f4f49f1c00_2024-01-26T03-43-46Z.zip

Environment

Additional context

w13915984028 commented 7 months ago

This issue is quite possibly similar to https://github.com/harvester/harvester/issues/5072#issuecomment-1920653841:

the ipam.NewAllocator does not finish its initialization in time, nor does it set a status to indicate that it can start to serve new allocations.

func (h *Handler) OnChange(_ string, ipPool *lbv1.IPPool) (*lbv1.IPPool, error) {
    previousAllocator := h.allocatorMap.Get(ipPool.Name)
    // A new allocator is built whenever none exists yet or the pool ranges changed.
    if previousAllocator == nil || previousAllocator.CheckSum() != ipam.CalculateCheckSum(ipPool.Spec.Ranges) {
        a, err := ipam.NewAllocator(ipPool.Name, ipPool.Spec.Ranges, h.ipPoolCache, h.ipPoolClient)
        // ... (excerpt truncated)
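
For illustration only, here is a minimal sketch of the kind of readiness gating described above; the readyAllocator type and its MarkReady/Allocate methods are hypothetical and are not part of the Harvester codebase:

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

// readyAllocator is a hypothetical wrapper that refuses to serve allocations
// until its (possibly slow) initialization has finished.
type readyAllocator struct {
    ready atomic.Bool
    next  int // simplistic stand-in for real IPAM state
}

// MarkReady is called once initialization (for example, restoring the IPs
// already recorded in the IPPool status) has completed.
func (a *readyAllocator) MarkReady() { a.ready.Store(true) }

// Allocate returns an error instead of handing out IPs while the allocator is
// still initializing, so callers can requeue and retry later.
func (a *readyAllocator) Allocate() (string, error) {
    if !a.ready.Load() {
        return "", errors.New("allocator not initialized yet, retry later")
    }
    a.next++
    return fmt.Sprintf("192.168.112.%d", a.next+1), nil
}

func main() {
    a := &readyAllocator{}
    if _, err := a.Allocate(); err != nil {
        fmt.Println("before init:", err) // rejected while still initializing
    }
    a.MarkReady()
    ip, _ := a.Allocate()
    fmt.Println("after init:", ip)
}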
w13915984028 commented 7 months ago

Adding the debug information:

Duplicated IP allocation:

LB:

apiVersion: loadbalancer.harvesterhci.io/v1beta1
kind: LoadBalancer
metadata:
  annotations:
    cloudprovider.harvesterhci.io/service-uuid: 305b4f79-ceff-4fc1-be08-17c740cd24f9
    loadbalancer.harvesterhci.io/namespace: default
    loadbalancer.harvesterhci.io/network: ''
    loadbalancer.harvesterhci.io/project: c-m-kb9nwxh2/p-kfl9f
  creationTimestamp: '2024-01-24T19:18:25Z'
  finalizers:
    - wrangler.cattle.io/harvester-lb-controller
  generation: 10
  labels:
    cloudprovider.harvesterhci.io/cluster: dev
  name: dev-argocd-lb-09a33510
  namespace: default
  resourceVersion: '9013702'
  uid: 0c3e2048-cbd5-4381-997b-189839651835
spec:
  backendServerSelector:
    harvesterhci.io/vmName:
      - dev-pool1-62d36532-2mchw
      - dev-pool1-62d36532-djjzc
      - dev-pool1-62d36532-wcwhq
  ipam: pool
  listeners:
    - backendPort: 30657
      name: http
      port: 80
      protocol: TCP
    - backendPort: 32215
      name: https
      port: 443
      protocol: TCP
status:
  backendServers:
    - 192.168.112.21
    - 192.168.112.22
    - 192.168.112.20
  conditions:
    - lastUpdateTime: '2024-01-24T19:18:35Z'
      message: >-
        allocate ip for lb default/dev-argocd-lb-09a33510 failed, error:
        192.168.112.9 has been allocated to default/dev-argocd-lb-09a33510,
        duplicate allocation is not allowed
      status: 'False'
      type: Ready

IPPool:

apiVersion: loadbalancer.harvesterhci.io/v1beta1
kind: IPPool
metadata:
  creationTimestamp: '2024-01-24T17:00:14Z'
  finalizers:
    - wrangler.cattle.io/harvester-ipam-controller
  generation: 34
  labels:
    loadbalancer.harvesterhci.io/global-ip-pool: 'true'
    loadbalancer.harvesterhci.io/vid: '112'
  managedFields:
    - apiVersion: loadbalancer.harvesterhci.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:ranges: {}
          f:selector:
            .: {}
            f:network: {}
            f:scope: {}
      manager: harvester
      operation: Update
      time: '2024-01-24T19:29:09Z'
    - apiVersion: loadbalancer.harvesterhci.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"[wrangler.cattle.io/harvester-ipam-controller](http://wrangler.cattle.io/harvester-ipam-controller)": {}
        f:status:
          .: {}
          f:allocated:
            .: {}
            f:192.168.112.2: {}
            f:192.168.112.3: {}
            f:192.168.112.4: {}
            f:192.168.112.5: {}
            f:192.168.112.6: {}
            f:192.168.112.7: {}
            f:192.168.112.8: {}
            f:192.168.112.9: {}
          f:available: {}
          f:conditions: {}
          f:lastAllocated: {}
          f:total: {}
      manager: harvester-load-balancer
      operation: Update
      time: '2024-01-24T19:29:09Z'
  name: global-ip-pool
  resourceVersion: '9013591'
  uid: f1cbd5ca-8dcd-4260-992d-cc24f58d276e
spec:
  ranges:
    - gateway: 192.168.112.1
      rangeEnd: 192.168.112.9
      rangeStart: 192.168.112.2
      subnet: 192.168.112.0/24
  selector:
    network: default/k8s
    scope:
      - guestCluster: '*'
        namespace: '*'
        project: '*'
status:
  allocated:
    192.168.112.2: default/dev-argocd-lb-81963e40
    192.168.112.3: default/dev-argocd-lb-98940262
    192.168.112.4: default/dev-argocd-lb-f6583253
    192.168.112.5: default/dev-argocd-lb-55e4ea27
    192.168.112.6: default/dev-argocd-lb-43fe840d
    192.168.112.7: default/dev-argocd-lb-b5a5dc14
    192.168.112.8: default/dev-argocd-lb-18edf10b
    192.168.112.9: default/dev-argocd-lb-09a33510
  available: 0
  conditions:
    - lastUpdateTime: '2024-01-24T17:00:14Z'
      status: 'True'
      type: Ready
  lastAllocated: 192.168.112.9
  total: 8
w13915984028 commented 7 months ago

The IPPool object is in the following situation:

It has allocated entries, but AllocatedHistory is empty, which causes this part of the code to fail to reuse the already-allocated IP:

https://github.com/harvester/load-balancer-harvester/blob/eea7123837920134ca6b4e9106828afb7f8290e7/pkg/ipam/allocator.go#L199

status:
  allocated:
    192.168.112.2: default/dev-argocd-lb-81963e40
    192.168.112.3: default/dev-argocd-lb-98940262
    192.168.112.4: default/dev-argocd-lb-f6583253
    192.168.112.5: default/dev-argocd-lb-55e4ea27
    192.168.112.6: default/dev-argocd-lb-43fe840d
    192.168.112.7: default/dev-argocd-lb-b5a5dc14
    192.168.112.8: default/dev-argocd-lb-18edf10b
    192.168.112.9: default/dev-argocd-lb-09a33510
  available: 0
type IPPoolStatus struct {
    Total int64 `json:"total"`

    Available int64 `json:"available"`

    LastAllocated string `json:"lastAllocated"`
    // +optional
    Allocated map[string]string `json:"allocated,omitempty"`
    // +optional
    AllocatedHistory map[string]string `json:"allocatedHistory,omitempty"`
    // +optional
    Conditions []Condition `json:"conditions,omitempty"`
}
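
To make the failure mode concrete, here is a hedged, self-contained sketch (the poolStatus type and allocate function are simplified stand-ins, not the actual allocator.go code): when AllocatedHistory is empty, the lookup for the LB's previous IP misses, a fresh allocation is attempted, and the duplicate-allocation guard then rejects it because the allocated map already holds that IP for the very same LB.

package main

import "fmt"

// poolStatus is a simplified stand-in for the relevant parts of IPPoolStatus.
type poolStatus struct {
    Allocated        map[string]string // ip -> lb, IPs currently in use
    AllocatedHistory map[string]string // ip -> lb, IPs an lb held previously
}

// allocate sketches the observed behavior: without history, an LB that already
// holds an IP is treated as a brand-new requester, and the duplicate check then
// fires on the LB's own allocation.
func allocate(status poolStatus, lb, candidate string) (string, error) {
    // Reuse path: hand the LB back the IP it held before, if recorded.
    for ip, owner := range status.AllocatedHistory {
        if owner == lb {
            return ip, nil
        }
    }
    // Fresh-allocation path: refuse a candidate IP that is already taken.
    if owner, taken := status.Allocated[candidate]; taken {
        return "", fmt.Errorf("%s has been allocated to %s, duplicate allocation is not allowed", candidate, owner)
    }
    return candidate, nil
}

func main() {
    status := poolStatus{
        Allocated: map[string]string{
            "192.168.112.9": "default/dev-argocd-lb-09a33510",
        },
        AllocatedHistory: map[string]string{}, // empty, as in the status above
    }
    // The same LB asks again (for example after a controller restart); with no
    // history to consult, the candidate it ends up with is its own IP.
    _, err := allocate(status, "default/dev-argocd-lb-09a33510", "192.168.112.9")
    fmt.Println(err) // mirrors the error message shown in the LB condition
}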
w13915984028 commented 7 months ago

@dimepues I guess your cluster was rebooted around 2024-01-26, but the LB and IP pool posted above date from 2024-01-24, so the support bundle does not include the corresponding information.

When you can reproduce this bug, please do so and then generate a new support-bundle file. Thanks. I have some clues and need the support bundle to double-check them.

If your workloads are deployed in a non-default namespace, please remember to add it per https://docs.harvesterhci.io/v1.2/advanced/index#support-bundle-namespaces

For this error

      message: >-
        allocate ip for lb default/dev-argocd-lb-09a33510 failed, error:
        192.168.112.9 has been allocated to default/dev-argocd-lb-09a33510,
        duplicate allocation is not allowed

I have found the root cause and will submit a PR to fix it.

dimepues commented 7 months ago

@w13915984028 - wow, thanks so much. Apologies for the delay in responding; I was away.

w13915984028 commented 4 months ago

The current embedded *allocator.IPAllocator no longer seems to be a good solution for the load balancer: https://github.com/w13915984028/load-balancer-harvester/blob/712f152677cf3a224f9ce5345639c02b85526554/pkg/ipam/allocator.go#L24

(1) It allocates IPs with a single-direction iteration, so released IPs apparently cannot be reused.
(2) There is no good way to initialize it when some IPs have already been allocated; a potential dead loop lurks there (see the sketch below).

We are looking for a better solution.
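
As a hedged illustration of both limitations (a simplified toy model, not the actual IPAllocator code): a forward-only cursor over the range never revisits addresses released behind it, and an unbounded variant of the same search would never terminate once every entry is already marked allocated.

package main

import "fmt"

// toyPool is a simplified, forward-only allocator over the last octet of a
// small range; it is an illustration, not the real Harvester IPAllocator.
type toyPool struct {
    cursor     int          // next octet to try; only ever moves forward
    allocated  map[int]bool // octets currently in use
    start, end int
}

func newToyPool(start, end int) *toyPool {
    return &toyPool{cursor: start, start: start, end: end, allocated: map[int]bool{}}
}

// next hands out the first unused octet at or after the cursor. Because the
// cursor never moves backwards, a released octet behind it is never reused,
// and once the cursor passes the end of the range allocation stops entirely.
func (p *toyPool) next() (int, bool) {
    for p.cursor <= p.end {
        c := p.cursor
        p.cursor++
        if !p.allocated[c] {
            p.allocated[c] = true
            return c, true
        }
    }
    return 0, false
}

// release frees an octet, but nothing rewinds the cursor to pick it up again.
func (p *toyPool) release(octet int) { delete(p.allocated, octet) }

func main() {
    p := newToyPool(2, 9)
    for {
        if _, ok := p.next(); !ok {
            break // range exhausted
        }
    }
    p.release(5) // the fifth address is free again...
    if _, ok := p.next(); !ok {
        // ...but the forward-only cursor can no longer reach it.
        fmt.Println("released IP cannot be reused by a forward-only allocator")
    }
    // Limitation (2) is the mirror image: if startup has to walk the range
    // looking for a free slot while every entry is already recorded as
    // allocated, an unbounded version of the loop in next() never terminates.
}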

harvesterhci-io-github-bot commented 2 months ago

Pre Ready-For-Testing Checklist

harvesterhci-io-github-bot commented 2 months ago

Automation e2e test issue: harvester/tests#1382

w13915984028 commented 2 weeks ago

This issue was caused by a bug: the IPPool allocation could accidentally refuse to allocate an IP, reporting that duplicate allocation is not allowed.

Test plan (Harvester ISO should be built after 2024.08.30):

An LB can be created without requiring any existing VM; just use a selector that points to something. Thus the LB and the IPPool can be tested separately.

(1) Create an IPPool with a range of IPs, e.g. 10 IPs

(2) Quickly create and delete LBs in batches (allocating IPs from the above pool). Each LB should either get an IP or report an error that no IP is available, but must never hit the duplicate allocation is not allowed issue again. A hedged example manifest is sketched below.
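
For instance, a minimal LoadBalancer manifest along the lines of the one shown earlier in this thread could be applied and deleted repeatedly (ten to twenty copies with different names in quick succession) while watching the IPPool status and the LB Ready conditions. The name and selector value below are placeholders, and depending on your IPPool selector you may also need annotations such as loadbalancer.harvesterhci.io/network:

apiVersion: loadbalancer.harvesterhci.io/v1beta1
kind: LoadBalancer
metadata:
  name: test-lb-001            # vary the suffix for each copy
  namespace: default
spec:
  ipam: pool
  backendServerSelector:
    harvesterhci.io/vmName:    # may match nothing; no VM is required
      - placeholder-vm
  listeners:
    - name: http
      port: 80
      protocol: TCP
      backendPort: 30080

Each copy should end up with an address from the pool, or with a clear "no IP available" style error once the pool is exhausted; none should report a duplicate allocation.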