kmesh-net / kmesh

High Performance ServiceMesh Data Plane Based on Programmable Kernel
https://kmesh.net
Apache License 2.0
362 stars 46 forks

After delete waypoint the traffic is still routed to waypoint #406

Closed hzxuzhonghu closed 3 weeks ago

hzxuzhonghu commented 1 month ago

What happened:

Traffic is broken after the waypoint is deleted.


            curl-3939777 [001] d...1 3011670.255015: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.41.9:9080]

            curl-3939777 [001] d...1 3011670.255027: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.41.9:9080]

           <...>-3939784 [001] d...1 3011670.258514: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.32.78:9080]

           <...>-3939784 [001] d...1 3011670.258521: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.32.78:9080]

          python-3939784 [000] d...1 3011670.262110: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.165.244:9080]

          python-3939784 [000] d...1 3011670.262116: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.165.244:9080]

          python-3939784 [000] d...1 3011670.262119: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.239.171:15008]

          python-3939784 [001] d...1 3011673.267123: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.165.244:9080]

          python-3939784 [001] d...1 3011673.267131: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.165.244:9080]

          python-3939784 [001] d...1 3011673.267134: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.239.171:15008]

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

hzxuzhonghu commented 1 month ago

10.96.239.171 is the cluster IP of the previous waypoint svc, which no longer exists.

bfforever commented 4 weeks ago

I have hit the same issue. It seems that when the waypoint is uninstalled, the kmesh_backend bpf map is not updated, so the next access to the backend still gets the waypoint address. In the attached screenshot, the last 8 bytes represent the waypoint addr and port.

hzxuzhonghu commented 4 weeks ago

Good, I could not easily reproduce it afterwards.

hzxuzhonghu commented 4 weeks ago

BTW, I have added support for dumping workloads, which may be used to check the workload configs.

hzxuzhonghu commented 4 weeks ago

Reproduced now:

steps:

  1. Create a waypoint for the svc
  2. Test that it works as expected
  3. Delete the gateway; the waypoint is deleted
  4. Create the gateway again; a waypoint is created
  5. Test service access again. Now I can see from the bpf trace log that the traffic is still routed to the old waypoint svc:
          python-226991  [000] d...1 3387970.549843: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.165.244:9080]

          python-226991  [000] d...1 3387970.549851: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.165.244:9080]

          python-226991  [000] d...1 3387970.549854: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.119.44:15008]   // This is the stale service cluster IP

But according to the newly added dump, the waypoint held in userspace has already been updated:

k exec -ti kmesh-ptwtd   -n kmesh-system -- curl 127.0.0.1:15200/debug/config_dump/workload

        {
            "name": "reviews",
            "namespace": "default",
            "hostname": "reviews.default.svc.cluster.local",
            "vips": [
                "/10.96.165.244"
            ],
            "ports": [
                {
                    "service_port": 9080,
                    "target_port": 9080
                }
            ],
            "loadBalancer": null,
            "waypoint": {
                "destination": "/10.96.126.207"
            }
        },
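The mismatch described above (trace log shows one waypoint address, the userspace dump shows another) can be checked mechanically. The following is only a minimal sketch: the helper names `waypointIP` and `isStale` are hypothetical, the struct covers just the fields visible in the dump entry above, and it assumes the destination is printed with a leading network prefix followed by `/`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serviceDump mirrors only the fields of the dump entry that this check needs.
// It is a hypothetical helper type, not a kmesh type.
type serviceDump struct {
	Name     string `json:"name"`
	Waypoint *struct {
		Destination string `json:"destination"`
	} `json:"waypoint"`
}

// waypointIP returns the waypoint address recorded in userspace for one dumped
// service entry, or "" if the entry has no waypoint.
func waypointIP(entry []byte) (string, error) {
	var s serviceDump
	if err := json.Unmarshal(entry, &s); err != nil {
		return "", err
	}
	if s.Waypoint == nil {
		return "", nil
	}
	// Assumption: the address is printed as "<network>/<ip>", with an empty
	// network part in the dump shown in this thread.
	dest := s.Waypoint.Destination
	if i := strings.LastIndex(dest, "/"); i >= 0 {
		dest = dest[i+1:]
	}
	return dest, nil
}

// isStale reports whether the "ip:port" seen in the bpf trace log points at a
// different host than the waypoint address held in userspace.
func isStale(tracedAddr, userspaceIP string) bool {
	host := tracedAddr
	if i := strings.LastIndex(tracedAddr, ":"); i >= 0 {
		host = tracedAddr[:i]
	}
	return host != userspaceIP
}

func main() {
	entry := []byte(`{"name":"reviews","waypoint":{"destination":"/10.96.126.207"}}`)
	ip, err := waypointIP(entry)
	if err != nil {
		panic(err)
	}
	// Compare against the address from the trace log in this comment.
	fmt.Println(ip, isStale("10.96.119.44:15008", ip))
}
```

With the values from this comment, the traced address 10.96.119.44 does not match the dumped 10.96.126.207, which is consistent with the bpf side holding stale data.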
hzxuzhonghu commented 4 weeks ago

And the weird part is why the bpf map is not updated.

hzxuzhonghu commented 4 weeks ago

I figured out the cause: we did not update the service map when a service was updated

    serviceId := p.hashName.StrToNum(serviceName)
    sk.ServiceId = serviceId
    // if service has exist, just need update frontend port info
    if err = p.bpf.ServiceLookup(&sk, &sv); err == nil {
        // update: delete then store
        if err = p.deleteFrontendData(serviceId); err != nil {
            log.Errorf("deleteFrontendData failed: %s", err)
            return err
        }
        if err = p.storeServiceFrontendData(serviceId, service); err != nil {
            log.Errorf("storeServiceFrontendData failed, err:%s", err)
            return err
        }
    } 
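To make the gap concrete, here is a rough, self-contained sketch of an update path that rewrites the service value as well as the frontend entries. It is not the actual fix: plain Go maps stand in for the bpf maps, and `serviceValue`, `serviceSpec`, `fakeMaps` and `storeService` are hypothetical names invented for illustration.

```go
package main

import "fmt"

// serviceValue is a hypothetical stand-in for the value kept in the service map.
type serviceValue struct {
	EndpointCount uint32
	WaypointAddr  string // empty means the service has no waypoint
	WaypointPort  uint32
}

// serviceSpec is a hypothetical view of the service config received from the control plane.
type serviceSpec struct {
	VIPs         []string
	WaypointAddr string
	WaypointPort uint32
}

// fakeMaps replaces the real bpf maps with ordinary Go maps for this sketch.
type fakeMaps struct {
	service  map[uint32]serviceValue
	frontend map[string]uint32 // VIP -> service ID
}

// storeService handles both creation and update of a service.
func (m *fakeMaps) storeService(id uint32, spec serviceSpec) {
	old, exists := m.service[id]
	if exists {
		// Update: drop the old frontend entries before storing the new ones.
		for vip, sid := range m.frontend {
			if sid == id {
				delete(m.frontend, vip)
			}
		}
	}
	for _, vip := range spec.VIPs {
		m.frontend[vip] = id
	}
	// The step missing in the snippet above: rewrite the service value too,
	// so a changed or removed waypoint reaches the data plane.
	m.service[id] = serviceValue{
		EndpointCount: old.EndpointCount, // endpoints are managed elsewhere; keep the count
		WaypointAddr:  spec.WaypointAddr,
		WaypointPort:  spec.WaypointPort,
	}
}

func main() {
	m := &fakeMaps{service: map[uint32]serviceValue{}, frontend: map[string]uint32{}}
	m.storeService(1, serviceSpec{VIPs: []string{"10.96.165.244"}, WaypointAddr: "10.96.119.44", WaypointPort: 15008})
	m.storeService(1, serviceSpec{VIPs: []string{"10.96.165.244"}, WaypointAddr: "10.96.126.207", WaypointPort: 15008})
	fmt.Println(m.service[1].WaypointAddr)
}
```

In this sketch, an update that carries no waypoint goes through the same path and clears the stored address, so the old waypoint cannot linger after it is removed from the service.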
hzxuzhonghu commented 4 weeks ago

@bfforever Can you help fix it

bfforever commented 4 weeks ago

okay.

bfforever commented 4 weeks ago

/assign

hzxuzhonghu commented 4 weeks ago

When a waypoint is deleted, we also do not update the map entry of the service that uses the waypoint.

bfforever commented 3 weeks ago

Could you provide an example of how to set up a waypoint proxy for a specific service? I could only find waypoints used for a specific ServiceAccount or namespace, so currently I cannot reproduce your situation.

hzxuzhonghu commented 3 weeks ago

I tested it with Istio 1.22, where the usage of waypoints has changed: https://istio.io/latest/docs/ambient/usage/waypoint/#configure-a-service-to-use-a-specific-waypoint

nlgwcy commented 3 weeks ago

serviceId := p.hashName.StrToNum(serviceName)

If a hash conflict occurs, serviceId may differ from the old one. As discussed, we need a stable str -> id conversion algorithm.
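One possible shape for a stable conversion, purely as an illustration: remember every assignment, and resolve collisions by probing for the next free ID, so a name keeps its ID no matter which other names arrive later. The type `idAllocator` and its methods are hypothetical and are not the existing `hashName` implementation; FNV-1a is chosen here only because it is in the standard library.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// idAllocator maps names to numeric IDs and remembers each assignment, so the
// same name always yields the same ID while the allocator lives.
type idAllocator struct {
	hash   func(string) uint32 // injectable so collisions can be forced in tests
	byName map[string]uint32
	byID   map[uint32]string
}

// fnv32a hashes a string with 32-bit FNV-1a.
func fnv32a(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newIDAllocator(hash func(string) uint32) *idAllocator {
	return &idAllocator{hash: hash, byName: map[string]uint32{}, byID: map[uint32]string{}}
}

// strToNum returns the ID for name, allocating one on first use.
func (a *idAllocator) strToNum(name string) uint32 {
	if id, ok := a.byName[name]; ok {
		return id // already assigned: stable regardless of later collisions
	}
	id := a.hash(name)
	// Linear probing: if the hashed ID belongs to another name, take the next free one.
	for {
		if _, taken := a.byID[id]; !taken {
			break
		}
		id++
	}
	a.byName[name] = id
	a.byID[id] = name
	return id
}

// release frees the ID of a deleted name so it can be reused.
func (a *idAllocator) release(name string) {
	if id, ok := a.byName[name]; ok {
		delete(a.byName, name)
		delete(a.byID, id)
	}
}

func main() {
	a := newIDAllocator(fnv32a)
	first := a.strToNum("default/reviews")
	second := a.strToNum("default/reviews")
	fmt.Println(first == second)
}
```

The assignments here live only in memory; keeping them stable across restarts would need some form of persistence, which this sketch leaves out.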

hzxuzhonghu commented 3 weeks ago

It is not the hash issue; our scale is not that large.

hzxuzhonghu commented 3 weeks ago

@bfforever https://github.com/kmesh-net/kmesh/compare/main...hzxuzhonghu:fix-bpf-map-update?expand=1