Closed hzxuzhonghu closed 3 weeks ago
10.96.239.171 is the previous waypoint svc cluster IP, but it no longer exists.
I hit the same issue. It seems that when the waypoint is uninstalled, the kmesh_backend bpf map is not updated, so the next access to the backend still gets the old waypoint address.
The last 8 bytes represent the waypoint addr and port.
Good. I could not reproduce it easily afterwards.
BTW, I have added support for dumping workloads, which may be used to check the workload configs.
Reproduced now:
steps:
python-226991 [000] d...1 3387970.549843: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.165.244:9080]
python-226991 [000] d...1 3387970.549851: bpf_trace_printk: [KMESH] DEBUG: bpf find frontend addr=[10.96.165.244:9080]
python-226991 [000] d...1 3387970.549854: bpf_trace_printk: [KMESH] DEBUG: origin addr=[10.96.119.44:15008] // This is the stale service cluster IP
But with the newly added dump, the userspace waypoint is already updated
k exec -ti kmesh-ptwtd -n kmesh-system -- curl 127.0.0.1:15200/debug/config_dump/workload
{
  "name": "reviews",
  "namespace": "default",
  "hostname": "reviews.default.svc.cluster.local",
  "vips": [
    "/10.96.165.244"
  ],
  "ports": [
    {
      "service_port": 9080,
      "target_port": 9080
    }
  ],
  "loadBalancer": null,
  "waypoint": {
    "destination": "/10.96.126.207"
  }
},
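To make the mismatch easy to check, here is a rough sketch that pulls the waypoint IP out of one dumped entry and compares it with the address printed in the bpf trace log. The field names come from the dump above; the type and function names (dumpedService, waypointIP, isStale) are made up for illustration and are not part of Kmesh.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strings"
)

// dumpedService models only the fields of the config dump entry that
// are needed here. The type name is hypothetical.
type dumpedService struct {
	Name     string `json:"name"`
	Waypoint *struct {
		Destination string `json:"destination"`
	} `json:"waypoint"`
}

// waypointIP returns the waypoint IP of one dumped entry, trimming the
// leading "/" seen in the dump (for example "/10.96.126.207").
// It returns an empty string when the entry has no waypoint.
func waypointIP(entry []byte) (string, error) {
	var s dumpedService
	if err := json.Unmarshal(entry, &s); err != nil {
		return "", err
	}
	if s.Waypoint == nil {
		return "", nil
	}
	return strings.TrimPrefix(s.Waypoint.Destination, "/"), nil
}

// isStale reports whether the "ip:port" address seen in the bpf trace
// differs from the waypoint IP known to userspace.
func isStale(traceAddr, userspaceIP string) bool {
	host, _, err := net.SplitHostPort(traceAddr)
	if err != nil {
		// An unparsable address cannot match the userspace value.
		return true
	}
	return host != userspaceIP
}

func main() {
	entry := []byte(`{"name":"reviews","waypoint":{"destination":"/10.96.126.207"}}`)
	ip, err := waypointIP(entry)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("userspace waypoint:", ip)
	fmt.Println("bpf map stale:", isStale("10.96.119.44:15008", ip))
}
```

With the values from this thread, the trace address 10.96.119.44 does not match the dumped 10.96.126.207, which is exactly the symptom described.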
The weird part is why the bpf map is not updated.
I figured out the cause: we do not update the service map when a service is updated.
serviceId := p.hashName.StrToNum(serviceName)
sk.ServiceId = serviceId
// if service has exist, just need update frontend port info
if err = p.bpf.ServiceLookup(&sk, &sv); err == nil {
	// update: delete then store
	if err = p.deleteFrontendData(serviceId); err != nil {
		log.Errorf("deleteFrontendData failed: %s", err)
		return err
	}
	if err = p.storeServiceFrontendData(serviceId, service); err != nil {
		log.Errorf("storeServiceFrontendData failed, err:%s", err)
		return err
	}
}
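A minimal sketch of the fix idea, not the actual Kmesh patch: on an update, the service entry itself has to be rewritten in addition to the frontend data, otherwise the old waypoint stays in the map. Plain Go maps stand in for the bpf maps here, and all names (processor, serviceValue, storeService) and the port value are assumptions for illustration.

```go
package main

import "fmt"

// serviceValue is a hypothetical stand-in for the value stored in the
// service bpf map; the real Kmesh struct may look different.
type serviceValue struct {
	WaypointAddr string
	WaypointPort uint32
}

// processor mimics the userspace writer. Plain Go maps stand in for
// the bpf maps so the sketch runs anywhere.
type processor struct {
	serviceMap  map[uint32]serviceValue // serviceId -> service value
	frontendMap map[string]uint32       // vip -> serviceId
}

// storeService refreshes the frontend entries and always rewrites the
// service value, so a changed waypoint replaces the stale one.
func (p *processor) storeService(id uint32, vips []string, sv serviceValue) {
	if _, exists := p.serviceMap[id]; exists {
		// Update path: drop the old frontend entries of this service.
		for vip, sid := range p.frontendMap {
			if sid == id {
				delete(p.frontendMap, vip)
			}
		}
	}
	for _, vip := range vips {
		p.frontendMap[vip] = id
	}
	// The step that the snippet above skips for existing services.
	p.serviceMap[id] = sv
}

func main() {
	p := &processor{
		serviceMap:  map[uint32]serviceValue{},
		frontendMap: map[string]uint32{},
	}
	// First store with the old waypoint, then update with the new one.
	p.storeService(1, []string{"10.96.165.244"}, serviceValue{"10.96.119.44", 15008})
	p.storeService(1, []string{"10.96.165.244"}, serviceValue{"10.96.126.207", 15008})
	fmt.Println(p.serviceMap[1].WaypointAddr)
}
```

Writing the service value unconditionally keeps the create and update paths identical, which avoids the case where only one of them knows about the waypoint field.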
@bfforever Can you help fix it
okay.
/assign
When a waypoint is deleted, we do not update the map entry of the service that uses the waypoint either.
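A rough sketch of what handling the deletion could look like, assuming userspace keeps a per-service record of its waypoint address. The types and the function clearWaypoint are hypothetical; in Kmesh the changed services would then have to be written back to the bpf map.

```go
package main

import "fmt"

// service is a hypothetical userspace record of a service and the
// waypoint it uses. An empty Waypoint means no waypoint.
type service struct {
	Name     string
	Waypoint string
}

// clearWaypoint removes the given waypoint address from every service
// that still points at it and returns the names of the changed
// services. The order of the returned names is not defined because Go
// map iteration order is random.
func clearWaypoint(services map[uint32]*service, waypointAddr string) []string {
	var changed []string
	for _, svc := range services {
		if svc.Waypoint == waypointAddr {
			svc.Waypoint = ""
			changed = append(changed, svc.Name)
		}
	}
	return changed
}

func main() {
	services := map[uint32]*service{
		1: {Name: "reviews", Waypoint: "10.96.119.44"},
		2: {Name: "ratings", Waypoint: ""},
	}
	// The waypoint at 10.96.119.44 was deleted.
	changed := clearWaypoint(services, "10.96.119.44")
	fmt.Println("services to rewrite in the bpf map:", changed)
}
```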
Reproduced now:
steps:
- create a waypoint for the svc
- test that it works as expected
- delete the gateway; the waypoint is deleted
- create the gateway again; a new waypoint is created
- test service access again. Now I can see from the bpf trace log that the traffic is still routed to the old waypoint svc
Could you provide an example of how to use a waypoint proxy for a specific service? I can only find waypoints used for a specific ServiceAccount or namespace, so currently I cannot reproduce your situation.
I tested it with Istio 1.22, where the usage of waypoints has changed: https://istio.io/latest/docs/ambient/usage/waypoint/#configure-a-service-to-use-a-specific-waypoint
serviceId := p.hashName.StrToNum(serviceName)
If a hash conflict occurs, serviceId may differ from the old one.
As discussed, we need a stable str -> id conversion algorithm.
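One possible shape for such a conversion, as a sketch only: remember every name that already has an id, and resolve collisions by probing for the next free id, so an existing name never changes its id. The type idAllocator and its method idFor are hypothetical and are not the Kmesh hashName implementation; the ids are stable only within one process, and persisting them across restarts is not covered.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// idAllocator hands out ids that stay stable for a given name, even
// when two names hash to the same value.
type idAllocator struct {
	nameToID map[string]uint32
	idToName map[uint32]string
}

func newIDAllocator() *idAllocator {
	return &idAllocator{
		nameToID: map[string]uint32{},
		idToName: map[uint32]string{},
	}
}

// idFor returns the id already assigned to name, or assigns a new one.
// The FNV-1a hash is only the starting point; on a collision the next
// free id is taken by linear probing.
func (a *idAllocator) idFor(name string) uint32 {
	if id, ok := a.nameToID[name]; ok {
		return id
	}
	h := fnv.New32a()
	h.Write([]byte(name))
	id := h.Sum32()
	for {
		if _, taken := a.idToName[id]; !taken {
			break
		}
		id++ // wraps around on overflow
	}
	a.nameToID[name] = id
	a.idToName[id] = name
	return id
}

func main() {
	a := newIDAllocator()
	first := a.idFor("default/reviews")
	again := a.idFor("default/reviews")
	fmt.Println(first == again)
}
```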
This is not the hash issue; our scale is not that large.
What happened:
Traffic broken after waypoint deleted
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment: