knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Service endpoints are not updated / removed after upgrade to Kubernetes 1.28 #15510

Open mbrancato opened 4 days ago

mbrancato commented 4 days ago

What version of Knative?

0.15.2

Expected Behavior

Endpoints should be updated when a revision scales down or its pods are removed.

Actual Behavior

Endpoints for a service are not updated on scale-down or when pods are deleted. This leaves many stale addresses in the Endpoints object, and the stale addresses propagate to the public service as well.

% kubectl -n detection get endpoints my-app-00112-private
NAME                   ENDPOINTS                                                              AGE
my-app-00112-private   10.32.101.40:9091,10.32.101.41:9091,10.32.101.43:9091 + 5997 more...   136m

% kubectl -n detection get deploy my-app-00112-deployment
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   2/2     2            2           136m
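
For anyone trying to confirm the same mismatch, here is a minimal Python sketch that lists the addresses still present in an Endpoints object with no matching pod. It assumes the two inputs are the parsed `-o json` output for the Endpoints object and for the revision's pod list; the function names are hypothetical, and only the standard `subsets[].addresses[].ip`, `subsets[].notReadyAddresses[].ip`, and `status.podIP` fields are read.

```python
def endpoint_ips(endpoints: dict) -> set:
    """Collect every IP in a v1 Endpoints object (ready and not ready)."""
    ips = set()
    for subset in endpoints.get("subsets") or []:
        for key in ("addresses", "notReadyAddresses"):
            for addr in subset.get(key) or []:
                ips.add(addr["ip"])
    return ips


def pod_ips(pod_list: dict) -> set:
    """Collect the IPs of pods that currently have one assigned."""
    return {
        pod["status"]["podIP"]
        for pod in pod_list.get("items") or []
        if pod.get("status", {}).get("podIP")
    }


def stale_ips(endpoints: dict, pod_list: dict) -> set:
    """IPs listed in the Endpoints object that no existing pod owns."""
    return endpoint_ips(endpoints) - pod_ips(pod_list)
```

The inputs could come from `kubectl -n detection get endpoints my-app-00112-private -o json` and from a pod list for the same revision. Selecting those pods with `-l serving.knative.dev/revision=my-app-00112` is an assumption about the labels in use.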

I was able to get logs like this from SKS:

{
apiVersion: "v1"
eventTime: null
involvedObject: {
apiVersion: "networking.internal.knative.dev/v1alpha1"
kind: "ServerlessService"
name: "my-app-00112"
namespace: "detection"
resourceVersion: "6779758389"
uid: "f6ed0598-0171-43ff-bf7a-c45069fdcbe2"
}
kind: "Event"
lastTimestamp: "2024-09-14T15:38:13Z"
message: "SKS: my-app-00112 does not own Service: my-app-00112-private"
metadata: {
creationTimestamp: "2024-09-14T15:38:13Z"
managedFields: [1]
name: "my-app-00112.17f5266fbfda92c2"
namespace: "detection"
resourceVersion: "3317050884"
uid: "20dcc671-4abb-490c-aff8-7404dfdf8063"
}
reason: "InternalError"
reportingComponent: "serverlessservice-controller"
reportingInstance: ""
source: {
component: "serverlessservice-controller"
}
type: "Warning"
}
logName: "projects/my-project-92384924/logs/events"
receiveTimestamp: "2024-09-14T15:38:13.778779952Z"
resource: {
labels: {
cluster_name: "my-cluster-192132"
location: "us-central1-c"
project_id: "my-project-92384924"
}
type: "k8s_cluster"
}
severity: "WARNING"
timestamp: "2024-09-14T15:38:13Z"
}
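
The "does not own Service" message reads like a failed controller-ownership check. As a rough way to inspect that by hand, the sketch below mirrors the usual Kubernetes rule (the owner reference marked `controller: true` must carry the owner's UID) on parsed `-o json` objects. That the SKS reconciler applies exactly this rule is an assumption, and the function names are hypothetical.

```python
def controller_ref(obj: dict):
    """Return the ownerReference flagged as controller, or None."""
    for ref in obj.get("metadata", {}).get("ownerReferences") or []:
        if ref.get("controller"):
            return ref
    return None


def is_controlled_by(obj: dict, owner: dict) -> bool:
    """True if obj's controller reference points at owner's UID."""
    ref = controller_ref(obj)
    owner_uid = owner.get("metadata", {}).get("uid")
    return ref is not None and owner_uid is not None and ref.get("uid") == owner_uid
```

Feeding it the private Service and the ServerlessService (both fetched with `-o json`) would show whether the Service's controller reference still matches the SKS UID seen in the event.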

Steps to Reproduce the Problem

This happens with every ksvc of ours that scales up and then down, or has pods removed (via delete or evict).

mbrancato commented 3 days ago

I'm pretty sure this is an upstream bug, and have opened this: https://github.com/kubernetes/kubernetes/issues/127370

In the SKS update process, the private service's Endpoints are what feed SKS. Is there any plan to read from EndpointSlices (stable since 1.21) and move away from the legacy Endpoints? From the docs:

The EndpointSlice API is the recommended replacement for Endpoints.
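
To illustrate what reading from EndpointSlices could look like, here is a minimal Python sketch that gathers the ready addresses across all slices of a service. It assumes the input is a parsed EndpointSlice list in `discovery.k8s.io/v1` form; the function name is hypothetical, and this is not how SKS is implemented.

```python
def ready_addresses(slice_list: dict) -> set:
    """Union of addresses from endpoints not explicitly marked unready."""
    ips = set()
    for endpoint_slice in slice_list.get("items") or []:
        for endpoint in endpoint_slice.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            # In discovery.k8s.io/v1 an unset `ready` means "unknown",
            # which consumers should generally treat as ready.
            if conditions.get("ready") is False:
                continue
            ips.update(endpoint.get("addresses") or [])
    return ips
```

The slices that belong to a service carry the `kubernetes.io/service-name` label, so the input could come from `kubectl -n detection get endpointslices -l kubernetes.io/service-name=my-app-00112-private -o json`.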

ReToCode commented 1 day ago

Yep, this seems to be the upstream issue, so there is not much we can do here. For EndpointSlices, check the discussion here.