kmesh-net / kmesh

High Performance ServiceMesh Data Plane Based on Programmable Kernel
https://kmesh.net
Apache License 2.0

Kmesh Logs Errors and Crashes After Deploying 165 ServiceEntries #1023

Open tmodak27 opened 2 weeks ago

tmodak27 commented 2 weeks ago

Motivation:

A limit of 165 ServiceEntries is lower than expected. Our production use case requires support for a very large number of services, ServiceEntries, and pods.

Environment Details:

Kubernetes: 1.28
OS: openEuler 23.03
Istio: 1.19
Kmesh version: release 0.5
CPU: 8
Memory: 16 GiB

Steps To Reproduce

Save the following manifest as service-entry.yaml:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: foo-service
  namespace: default
spec:
  hosts:
  - foo-service.somedomain # not used
  addresses:
  - 192.192.192.192/24 # VIPs
  ports:
  - number: 27018
    name: foo-service
    protocol: HTTP
  location: MESH_INTERNAL
  resolution: STATIC
  endpoints: # 1 endpoint per service entry. Adjust depending on your test.
  - address: 2.2.2.2
Then apply 165 uniquely named copies:

$ for i in $(seq 1 165); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g" service-entry.yaml | kubectl apply -f -; done
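As a quick check (not part of the original steps), the number of ServiceEntries actually applied when Kmesh crashes can be confirmed with:

kubectl get serviceentries -n default --no-headers | wc -l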

What was observed

After the number of ServiceEntries hit 165, Kmesh started logging the below error (see attachment) and crashed.

service-entry-error.txt

Note: after repeating the test several times, the error message was sometimes different: malloc(): invalid next size

hzxuzhonghu commented 2 weeks ago

cc @nlgwcy @lec-bit

nlgwcy commented 2 weeks ago

There may be other model limitations. We'll check.

lec-bit commented 2 weeks ago

This is the same issue as https://github.com/kmesh-net/kmesh/issues/941: the maximum value size of the inner_map is 1300, which is why this occurred. When we create 163 virtualHosts in one routeConfig, the pointer array exceeds it (163 * sizeof(ptr) > 1300). The problem can be avoided by manually increasing the maximum inner_map value size in kmesh.json.
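For illustration (not part of the original comment), a minimal sketch of the arithmetic, assuming 8-byte pointers on a 64-bit host:

# 163 virtualHost pointers at 8 bytes each no longer fit in the 1300-byte inner_map value
echo $((163 * 8))   # prints 1304, which is > 1300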

tmodak27 commented 2 weeks ago

Maximum Endpoints and Services Supported by Kmesh

We modified the command so that every ServiceEntry is deployed on a separate port, giving each RouteConfig a single VirtualHost. Below are the two scenarios we tested.

Scenario 1: 1 endpoint (minimum possible) per ServiceEntry

Steps

Save the same manifest as above as service-entry-1.yaml:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: foo-service
  namespace: default
spec:
  hosts:
  - foo-service.somedomain # not used
  addresses:
  - 192.192.192.192/24 # VIPs
  ports:
  - number: 27018
    name: foo-service
    protocol: HTTP
  location: MESH_INTERNAL
  resolution: STATIC
  endpoints: # 1 endpoint per service entry. Adjust depending on your test.
  - address: 2.2.2.2
Then apply 1100 copies, each on a distinct port:

for i in $(seq 1 1100); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g;s/27018/$i/g" service-entry-1.yaml | kubectl apply -f -; done
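To catch the errors as they appear, the Kmesh daemon logs can be followed while the loop runs; the namespace and label selector below are assumptions and may differ per installation:

# namespace and label selector are assumptions; adjust to your deployment
kubectl logs -n kmesh-system -l app=kmesh -f | grep "flush failed"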

Results

The following errors were observed once slightly more than 1,000 ServiceEntries had been deployed.

time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_943 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_48 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_117 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_138 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_383 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_603 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_739 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_786 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_79 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_354 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_591 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_675 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_729 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2
time="2024-11-08T18:41:39Z" level=error msg="listener 0.0.0.0_816 NONE flush failed: ListenerUpdate deserial_update_elem failed" subsys=cache/v2

Why is this an issue?

Our use case needs to support a higher number of endpoints, and roughly 1,000 services with one endpoint each is far lower than the theoretical maximum of 100,000 endpoints and 5,000 services.

Scenario 2: 150 endpoints (maximum possible) per ServiceEntry

Steps

Using service-entry-1.yaml padded to 150 endpoints per ServiceEntry (see the sketch below):

for i in $(seq 1 600); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g;s/27018/$i/g" service-entry-1.yaml | kubectl apply -f -; done
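Since the 150-endpoint manifest is not shown above, this is one way it might be generated from the single-endpoint manifest; the 2.2.2.x addresses are placeholders, not the values used in the actual test:

# sketch: pad service-entry-1.yaml from 1 to 150 placeholder endpoints (2.2.2.3 .. 2.2.2.151)
for j in $(seq 3 151); do
  echo "  - address: 2.2.2.$j" >> service-entry-1.yaml
done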

Results

The following errors were observed at approximately 500 services (75,000 endpoints in total).

error-logs-max-pods.txt

Why is this an issue?

Our use case needs to support a higher number of endpoints, and 75,000 endpoints is lower than the theoretical maximum of 100,000 endpoints and 5,000 services.