kmesh-net / kmesh

High Performance ServiceMesh Data Plane Based on Programmable Kernel
https://kmesh.net
Apache License 2.0

Maximum Number of Services and Pods Supported #941

Open · tmodak27 opened this issue 1 week ago

tmodak27 commented 1 week ago

**Motivation:** Our production use case requires support for a very large number of services and instances.

**1. What we did**

Environment Details:

We started scaling up in batches of 1,000 services using the yaml file (svc.yaml) and the command below.


- Scaling-up command:

```sh
$ for i in $(seq 1 1000); do sed "s/foo-service/foo-service-0-$(date +%s-%N)/g" svc.yaml | kubectl apply -f -; done
```
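For reference, svc.yaml itself is not attached above. A minimal sketch of what it might look like, assuming a plain ClusterIP Service named foo-service (the name the sed command rewrites); the selector and port values are purely illustrative:

```yaml
# Hypothetical svc.yaml for the scaling loop above.
# Only the name "foo-service" is required by the sed substitution;
# the selector and port values are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: foo-service
spec:
  selector:
    app: foo
  ports:
  - name: http
    port: 80
    targetPort: 8080
```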



**2. What we observed**

At around **16k services**, Kmesh started emitting error logs (attached):
[kmesh_error_logs.txt](https://github.com/user-attachments/files/17344848/kmesh_error_logs.txt)

**3. Why we think this is an issue:** 

For Kmesh to be suitable for our use case, we need support for a much larger number of services and instances (50K+).
hzxuzhonghu commented 1 week ago

@lec-bit Did you figure it out?

lec-bit commented 5 days ago

I got the same errors after 16k services. We designed it based on 5,000 services and 100k (10w) pods: https://github.com/kmesh-net/kmesh/issues/318#issuecomment-2114550669

hzxuzhonghu commented 4 days ago

Which error did we hit first?

tmodak27 commented 1 day ago

> Which error did we hit first?

Here are the logs:

kmesh_error_logs.txt

tmodak27 commented 1 day ago

**Load Test for Maximum Number of Pods**

We performed a load test using pilot-load to determine the maximum number of pods Kmesh can handle.

**Observations**

  1. After running the load test with 400 pods, the Kmesh logs showed a `malloc(): invalid next size` error (one way to pull the logs is sketched below this list).
  2. After running the test with 100 pods, Kmesh did not log any errors, but the bpf map contained entries for only 35 of the 100 pods that were deployed.
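For the log check, a minimal sketch of how the daemon logs can be pulled and grepped for the allocator error; the kmesh-system namespace and the app=kmesh label are assumptions based on a default Kmesh install and may differ in your cluster:

```sh
# Pull the full logs of the Kmesh daemon pods and look for allocator errors.
# Namespace and label selector are assumptions for a default Kmesh deployment.
kubectl logs -n kmesh-system -l app=kmesh --tail=-1 | grep -iE "malloc|invalid next size"
```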

**Environment**

**Steps to Reproduce the Error for 400 Pods**

  1. Make sure Kmesh release-0.5 and Istio are running in your cluster.
  2. Clone the pilot-load repo. Since Istio is already running in your cluster, make sure you delete lines 42 to 47 from the deploy.sh script.
  3. Follow the steps under Getting Started to set up pilot-load on your cluster.
  4. Once pilot-load is set up, create a configmap for 400 pods using the config file and command below.
```yaml
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 400
nodes:
- name: node
  count: 5
```

```sh
kubectl create configmap config-400-pod -n pilot-load --from-file=config.yaml=svc.yaml --dry-run=client -oyaml | kubectl apply -f -
```
  5. In the file load-deployment.yaml, set volumes.name.configMap.name to config-400-pod (a sketch of the resulting stanza is shown after this list), then run kubectl apply -f load-deployment.yaml. It will take 1 to 2 minutes for all the mock pods to get deployed.
  6. Check the logs of the Kmesh pod running on your actual Kubernetes node (Kmesh pods on mock nodes get stuck in the Pending state, which is expected since those nodes are mocked). You will see the error below.
(attached screenshot: kmesh-invalid-next-size)
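For step 5, the relevant part of load-deployment.yaml should end up looking roughly like the snippet below. The volume name "config" and the surrounding layout are assumptions about pilot-load's manifest; the only point that matters is that configMap.name references the configmap created in step 4:

```yaml
# Hypothetical excerpt of load-deployment.yaml after the edit in step 5.
# Only configMap.name matters; the volume name "config" is an assumption.
volumes:
- name: config
  configMap:
    name: config-400-pod
```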

**Steps to Reproduce the Error for 100 Pods**

  1. Perform steps 1-6 from the previous section, but create the configmap from the config.yaml below (100 pods) and reference that configmap in load-deployment.yaml.
```yaml
nodeMetadata: {}
jitter:
  workloads: "110ms"
  config: "0s"
namespaces:
- name: foo
  replicas: 1
  applications:
  - name: foo
    replicas: 1
    instances: 100
nodes:
- name: node
  count: 3
```
  2. Kmesh won't log any errors, but the bpf map will not contain all of the pods that were deployed (in our test only 35 of the 100 showed up; one way to count the map entries is sketched below this list).
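For the bpf map check, one possible way to count entries from inside the Kmesh daemon pod is sketched below. The kmesh-system namespace, the app=kmesh label, and the availability of bpftool inside the daemon image are all assumptions (bpftool can also be run directly on the node host), and the id of the map that holds the workload/endpoint records has to be picked out of the `bpftool map show` listing first:

```sh
# Grab the name of the Kmesh daemon pod on the real node (namespace/label are assumptions).
KMESH_POD=$(kubectl get pods -n kmesh-system -l app=kmesh -o jsonpath='{.items[0].metadata.name}')

# List the BPF maps visible to the daemon, then dump the relevant one and count its entries.
# Replace <MAP_ID> with the id of the map that holds the workload/endpoint records.
kubectl exec -n kmesh-system "$KMESH_POD" -- bpftool map show
kubectl exec -n kmesh-system "$KMESH_POD" -- bpftool map dump id <MAP_ID> | grep -c "key:"
```

Comparing that count against the number of deployed pods gives the kind of 35-of-100 gap described above.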

Edit: Both of the above tests were repeated multiple times with the deployment order changed, i.e. deploying the mock pods first and Kmesh second. Here is what we observed:

  1. We still got the malloc error for 400 pods.
  2. We still got missing entries in the bpf map, but the number of missing entries was different in each rerun.
  3. The malloc error is sometimes worded differently. In most cases we get `malloc(): invalid next size (unsorted)`, but in a few cases we also get `malloc(): mismatching next->prev_size (unsorted)`.

**Attachments**

kmesh-invalid-next-size_logs.txt

hzxuzhonghu commented 16 hours ago

@nlgwcy @hzxuzhonghu This seems like a critical bug; can you take some time to look into the root cause?