numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

Endurance testing for batch map #1817

Status: Closed (kohlisid closed this issue 3 months ago)

kohlisid commented 3 months ago
Running an endurance test on a pipeline with a constant load of 50k messages per second for 5 days. Attached is the read rate for the vertex.

[Screenshot 2024-07-16 at 8:37:38 AM: read rate for the vertex]

kohlisid commented 3 months ago

Pipeline spec used

apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: simple-pipeline
spec:
  vertices:
    - name: in
      scale:
        min: 10
        max: 10
      source:
        # A self data generating source
        generator:
          msgSize: 500
          rpu: 5000
          duration: 1s
          value: 100
      containerTemplate:
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 4Gi
    - name: batch-cat
      metadata:
        annotations:
          numaflow.numaproj.io/batch-map: "true"
      partitions: 12
      scale:
        min: 18
        max: 18
      udf:
        container:
          image: quay.io/kohlisid/numaflow-go/batch-map-flatmap:test1
        resources:
          limits:
            cpu: "4"
            memory: 16Gi
          requests:
            cpu: "2"
            memory: 8Gi
      containerTemplate:
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 4Gi
    - name: out
      partitions: 12
      scale:
        min: 6
        max: 6
      sink:
        # A blackhole sink that discards every message it receives
        blackhole: {}
      containerTemplate:
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 4Gi
  edges:
    - from: in
      to: batch-cat
    - from: batch-cat
      to: out
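
As a quick sanity check on the "50k" figure, the aggregate load can be derived from the generator settings in the spec above. This is only a sketch: it assumes each source replica generates `rpu` records per `duration` independently and that `msgSize` is in bytes. The function and constant names are illustrative and not part of Numaflow.

```python
# Values copied from the "in" vertex of the pipeline spec above.
SOURCE_PODS = 10        # scale.min == scale.max == 10
RPU = 5000              # records generated per `duration` (assumed: per pod)
DURATION_S = 1          # duration: 1s
MSG_SIZE_BYTES = 500    # msgSize: 500 (assumed to be bytes)
TEST_DAYS = 5           # length of the endurance run


def aggregate_rate(pods: int, rpu: int, duration_s: float) -> float:
    """Messages per second generated across all source pods."""
    return pods * rpu / duration_s


rate = aggregate_rate(SOURCE_PODS, RPU, DURATION_S)
payload_mb_per_s = rate * MSG_SIZE_BYTES / 1_000_000
total_messages = rate * TEST_DAYS * 24 * 60 * 60

print(f"aggregate rate: {rate:,.0f} msg/s")               # 50,000 msg/s
print(f"payload volume: {payload_mb_per_s:.1f} MB/s")     # 25.0 MB/s
print(f"messages in {TEST_DAYS} days: {total_messages:,.0f}")  # 21,600,000,000
```

Under those assumptions the run pushes roughly 21.6 billion messages through the batch map vertex over the five days.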
kohlisid commented 3 months ago
kubectl top pods

NAME                                      CPU(cores)   MEMORY(bytes)
simple-pipeline-batch-cat-0-0mcfa         1191m        73Mi            
simple-pipeline-batch-cat-1-swdiw         1147m        69Mi            
simple-pipeline-batch-cat-10-4yt10        1154m        73Mi            
simple-pipeline-batch-cat-11-omegc        1039m        69Mi            
simple-pipeline-batch-cat-12-riuav        1201m        74Mi            
simple-pipeline-batch-cat-13-ebhjb        1131m        69Mi            
simple-pipeline-batch-cat-14-x1jm8        1208m        67Mi            
simple-pipeline-batch-cat-15-0onio        1152m        68Mi            
simple-pipeline-batch-cat-16-vyegw        1066m        76Mi            
simple-pipeline-batch-cat-17-wpmkn        1062m        71Mi            
simple-pipeline-batch-cat-2-o8yxk         1064m        71Mi            
simple-pipeline-batch-cat-3-032nh         1041m        70Mi            
simple-pipeline-batch-cat-4-juiek         1095m        72Mi            
simple-pipeline-batch-cat-5-8ckf1         1082m        66Mi            
simple-pipeline-batch-cat-6-mbqdv         1101m        69Mi            
simple-pipeline-batch-cat-7-f6xtg         1131m        69Mi            
simple-pipeline-batch-cat-8-jgjli         1150m        68Mi            
simple-pipeline-batch-cat-9-yxyf9         981m         66Mi            
simple-pipeline-daemon-854ff49886-f2pfg   181m         58Mi            
simple-pipeline-in-0-fdyh6                550m         32Mi            
simple-pipeline-in-1-xgzj3                527m         34Mi            
simple-pipeline-in-2-fezzd                508m         33Mi            
simple-pipeline-in-3-pnavp                471m         34Mi            
simple-pipeline-in-4-jiwm8                503m         32Mi            
simple-pipeline-in-5-trjds                473m         32Mi            
simple-pipeline-in-6-rmhql                454m         34Mi            
simple-pipeline-in-7-hf0ll                536m         34Mi            
simple-pipeline-in-8-zxc3s                442m         33Mi            
simple-pipeline-in-9-ea3dl                485m         35Mi            
simple-pipeline-out-0-npy4r               1524m        66Mi            
simple-pipeline-out-1-uwkuh               1601m        58Mi            
simple-pipeline-out-2-reeaw               1446m        63Mi            
simple-pipeline-out-3-y16bh               1539m        66Mi            
simple-pipeline-out-4-hincx               1541m        65Mi            
simple-pipeline-out-5-sjzpr               1466m        63Mi 
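
To compare usage against the requests and limits in the spec, the output above can be grouped per vertex. The following is a minimal sketch that uses a few rows copied from that output. It assumes pod names follow `<pipeline>-<vertex>-<replica index>-<5-character suffix>`, as they appear to here; `summarize` is an illustrative helper, not a Numaflow or kubectl feature.

```python
import re
from collections import defaultdict

# A few rows copied from the `kubectl top pods` output above.
TOP_OUTPUT = """\
simple-pipeline-batch-cat-0-0mcfa         1191m        73Mi
simple-pipeline-batch-cat-1-swdiw         1147m        69Mi
simple-pipeline-in-0-fdyh6                550m         32Mi
simple-pipeline-in-1-xgzj3                527m         34Mi
simple-pipeline-out-0-npy4r               1524m        66Mi
simple-pipeline-out-1-uwkuh               1601m        58Mi
simple-pipeline-daemon-854ff49886-f2pfg   181m         58Mi
"""

# Assumed naming scheme for vertex pods; the daemon pod does not match it.
POD_RE = re.compile(r"^simple-pipeline-(?P<vertex>.+)-\d+-[a-z0-9]{5}$")


def summarize(top_output: str) -> dict:
    """Group CPU (millicores) and memory (Mi) readings by vertex name."""
    usage = defaultdict(list)
    for line in top_output.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and values in other units.
        if len(parts) != 3 or not parts[1].endswith("m") or not parts[2].endswith("Mi"):
            continue
        name, cpu, mem = parts
        match = POD_RE.match(name)
        if match is None:
            continue  # not a vertex pod (for example, the daemon)
        usage[match["vertex"]].append((int(cpu[:-1]), int(mem[:-2])))
    return {
        vertex: {
            "pods": len(rows),
            "avg_cpu_m": sum(c for c, _ in rows) / len(rows),
            "max_mem_mi": max(m for _, m in rows),
        }
        for vertex, rows in usage.items()
    }


for vertex, stats in summarize(TOP_OUTPUT).items():
    print(vertex, stats)
```

Feeding the full output through the same function gives one line per vertex, which is easier to scan than 35 pod rows.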
kohlisid commented 3 months ago
[Screenshot 2024-07-16 at 8:44:41 AM]
kohlisid commented 3 months ago

UDF processing time

[Screenshot 2024-07-16 at 8:48:43 AM: UDF processing time]
kohlisid commented 3 months ago

@vigith @whynowy What other details would you like me to attach here? cc @numaproj/numaflow-dev

KeranYang commented 3 months ago

Why was there a spike on Monday?

kohlisid commented 3 months ago

@KeranYang The spike was caused by pod migrations on the cluster:

"reason":"EvictionByEvictionAPI","message":"Eviction API: evicting"

No restarts or errors were seen in the pods:

NAME                                      READY   STATUS    RESTARTS   AGE
simple-pipeline-batch-cat-0-0mcfa         2/2     Running   0          32h
simple-pipeline-batch-cat-1-swdiw         2/2     Running   0          28h
simple-pipeline-batch-cat-10-4yt10        2/2     Running   0          30h
simple-pipeline-batch-cat-11-omegc        2/2     Running   0          31h
simple-pipeline-batch-cat-12-riuav        2/2     Running   0          32h
simple-pipeline-batch-cat-13-ebhjb        2/2     Running   0          28h
simple-pipeline-batch-cat-14-x1jm8        2/2     Running   0          31h
simple-pipeline-batch-cat-15-0onio        2/2     Running   0          33h
simple-pipeline-batch-cat-16-vyegw        2/2     Running   0          28h
simple-pipeline-batch-cat-17-wpmkn        2/2     Running   0          28h
simple-pipeline-batch-cat-2-o8yxk         2/2     Running   0          30h
simple-pipeline-batch-cat-3-032nh         2/2     Running   0          29h
simple-pipeline-batch-cat-4-juiek         2/2     Running   0          32h
simple-pipeline-batch-cat-5-8ckf1         2/2     Running   0          28h
simple-pipeline-batch-cat-6-mbqdv         2/2     Running   0          29h
simple-pipeline-batch-cat-7-f6xtg         2/2     Running   0          28h
simple-pipeline-batch-cat-8-jgjli         2/2     Running   0          31h
simple-pipeline-batch-cat-9-yxyf9         2/2     Running   0          29h
simple-pipeline-daemon-854ff49886-f2pfg   1/1     Running   0          32h
simple-pipeline-in-0-fdyh6                1/1     Running   0          32h
simple-pipeline-in-1-xgzj3                1/1     Running   0          30h
simple-pipeline-in-2-fezzd                1/1     Running   0          28h
simple-pipeline-in-3-pnavp                1/1     Running   0          33h
simple-pipeline-in-4-jiwm8                1/1     Running   0          29h
simple-pipeline-in-5-trjds                1/1     Running   0          28h
simple-pipeline-in-6-rmhql                1/1     Running   0          29h
simple-pipeline-in-7-hf0ll                1/1     Running   0          27h
simple-pipeline-in-8-zxc3s                1/1     Running   0          28h
simple-pipeline-in-9-ea3dl                1/1     Running   0          30h
simple-pipeline-out-0-npy4r               1/1     Running   0          32h
simple-pipeline-out-1-uwkuh               1/1     Running   0          30h
simple-pipeline-out-2-reeaw               1/1     Running   0          28h
simple-pipeline-out-3-y16bh               1/1     Running   0          33h
simple-pipeline-out-4-hincx               1/1     Running   0          31h
simple-pipeline-out-5-sjzpr               1/1     Running   0          28h
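
The same check can be scripted rather than read by eye. Below is a minimal sketch over a few rows copied from the listing above. `unhealthy_pods` is an illustrative helper and assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE).

```python
# A few rows copied from the `kubectl get pods` output above.
GET_PODS_OUTPUT = """\
simple-pipeline-batch-cat-0-0mcfa         2/2     Running   0          32h
simple-pipeline-in-7-hf0ll                1/1     Running   0          27h
simple-pipeline-out-3-y16bh               1/1     Running   0          33h
simple-pipeline-daemon-854ff49886-f2pfg   1/1     Running   0          32h
"""


def unhealthy_pods(output: str) -> list:
    """Names of pods that restarted, are not Running, or are not fully ready."""
    bad = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines and the header row.
        if len(parts) < 5 or parts[0] == "NAME":
            continue
        # Only the first four columns are needed; RESTARTS may be followed
        # by extra tokens such as "(5m ago)" before the AGE column.
        name, ready, status, restarts = parts[:4]
        ready_now, ready_total = ready.split("/")
        if status != "Running" or restarts != "0" or ready_now != ready_total:
            bad.append(name)
    return bad


print(unhealthy_pods(GET_PODS_OUTPUT))  # prints []
```

An empty list corresponds to the observation above: every pod is Running, fully ready, and has zero restarts.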
vigith commented 3 months ago

Good, no containers were ever restarted :)

@kohlisid, please paste the ISB spec for reference, too.

kohlisid commented 3 months ago

ISB Spec

apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  jetstream:
    version: 2.10.11
    startArgs:
    replicas: 3
    persistence:
      storageClassName: gp3
      accessMode: ReadWriteOnce
      volumeSize: 40Gi
    containerTemplate:
      resources:
        limits:
          memory: 16384Mi
        requests:
          cpu: 8
          memory: 16384Mi
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: isbsvc
                  numaflow.numaproj.io/isbsvc-name: fci-session
              topologyKey: topology.kubernetes.io/zone
            weight: 100
kohlisid commented 3 months ago

All green on the endurance test!