grafana / k6-operator

An operator for running distributed k6 tests.
Apache License 2.0

Error when run test on 500 pods #408

Open volnyansky opened 6 months ago

volnyansky commented 6 months ago

Brief summary

I'm trying to run the test on 500 pods and get the error: exec /usr/bin/k6: argument list too long
I found a workaround: batching the test into packages of 300 pods with the same test id.

k6-operator version or image

0.0.14

Helm chart version (if applicable)

k6-operator-3.6.0

TestRun / PrivateLoadZone YAML

apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
    name: ${USERNAME}-${SCRIPT}-${BATCH}
    namespace: k6
spec:
    # number of pods to run in parallel
    parallelism: ${BATCH_PODS}
    script:
        configMap:
            name: ${USERNAME}-test-script-${BATCH}
            file: test.tar
    arguments: -o experimental-prometheus-rw --tag testid=${TESTID}
    runner:
        image: 569129334545.dkr.ecr.us-east-1.amazonaws.com/k6-robot-dev:latest
        env:
            -   name: K6_PROMETHEUS_RW_SERVER_URL
                value: "http://victoria-metrics-single-server.monitoring.svc.cluster.local:8428/api/v1/write"
            -   name: K6_PROMETHEUS_RW_TREND_STATS
                value: "count,sum,min,max,avg,med,p(90),p(95),p(99)"
            -   name: K6_BROWSER_ARGS
                value: "window-size=1920x1080,no-sandbox,disable-setuid-sandbox,allow-file-access,use-fake-device-for-media-stream,use-fake-ui-for-media-stream,use-file-for-fake-video-capture=/usr/local/assets/video.mjpeg,use-file-for-fake-audio-capture=/usr/local/assets/audio.wav"
            -   name: K6_BROWSER_TIMEOUT
                value: "45s"
            -   name: VU_ID_START
                value: "${VU_ID_START}"
        nodeSelector:
            engageli.com/role: k6-load-test
        resources:
            limits:
                cpu: "${CPU}"
                memory: ${MEMORY}Mi
            requests:
                cpu: 100m
                memory: ${MEMORY}Mi

Other environment details (if applicable)

No response

Steps to reproduce the problem

Run the test on 500 pods; the number of VUs doesn't matter.

Expected behaviour

The test runs in the given number of pods

Actual behaviour

Test crashes

volnyansky commented 6 months ago

I think the issue is in the curl max argument length. I see the following in the starter pod:
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.109.212:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.60.188:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.188.109:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.112.255:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.143.154:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.164.249:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.39.112:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.214.183:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.199.160:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.110.241:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.204.180:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.44.121:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.33.28:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","typ ...

yorugac commented 6 months ago

Hi @volnyansky, this is certainly a new one :sweat_smile: Did the error exec /usr/bin/k6: argument list too long come from the starter pod then?

Out of curiosity, what are you testing that you need such a large test?

On the solution: that command is just an iterative concatenation, so I guess we could split it into several commands when there are lots of instances. The question is what kind of values for ARG_MAX can be expected in Kubernetes deployments. The starter command executes sequentially right now anyway, which is probably not ideal for a test as large as this one. But figuring out parallelization for it would definitely be a harder problem.
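The splitting idea can be sketched in a few lines of POSIX shell. This is only an illustration, not the operator's actual code: `start_cmd`, `emit_batches` and the byte budget are hypothetical names, and only the shape of the curl command is taken from the starter pod output in this thread. It groups the per-runner commands so that each concatenated string stays under a chosen budget.

```shell
#!/bin/sh
# Sketch: group per-runner start commands into batches whose concatenated
# length stays below a byte budget. Function names are hypothetical.

PAYLOAD='{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}'

# start_cmd IP -> prints the curl command for one runner
# (command shape copied from the starter pod output).
start_cmd() {
    printf "curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://%s:6565/v1/status -d '%s'" "$1" "$PAYLOAD"
}

# emit_batches MAX_BYTES IP... -> prints one line per batch, commands joined by ';'.
# ${#var} counts characters, which equals bytes for this ASCII-only command.
# A single command longer than MAX_BYTES is still emitted as its own batch.
emit_batches() {
    max=$1; shift
    batch=''
    for ip in "$@"; do
        cmd=$(start_cmd "$ip")
        if [ -z "$batch" ]; then
            batch=$cmd
        elif [ $(( ${#batch} + 1 + ${#cmd} )) -le "$max" ]; then
            batch="$batch;$cmd"
        else
            printf '%s\n' "$batch"
            batch=$cmd
        fi
    done
    if [ -n "$batch" ]; then
        printf '%s\n' "$batch"
    fi
}
```

For example, `emit_batches 4096 10.100.232.147 10.100.109.212 10.100.60.188` prints one line per batch, and each line could then be run as a separate starter command.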

volnyansky commented 6 months ago

@yorugac I'm running a stress test with a real browser. I need to test not only REST and WebSocket APIs, but also WebRTC, so I can't run thousands of robots in one pod.
Yes, I have the issue in the starter pod.

volnyansky commented 6 months ago

Update: it fails on the runners too. The strangest thing is that the command line is not that long:
k6 run --quiet --execution-segment=7/250:8/250 --execution-segment-sequence=0,1/250,2/250,3/250,4/250,5/250,6/250,7/250,8/250,9/250,10/250,11/250,12/250,13/250,14/250,15/250,16/250,17/250,18/250,19/250,20/250,21/250,22/250,23/250,24/250,25/250,26/250,27/250,28/250,29/250,30/250,31/250,32/250,33/250,34/250,35/250,36/250,37/250,38/250,39/250,40/250,41/250,42/250,43/250,44/250,45/250,46/250,47/250,48/250,49/250,50/250,51/250,52/250,53/250,54/250,55/250,56/250,57/250,58/250,59/250,60/250,61/250,62/250,63/250,64/250,65/250,66/250,67/250,68/250,69/250,70/250,71/250,72/250,73/250,74/250,75/250,76/250,77/250,78/250,79/250,80/250,81/250,82/250,83/250,84/250,85/250,86/250,87/250,88/250,89/250,90/250,91/250,92/250,93/250,94/250,95/250,96/250,97/250,98/250,99/250,100/250,101/250,102/250,103/250,104/250,105/250,106/250,107/250,108/250,109/250,110/250,111/250,112/250,113/250,114/250,115/250,116/250,117/250,118/250,119/250,120/250,121/250,122/250,123/250,124/250,125/250,126/250,127/250,128/250,129/250,130/250,131/250,132/250,133/250,134/250,135/250,136/250,137/250,138/250,139/250,140/250,141/250,142/250,143/250,144/250,145/250,146/250,147/250,148/250,149/250,150/250,151/250,152/250,153/250,154/250,155/250,156/250,157/250,158/250,159/250,160/250,161/250,162/250,163/250,164/250,165/250,166/250,167/250,168/250,169/250,170/250,171/250,172/250,173/250,174/250,175/250,176/250,177/250,178/250,179/250,180/250,181/250,182/250,183/250,184/250,185/250,186/250,187/250,188/250,189/250,190/250,191/250,192/250,193/250,194/250,195/250,196/250,197/250,198/250,199/250,200/250,201/250,202/250,203/250,204/250,205/250,206/250,207/250,208/250,209/250,210/250,211/250,212/250,213/250,214/250,215/250,216/250,217/250,218/250,219/250,220/250,221/250,222/250,223/250,224/250,225/250,226/250,227/250,228/250,229/250,230/250,231/250,232/250,233/250,234/250,235/250,236/250,237/250,238/250,239/250,240/250,241/250,242/250,243/250,244/250,245/250,246/250,247/250,248/250,249/250,1 -o experimental-prometheus-rw --tag testid=stas-browser-mock-login-test-7.5k-2024-06-03-21-07-03 /test/test.tar --address=0.0.0.0:6565 --paused --tag instance_id=8 --tag job_name=stas-browser-mock-login-test-0-8
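One way to check whether something other than the visible command line is using up the budget: per POSIX, ARG_MAX covers the environment data passed to exec as well as the arguments. The following is a rough sketch under that assumption. The function names are hypothetical, and `env | wc -c` only approximates what the kernel actually counts.

```shell
#!/bin/sh
# env_bytes -> approximate size in bytes of the current environment
# (each NAME=value entry plus one separator byte).
env_bytes() {
    echo $(( $(env | wc -c) ))
}

# fits_limit LIMIT ARG... -> succeeds if the given arguments plus the
# current environment fit into LIMIT bytes (approximation only).
fits_limit() {
    limit=$1; shift
    total=$(env_bytes)
    for a in "$@"; do
        total=$(( total + ${#a} + 1 ))   # +1 for the terminating NUL byte
    done
    [ "$total" -le "$limit" ]
}
```

In a shell that has the same environment as the failing container, and where `getconf` is available, `fits_limit "$(getconf ARG_MAX)" k6 run --quiet ...` gives a quick indication of how close the call is to the limit.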

yorugac commented 6 months ago

it fails on runners too.

@volnyansky, can you please post the full log from one of those runners?

I'm running a stress test with a real browser.

I'm a bit confused by "real browser" part: do you mean the xk6-browser?

volnyansky commented 6 months ago

"Do you mean the xk6-browser?" Yes, it is xk6. The log contains only one line: exec /usr/bin/k6: argument list too long

Also, I figured out that I need to wait until the services left over from the previous test are deleted. Your code collects IPs from the services list, which can also lead to overflow.

volnyansky commented 6 months ago

@yorugac I have an idea for a fix: you could store the IPs in one or more env variables as a list separated by ; and then iterate over this list in the docker start command: `#!/bin/bash
IFS=';' read -ra ARR <<< "$IPS"
for i in "${ARR[@]}"; do
  # process "$i": send the PATCH request to this runner
  curl -X PATCH "$i"
done`

volnyansky commented 6 months ago

@yorugac I've found the final workaround :) I'm running the test in batches and assigning each batch its own namespace. Your code queries the k8s services list, so it possibly returns all services in the namespace and not just those of the current test run.

yorugac commented 5 months ago

@volnyansky, WDYM by batches? You're not running 500 instances anymore?

it possibly returns all services in the namespace and not just those of the current test run

:thinking: we'd still need to send a "start" command with something like cURL though.

Could you please clarify a bit? :slightly_smiling_face:

volnyansky commented 5 months ago

@yorugac I need to run more than 500 instances, 5000 actually. So I split one test into several and I call them batches. But if all these tests are run in one namespace, I still get the "argument list too long" error, and if I isolate each test in its own namespace, I don't get the error.

I agree that you still need to send curl. I just proposed a more compact way to call it, so as not to reach the ARG_MAX limit that causes the "argument list too long" error.

yorugac commented 5 months ago

:thinking: It's strange that namespace is a factor here... If the test is "split" then it's already producing another curl call, even if both tests are in the same namespace. IIUC, the error appears from curl itself and from k6 - not from getting the list of Kubernetes services.

Well, I think it's still about making batches, as described in this comment. Do you happen to have any estimate on what the value of ARG_MAX is? For example, what size of batches work for you?

volnyansky commented 5 months ago

@yorugac In my env ARG_MAX = 131072 bytes
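As a back-of-the-envelope aid for picking a batch size, the number of start commands that fit into such a budget can be estimated as below. This sketch assumes the concatenated command string is the only thing counted against the limit, which is optimistic because environment data counts as well. `max_cmds` is a hypothetical helper and the address in `sample` is just one of the values from the starter pod output.

```shell
#!/bin/sh
# max_cmds BUDGET CMD_LEN -> how many CMD_LEN-byte commands, joined by a
# one-byte ';' separator, fit into BUDGET bytes.
# n commands need n*CMD_LEN + (n-1) bytes, so n <= (BUDGET + 1) / (CMD_LEN + 1).
max_cmds() {
    echo $(( ($1 + 1) / ($2 + 1) ))
}

# One start command of the shape shown in the starter pod output.
sample="curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{\"data\":{\"attributes\":{\"paused\":false,\"stopped\":false},\"id\":\"default\",\"type\":\"status\"}}'"

# Upper bound on commands per batch for the ARG_MAX value reported above.
max_cmds 131072 "${#sample}"
```

The printed number is an upper bound only; a real batch size should leave headroom for the environment and for the rest of the command line.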

frittentheke commented 3 months ago

If I may kindly point to the discussion about the use of the REST API. I was commenting about switching to doing the "start" command natively and not via some job -> pod and templated curl invocations: https://github.com/grafana/k6-operator/issues/87#issuecomment-2284010897.

It's not only about efficiency, but also about keeping the k6-operator closer in the loop about the state of the runners....