Open volnyansky opened 6 months ago
I think the issue is in the curl max argument length. I see the following in the starter pod:
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.109.212:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.60.188:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.188.109:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.112.255:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.143.154:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.164.249:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.39.112:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.214.183:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.199.160:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.110.241:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.204.180:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.44.121:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.33.28:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","typ
...
Hi @volnyansky, this is certainly a new one :sweat_smile: Did the error `exec /usr/bin/k6: argument list too long` come from the starter pod then?
Out of curiosity, what are you testing that you need such a large test?
On the solution: that command is just an iterative concatenation, so I guess we could split it into several commands when there are lots of instances. The question is what kind of values for ARG_MAX can be expected in Kubernetes deployments.
The starter command has sequential execution now anyway which is probably not ideal for such a large test as here. But figuring out parallelization for it would definitely be a harder problem.
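The splitting idea above could be sketched roughly as follows. This is only an illustration, not the operator's actual code: `build_batches` and its byte budget argument are hypothetical names, and the curl template is copied from the command pasted earlier in this thread. The function prints the command strings rather than running them, and it counts characters, which equals bytes for the ASCII-only commands shown here.

```bash
# build_batches MAX_BYTES IP...   (hypothetical helper, not part of k6-operator)
# Prints one command string per line; each line holds as many curl calls as
# fit within MAX_BYTES. A single call longer than MAX_BYTES still gets a line.
build_batches() (
  max_bytes=$1
  shift
  payload='{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}'
  batch=""
  for ip in "$@"; do
    cmd="curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://${ip}:6565/v1/status -d '${payload}';"
    # Flush the current batch when appending this call would exceed the budget.
    if [ -n "$batch" ] && [ $(( ${#batch} + ${#cmd} )) -gt "$max_bytes" ]; then
      printf '%s\n' "$batch"
      batch=""
    fi
    batch="$batch$cmd"
  done
  # Emit the last, possibly partial, batch.
  if [ -n "$batch" ]; then
    printf '%s\n' "$batch"
  fi
)
```

For example, `build_batches 65536 10.100.232.147 10.100.109.212` would print a single line containing both calls; a smaller budget yields more, shorter lines that could each be run as a separate command.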
@yorugac I'm running a stress test with a real browser. I need to test not only REST and WebSocket APIs, but also WebRTC, so I can't run thousands of robots in one pod.
Yes, I have the issue in the starter pod.
Update: it fails on runners too. The strangest thing is that the command line is not too long:
k6 run --quiet --execution-segment=7/250:8/250 --execution-segment-sequence=0,1/250,2/250,3/250,4/250,5/250,6/250,7/250,8/250,9/250,10/250,11/250,12/250,13/250,14/250,15/250,16/250,17/250,18/250,19/250,20/250,21/250,22/250,23/250,24/250,25/250,26/250,27/250,28/250,29/250,30/250,31/250,32/250,33/250,34/250,35/250,36/250,37/250,38/250,39/250,40/250,41/250,42/250,43/250,44/250,45/250,46/250,47/250,48/250,49/250,50/250,51/250,52/250,53/250,54/250,55/250,56/250,57/250,58/250,59/250,60/250,61/250,62/250,63/250,64/250,65/250,66/250,67/250,68/250,69/250,70/250,71/250,72/250,73/250,74/250,75/250,76/250,77/250,78/250,79/250,80/250,81/250,82/250,83/250,84/250,85/250,86/250,87/250,88/250,89/250,90/250,91/250,92/250,93/250,94/250,95/250,96/250,97/250,98/250,99/250,100/250,101/250,102/250,103/250,104/250,105/250,106/250,107/250,108/250,109/250,110/250,111/250,112/250,113/250,114/250,115/250,116/250,117/250,118/250,119/250,120/250,121/250,122/250,123/250,124/250,125/250,126/250,127/250,128/250,129/250,130/250,131/250,132/250,133/250,134/250,135/250,136/250,137/250,138/250,139/250,140/250,141/250,142/250,143/250,144/250,145/250,146/250,147/250,148/250,149/250,150/250,151/250,152/250,153/250,154/250,155/250,156/250,157/250,158/250,159/250,160/250,161/250,162/250,163/250,164/250,165/250,166/250,167/250,168/250,169/250,170/250,171/250,172/250,173/250,174/250,175/250,176/250,177/250,178/250,179/250,180/250,181/250,182/250,183/250,184/250,185/250,186/250,187/250,188/250,189/250,190/250,191/250,192/250,193/250,194/250,195/250,196/250,197/250,198/250,199/250,200/250,201/250,202/250,203/250,204/250,205/250,206/250,207/250,208/250,209/250,210/250,211/250,212/250,213/250,214/250,215/250,216/250,217/250,218/250,219/250,220/250,221/250,222/250,223/250,224/250,225/250,226/250,227/250,228/250,229/250,230/250,231/250,232/250,233/250,234/250,235/250,236/250,237/250,238/250,239/250,240/250,241/250,242/250,243/250,244/250,245/250,246/250,247/250,248/250,249/250,1 -o experimental-prometheus-rw --tag testid=stas-browser-mock-login-test-7.5k-2024-06-03-21-07-03 /test/test.tar --address=0.0.0.0:6565 --paused --tag instance_id=8 --tag job_name=stas-browser-mock-login-test-0-8
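To get a feel for how the `--execution-segment-sequence` value grows with the number of instances, a small sketch like the one below could be used. It only mirrors the format visible in the command above (`0,1/N,...,(N-1)/N,1`); `segment_sequence` is a hypothetical helper, and whether the operator builds the string exactly this way is an assumption.

```bash
# segment_sequence N   (hypothetical helper)
# Prints the sequence 0,1/N,2/N,...,(N-1)/N,1 in the format seen above.
segment_sequence() (
  n=$1
  out="0"
  i=1
  while [ "$i" -lt "$n" ]; do
    out="$out,$i/$n"
    i=$((i + 1))
  done
  printf '%s,1\n' "$out"
)
```

Running `segment_sequence 250 | wc -c` gives the size of that single argument in bytes, which can then be compared with the limit on the system in question.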
it fails on runners too.
@volnyansky, can you please post the full log from one of those runners?
I'm running a stress test with a real browser.
I'm a bit confused by "real browser" part: do you mean the xk6-browser?
I'm a bit confused by "real browser" part: do you mean the xk6-browser?
Yes, it is xk6. The log contains only one line: `exec /usr/bin/k6: argument list too long`.
Also, I figured out that I need to wait until the services left over from the previous test are deleted. Your code collects IPs from the services list, which can also lead to the overflow.
@yorugac I have an idea for a fix: you can store the IPs in env variable(s) as a list separated by `;`. Then you can iterate over this list in the docker start command:

```bash
#!/bin/bash
# IPS holds the runner addresses, separated by ';'.
IFS=';' read -ra ARR <<< "$IPS"
for i in "${ARR[@]}"; do
  curl -X PATCH "$i"
done
```
@yorugac I've found a final workaround :) I'm running the test in batches and assigning each batch its own namespace. Your code queries the Kubernetes service list, so it possibly returns all services in the namespace and not only those of the current test run.
@volnyansky, WDYM by batches? You're not running 500 instances anymore?
it possibly returns all services in the namespace and not only those of the current test run
:thinking: we'd still need to send a "start" command with something like cURL though.
Could you please clarify a bit? :slightly_smiling_face:
@yorugac I need to run more than 500 instances, 5000 actually. So I split one test into several, and I call them batches. But if all these tests run in one namespace, I still get the "argument list too long" error, and if I isolate each test in its own namespace, I don't get the error.
I agree that you still need to send curl. I just proposed a more compact way to call it, so as not to reach the ARG_MAX limit, which causes the "argument list too long" error.
:thinking: It's strange that namespace is a factor here... If the test is "split" then it's already producing another curl call, even if both tests are in the same namespace. IIUC, the error appears from curl itself and from k6, not from getting the list of Kubernetes services.
Well, I think it's still about making batches, as described in this comment. Do you happen to have any estimate on what the value of ARG_MAX is? For example, what size of batches works for you?
@yorugac In my env ARG_MAX = 131072 bytes.
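When estimating batch sizes against a limit like this, it may help to remember that POSIX defines ARG_MAX as covering the arguments to exec together with the environment data, so both count. The sketch below is a rough check under that assumption; `fits_limit` is a hypothetical helper, and the byte counts are approximate because the kernel's exact accounting (terminators, pointers, per-string limits) is not modeled.

```bash
# fits_limit STRING LIMIT   (hypothetical helper)
# Succeeds if STRING plus the current environment is smaller than LIMIT bytes.
fits_limit() (
  str_bytes=$(printf '%s' "$1" | wc -c)
  env_bytes=$(env | wc -c)
  [ $(( str_bytes + env_bytes )) -lt "$2" ]
)

# Possible usage: compare a command string against the system's reported limit.
# fits_limit "$cmd" "$(getconf ARG_MAX)" && echo "fits" || echo "too long"
```

Running the check inside the affected pod rather than on a workstation would give the more relevant answer, since both the limit and the environment can differ between the two.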
If I may kindly point to the discussion about the use of the REST API. I was commenting about switching to doing the "start" command natively and not via some job -> pod and templated curl invocations: https://github.com/grafana/k6-operator/issues/87#issuecomment-2284010897.
It's not only about efficiency, but also about keeping the k6-operator closer in the loop about the state of the runners....
Brief summary
I'm trying to run the test on 500 pods and get the error: `exec /usr/bin/k6: argument list too long`
I found a workaround: batching the test into packages of 300 pods with the same test id.
k6-operator version or image
0.0.14
Helm chart version (if applicable)
k6-operator-3.6.0
TestRun / PrivateLoadZone YAML
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: ${USERNAME}-${SCRIPT}-${BATCH}
  namespace: k6
spec:
  # number of pods to run in parallel
Other environment details (if applicable)
No response
Steps to reproduce the problem
Run the test on 500 pods; the number of VUs doesn't matter
Expected behaviour
Test runs in the given number of pods
Actual behaviour
Test crashes