I've spent the past couple of weeks working with K6 trying to improve a test case where we have to send FormData-type requests to an API with a 'large' image (~140kb) and I've gone from peaking at ~30RPS with 40VU to >1000RPS with 30VU.
The issue largely seems to be around using Uint8Arrays and copying many to one 🤔
Preemptive apologies if I've misunderstood or misrepresented any inner workings of K6.
Problem
We have some requirements for testing against this particular endpoint:
- Requests must be sent as multipart/form-data, built with the FormData polyfill
- The payload includes a 'large' (~140kb) image

First results
Peaking at ~30RPS with 40VU, in contrast to ~800RPS with a 4kb image and the same VU count:
- 100% CPU
- Grafana Cloud was angry and sad, throwing warnings around CPU usage
- Memory never went above 5%
Observations
Initially I thought we might be CPU-bound on IO operations, with open being in the init part of the lifecycle; however, this was not the case.
- Uint8Array and buffers in general perform very poorly in K6
- The FormData polyfill essentially converts every entry/field to a byte array, then populates a new Uint8Array (a baseline example follows this list)
- This phase is where CPU spikes to 100%
- SharedArray only accepting strings, numbers, and other primitives, but not buffers of any kind, made it very difficult to work with
- Memory usage was never an issue with this
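For reference, here's roughly what the baseline test looked like, following the documented jslib usage of the polyfill (the URL and field names here are placeholders, not from the original test):

import http from 'k6/http';
import { FormData } from 'https://jslib.k6.io/formdata/0.0.2/index.js';

// Opened once in the init context, reused by every VU.
const img = open('image.jpeg', 'b');

export default function () {
  const fd = new FormData();
  fd.append('json', JSON.stringify({ some: 'metadata' })); // hypothetical JSON part
  fd.append('image', http.file(img, 'image.jpeg', 'image/jpeg'));
  // fd.body() is where the expensive byte array copying happens
  http.post('https://example.com/upload', fd.body(), {
    headers: { 'Content-Type': 'multipart/form-data; boundary=' + fd.boundary },
  });
}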
Road to low CPU, high RPS
I experimented with a number of things that didn't yield better results, most of which were before I inspected the FormData polyfill and found the expensive operations:
- SharedArray: a number of attempts were made to forward-load binary data and store it here in various forms, to try to satisfy the supported types (see the sketch after this list)
- The experimental fs APIs
- Forward generating / building as much as possible in the init and setup lifecycles
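To illustrate the SharedArray limitation, here's a contrived sketch (not from the original test): SharedArray round-trips its contents through JSON, so typed arrays don't survive the trip.

import { SharedArray } from 'k6/data';

// A Uint8Array comes back from the JSON round trip as a plain
// numeric-keyed object, not a typed array.
const bin = new SharedArray('bin', function () {
  return [new Uint8Array([1, 2, 3])];
});

export default function () {
  console.log(JSON.stringify(bin[0])); // {"0":1,"1":2,"2":3}, no longer a Uint8Array
}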
At this point I explored the inner workings of the FormData polyfill and found the way it builds the request body to be the primary point of suspicion: body() is reasonably expensive.
Consider the case where we have two "parts" and we call body() (paraphrased as code after this list):
1. Declare body as an empty array
2. Prebuild a byte array for the RFC 7578 compliant data boundary, boundary
3. Stringify the JSON part, jsonStr
4. Push the boundary to body
5. Concatenate the strings composing the Content-Disposition, cd
6. Convert cd to a byte array
7. Push cd to body
8. Convert jsonStr to a byte array
9. Push jsonStr to body
10. Convert \r\n to a byte array
11. Push \r\n to body
12. Repeat, for the most part, for the image
13. Push a closing boundary to body
14. Create a new Uint8Array from the body array of byte arrays; this copies all data to a new collection in linear time
15. Return a reference to the buffer
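Paraphrased as code, the pattern looks something like this (a sketch, not the polyfill's exact source, reusing the toByteArr helper shown further down):

function buildBody(parts: { cd: string; data: Uint8Array }[]): ArrayBuffer {
  const body: Uint8Array[] = [];
  const boundary = toByteArr('--------RWWorkerFormDataBoundary\r\n'); // prebuilt once
  const crlf = toByteArr('\r\n');
  for (const part of parts) {
    body.push(boundary);
    body.push(toByteArr(part.cd)); // Content-Disposition, converted on every call
    body.push(part.data);
    body.push(crlf);
  }
  body.push(toByteArr('--------RWWorkerFormDataBoundary--\r\n')); // closing boundary
  // The expensive step: every byte array above is copied again, in linear
  // time, into a new contiguous Uint8Array on each call.
  const size = body.reduce((total, chunk) => total + chunk.length, 0);
  const out = new Uint8Array(size);
  let offset = 0;
  for (const chunk of body) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out.buffer;
}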
There's a lot going on here. Take a breather, then let's continue.
From here I thought: well, what can I do to forward calculate a lot of these repeated conversions and try to reduce this burden on iterations when running our load tests?
I brought in the parts of the FormData polyfill I could reuse, trimmed what I did not need, and forward calculated the byte arrays where possible in a SharedArray:
import { SharedArray } from 'k6/data';

const baseFormDataBoundary = '------RWWorkerFormDataBoundary';
const sharedFormData = new SharedArray('fd', function () {
  const contentBreak = toByteArr('\r\n');
  return [
    [...toByteArr(baseFormDataBoundary)], // have to expand these to number arrays as K6 does not like `Uint8Array`s
    [
      ...imgToFormDataByteArr(
        new Uint8Array(open('image.jpeg', 'b')),
        contentBreak
      ),
    ],
    [...contentBreak],
  ];
});
// Trimmed down `toByteArr` as I was already going to handle the binary data cases
function toByteArr(input: string): Uint8Array {
  const out = new Uint8Array(input.length);
  for (let i = 0; i < input.length; ++i) {
    out[i] = input.charCodeAt(i) & 0xff;
  }
  return out;
}
This meant that when it came to actually running iterations, I had already done a lot of the conversion and was able to simply join the parts before sending the data off.
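The per-iteration join was essentially the following (a small sketch, assuming the sharedFormData layout above):

// Copy the pre-converted number arrays into one contiguous buffer.
function joinParts(parts: number[][]): Uint8Array {
  const size = parts.reduce((total, part) => total + part.length, 0);
  const out = new Uint8Array(size);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset); // Uint8Array.set accepts plain number arrays too
    offset += part.length;
  }
  return out;
}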
With this I did get some improvements, but barely: from ~30RPS to ~100RPS. Far from the >1000RPS I've been aiming for. It was clear this wasn't going to work; maxing out the CPU was still a major problem.
Base64 encoding, manually stitching requests, and hitting 1000RPS
While experimenting with the byte data above, and in an effort to store the image data as something that could be shared, I noticed one small but important detail: base64 encoding and decoding is fast. Very fast.
So I asked myself: "if I simply open an image and encode it in base64, what does the CPU usage look like?" The short answer was: "next to nothing". When inspecting the sources for K6, I can see that k6/encoding gives us a binding to the Go implementation, encoding/base64, which is perfect.
I ventured down the path of creating base64 encoded strings of the data we need, so we can simply concatenate these strings before doing one big decode straight to a single buffer. Simple strategy, but there are some challenges with this too, of course.
The Go implementation of base64 decoding does not support decoding concatenated base64 strings that contain base64 padding (=). To work around this, before encoding any strings I simply ensure they're the right length (n % 3 == 0) and pad with spaces, or 0s, where necessary.
import { SharedArray } from 'k6/data';
import { b64encode } from 'k6/encoding';

const baseFormDataBoundary = '------RWWorkerFormDataBoundary'; // Just needs to be distinct from the body data, per the spec
const sharedFormData: string[] = new SharedArray('fd', function () {
  return [
    b64encode(imgToFormDataByteArr(new Uint8Array(open('image.jpeg', 'b')))), // image payload
    // These two strings are already the right length for b64 with no padding
    b64encode(`\r\n--${baseFormDataBoundary}\r\n`), // boundary MUST have NO surrounding whitespace, only newlines
    `--${baseFormDataBoundary}--\r\n`, // closing boundary (unencoded)
  ];
});
And here are some helpers I've written for anyone also embarking on this journey:
function calcPaddedLength(len: number): number {
  // Round len up to the next multiple of 3, so b64encode emits no '=' padding
  const remainder = len % 3;
  return remainder === 0 ? len : len + 3 - remainder;
}

function padStrToValidB64Len(input: string): string {
  const paddedLen = calcPaddedLength(input.length);
  if (paddedLen === 0) return input;
  return input.padEnd(paddedLen); // padEnd pads with spaces by default
}
With the image binary data and boundary data now shared, I brought in the use of the setup lifecycle hook:
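A minimal sketch of what this might look like (the URL is a placeholder; and unlike the snippet above, the closing boundary is base64 encoded here too, since its 36 characters are already a multiple of 3 and so encode with no padding):

import http from 'k6/http';
import { b64encode, b64decode } from 'k6/encoding';

export function setup() {
  // Stitch the pre-encoded, padding-free pieces into one base64 string once;
  // setup()'s return value is handed to every VU.
  const imgB64 = sharedFormData[0];
  const boundaryB64 = sharedFormData[1];
  const closingBoundary = sharedFormData[2];
  return { bodyB64: boundaryB64 + imgB64 + b64encode(closingBoundary) };
}

Which we can now safely and cleanly consume from the VU code:

export default function (data: { bodyB64: string }) {
  // One big decode straight to a single buffer, then send it off.
  const body = b64decode(data.bodyB64);
  http.post('https://example.com/upload', body, {
    headers: {
      'Content-Type': `multipart/form-data; boundary=${baseFormDataBoundary}`,
    },
  });
}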
There are obviously a lot of details omitted here; you'll have to work on making a compliant request body yourself, but the specification is reasonably clear. One small gotcha: the data boundary cannot have spaces before or after it, as each boundary is interpreted as a whole line and must match the boundary declared in your Content-Type header.
With these changes now in play, I'm seeing a significantly more performant load test with the large image payload, exceeding 1000RPS peaks. And the CPU usage has dropped drastically.