grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Problems and deficiencies around uploading big files with k6 #2311

Open mstoykov opened 2 years ago

mstoykov commented 2 years ago

This will discuss deficiencies around net/http, but in reality, some of those apply to other APIs as well.

The idea is mostly to:

  1. Discuss the problems with the current API
  2. Explore what is actually wrong (or not) with it
  3. Figure out which parts are important and which we would like "fixed"
  4. And then possibly discuss solutions, although unless they are simple, I would advise having them in separate issues linking to this one.

But let's first look at a fairly simple script:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

const binFile = open('/path/to/file.bin', 'b'); // line A

export default function () {
  const data = {
    field: Math.random(), // dynamic field
    file: http.file(binFile, 'test.bin'), // line B
  };

  const res = http.post('https://example.com/upload', data); // line C
  sleep(3);
}
```

Given this very simple example, currently (v0.35.0) we have:

  1. A copy of file.bin is created for each VU on line A. This can be addressed with #1931 and PR #2303.
  2. On line B there is currently no copy, but as binFile is a mutable ArrayBuffer, the underlying data can technically be edited after `data` is created. And if http.post were asynchronous (and 3 were fixed), this could possibly even be a data race. If it were copying, and if there were no dynamic data, this could have been done once in the init context.
  3. On line C, `data` gets "marshaled" into a multipart body, effectively copying the body.
  4. On line C again, `res` also has a field `request`, which has a field `body`, which is a string dump of the body created above.
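Point 2 can be illustrated outside of k6. The following is a hypothetical sketch in plain Node.js (no k6 APIs; the `data` shape just mimics what `http.file` produces): because only a reference to the ArrayBuffer is stored, edits made after the request payload is built leak into whatever is eventually serialized.

```javascript
// Hypothetical illustration of point 2: `data` holds a reference to the
// ArrayBuffer, not a copy, so mutations after `data` is built are visible
// to whoever marshals the body later.
const binFile = new ArrayBuffer(4);
new Uint8Array(binFile).set([1, 2, 3, 4]);

// Stand-in for `{ file: http.file(binFile, 'test.bin') }` -- no copy is made.
const data = { file: { data: binFile, filename: 'test.bin' } };

// The script mutates the buffer after the request payload was "created"...
new Uint8Array(binFile)[0] = 99;

// ...and the later "marshaling" step sees the mutated bytes.
const sent = new Uint8Array(data.file.data);
console.log(sent[0]); // 99, not 1
```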

So for this particular script, points 1, 3 and 4 make copies of a potentially big amount of bytes.

Let's go with a similar but slightly different script:

```javascript
import http from 'k6/http';
import { check } from 'k6';
import { FormData } from 'https://jslib.k6.io/formdata/0.0.2/index.js';

const binFile = open('/path/to/file.bin', 'b'); // line A

export default function () {
  const fd = new FormData();
  fd.append('someTextField', Math.random());
  fd.append('file', http.file(binFile, 'file.bin', 'pdf')); // line B

  const res = http.post('https://httpbin.test.k6.io/post', fd.body(), { // line C
    headers: { 'Content-Type': 'multipart/form-data; boundary=' + fd.boundary },
  });
  check(res, {
    'is status 200': (r) => r.status === 200,
  });
}
```

Here we have the exact same thing, except that instead of point 3 happening inside k6, it happens in FormData's .body() call, which is in JavaScript. This also happens to be the currently advised way to do multipart requests, due to a number of issues.
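To make the copy concrete, here is a simplified, hypothetical sketch (not the actual jslib implementation) of what a JS-side .body() necessarily does: concatenate the multipart framing and every part's bytes into one brand-new buffer, i.e. a full copy of the file on every call.

```javascript
// Simplified sketch of a JS-side multipart body builder: every byte of every
// part is copied into one freshly allocated buffer.
function buildMultipartBody(parts, boundary) {
  const encoder = new TextEncoder();
  const pieces = [];
  for (const { name, filename, data } of parts) {
    pieces.push(encoder.encode(
      `--${boundary}\r\nContent-Disposition: form-data; name="${name}"` +
      (filename ? `; filename="${filename}"` : '') + '\r\n\r\n'
    ));
    pieces.push(data instanceof Uint8Array ? data : encoder.encode(String(data)));
    pieces.push(encoder.encode('\r\n'));
  }
  pieces.push(encoder.encode(`--${boundary}--\r\n`));

  // The copy: all pieces, including the whole file, are written into `body`.
  const total = pieces.reduce((n, p) => n + p.length, 0);
  const body = new Uint8Array(total);
  let offset = 0;
  for (const p of pieces) { body.set(p, offset); offset += p.length; }
  return body;
}

const file = new Uint8Array([0xde, 0xad, 0xbe, 0xef]);
const body = buildMultipartBody(
  [{ name: 'file', filename: 'file.bin', data: file }],
  'k6boundary'
);
console.log(body.length > file.length); // true: file bytes plus framing
```

For a multi-megabyte file, this allocation happens on every iteration of every VU.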

Arguably, just #1931 can help with line A, and as long as the body doesn't need anything dynamic, the `new SharedArrayBuffer(name, function () { return arrayBuffer; })` it would let us do could be used to build the whole body in the init context. edit: it turns out that I am wrong, as at least in the case of FormData you also need the boundary, which would require SharedArrayBuffer to somehow return both a buffer and a string (the boundary). While this is doable, I think it would make the feature extremely specific, so arguably we would then need to make it more general. I guess (I haven't checked) we could have FormData take in the boundary and make it a constant for each VU 🤷‍♂, but this now seems like we are just fixing a hack with a hack, IMO.

This still means that on each request that uploads, we make a string copy of the whole body. In my experience, this body is also never actually used. In reality, the only case where it's even relevant is the first example, as it can be used to see what k6 actually sent.

The currently proposed solution is to just reference the original value in all cases but the one where we generate it. In practice, though, this again falls far short in the case where the values are dynamic and the body needs to be generated.

All of this also isn't really thought through from the perspective of asynchronous calls. With asynchronous operations there is nothing preventing:

```javascript
http.postPromise(url, someBody).then(/* ... something ... */);
someBody[23] = 12; // or any other change to the body
http.postPromise(aDifferentUrl, someBody).then(/* ... something else ... */);
```

In this case there are multiple points where race conditions can occur: if someBody gets modified, not only can it race with either of the posts being done, but response.request.body will not actually be accurate.
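http.postPromise does not exist today; the following plain-Promise sketch (hypothetical names throughout) just demonstrates the timing problem: an async send reads the buffer when the request is actually written, not when the call is made, so both requests can observe a mutation that happened "between" them.

```javascript
// Hypothetical stand-in for an async http.post that, like a real one would,
// reads the body buffer only when the request is actually serialized.
function fakePostPromise(url, body) {
  return Promise.resolve().then(() => ({
    url,
    sentByte23: new Uint8Array(body)[23], // read at "send" time
  }));
}

const someBody = new ArrayBuffer(32);
new Uint8Array(someBody)[23] = 1;

const first = fakePostPromise('https://example.com/a', someBody);
new Uint8Array(someBody)[23] = 12; // mutation between the two "requests"
const second = fakePostPromise('https://example.com/b', someBody);

Promise.all([first, second]).then(([a, b]) => {
  // Both requests saw byte 12, even though `first` was issued before the
  // mutation -- so neither response.request.body could be accurate.
  console.log(a.sentByte23, b.sentByte23); // 12 12
});
```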

I really can't figure out any way for response.request.body to be accurate in cases where we don't want to copy it and it's modifiable. Requiring all body arguments to http.* to be immutable would probably be way too big an ask, and a breaking change.

Combined with the race conditions around asynchronous programming, and the fact that copying big binary blobs will always be problematic, I feel this needs a more... drastic change to fix.

The thing I mostly think of is generators. If the body is given by a generator (or a generator-like object), the body can be read from it once and sent. There are two problems (apart from the current lack of generator support in goja):

  1. Generators are one-shot, and for various reasons (redirects, and GOAWAY in HTTP/2) k6 might need to start reading the body from the beginning again, so it needs to be something different from a pure JS generator :(.
  2. If it's a pure JS generator, the event loop needs to be invoked for each read from it, which might be problematic.

1 can be fixed by having a custom BodyGenerator type with both next() and reset() methods 🤷‍♂. 2 can be fixed by having custom in-Go types that wrap immutable types in ways that will be more optimal.
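A minimal sketch of what such a BodyGenerator could look like from the JS side (hypothetical names; plain Node.js, no k6 or goja specifics): generator-shaped next(), plus an explicit reset() so the body can be replayed for a redirect or GOAWAY, where a one-shot generator would already be spent.

```javascript
// Hypothetical BodyGenerator: iterates a buffer in chunks and can be rewound.
function makeBodyGenerator(buffer, chunkSize) {
  const bytes = new Uint8Array(buffer);
  let pos = 0;
  return {
    next() {
      if (pos >= bytes.length) return { done: true, value: undefined };
      const value = bytes.subarray(pos, pos + chunkSize); // a view, not a copy
      pos += value.length;
      return { done: false, value };
    },
    reset() { pos = 0; }, // "rewind" so the same body can be sent again
  };
}

const gen = makeBodyGenerator(new ArrayBuffer(10), 4);
let chunks = 0;
while (!gen.next().done) chunks++;
console.log(chunks); // 3 (4 + 4 + 2 bytes)
gen.reset();
console.log(gen.next().done); // false again after reset
```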

Along the same vein (of bigger changes), we could require that the body is written with explicit, blocking write() calls, possibly combined with the above.

This likely needs more investigation into how it works in other APIs, such as in Node.js and browsers, but I expect it might be less of a problem there, as there won't be hundreds if not thousands of JS VMs doing uploads at the same time. So for them, something like copying might be a lot more reasonable, with a "streaming" variant only for compliance.

Things I have not touched on:

In my opinion, while the changes proposed in PR #2303 will have some beneficial effect on the situation with uploading big files in k6, they will not address the underlying problem: the k6 HTTP module was not designed with uploads in mind (or so it seems to me).

edit: Given the above problem, where even for a static FormData a pure SharedArrayBuffer will not be sufficient, and since I would expect there are even more deficiencies with it, I feel even more strongly that just adding SharedArrayBuffer will have negligible positive effect, but will warp any future change into trying to work with it instead of actually coming up with a better solution.

I would really prefer if, instead of making some changes which may or may not have to be advised against in the future (as we might move to something different entirely), a redesign of the HTTP module were considered, so that uploading anything is possible in a way that is both safe and performant. Hopefully, it would also be fairly obvious how to use the API in a way that keeps this true, instead of requiring jumping through many hoops 🤷‍♂

Relevant issues open at the time:

Some of those are only tangentially about uploads (and there are more around the current k6 multipart support for "object" bodies).

codebien commented 2 years ago

Scope

We want to identify how to pass and process an in-memory Reader (binary or encoded) through the HTTP API, minimizing memory usage in terms of copies replicated across VUs and per single iteration, while maintaining concurrency-safe logic across VUs and the event loop.

Even though they are complementary and should be considered from a higher level, figuring out how and what the best solution is for the following points is not in the scope of this issue:

Goals

Upload types to support

@MStoykov I tried to summarise what we are trying to resolve in this issue. Do you think it is accurate enough?

About others

As you mentioned, they were probably designed without memory sharing in mind, so I don't know how much they can help. Do we have more relevant ones to mention and check?

API changes proposal

The thing I mostly think of is generators. If the body is given by a generator (or a generator-like object), the body can be read from it once and sent. There are two problems (apart from the current lack of generator support in goja)

@MStoykov Should it continue to use the binFile returned from the open function (const binFile = open('/path/to/file.bin', 'b'); // line A)? If not, can you provide more details about your vision here, please?

2 can be fixed by having custom in-Go types that wrap immutable types in ways that will be more optimal

Do you mean string and ArrayBuffer? Should open, or an equivalent function, return the wrapped types directly?

As an alternative to BodyGenerator, I was thinking we could have something similar to a ReadableStream (or anything better, if available in JS), with an internal type of:

```go
type StreamType struct {
    chunks []struct {
        buf  ArrayBuffer
        lock sync.RWMutex
    }
}
```

It should allow us to lock on a per-chunk basis, so we could potentially accept writes.

So, as a summary of the changes that would be required:

mstoykov commented 2 years ago

Scope:

As I tried to convey in the preface of the original post, the most important part of this issue is actually discussing the problems and figuring out which ones we want fixed; then we can talk about solutions :).

I would expect we want to fix all of them, and I would argue it's better if we have an abstraction that fixes all of them well enough.

@MStoykov I tried to summarise what we are trying to resolve in this issue. Do you think it is accurate enough?

This seems more like searching for a solution, while I am not certain, due to the lack of comments on this, that we have agreed on my assessment of what the problems are and what we want fixed.

Arguably, we might have 10 problems but want to prioritize fixing 7 of them now and fix the remaining 3 later (the numbers are just an example). In that case, we will likely take a different approach to developing this than if we want to fix all of the identified problems.

The whole post was provoked by work around these issues which, IMO, somewhat fixes corner cases but not the underlying problem. The goal was to first discuss the problems and then search for a solution.

About others:

As you mentioned, they are probably designed without memory sharing in mind so I don't know how much they can help. Do we have more relevant to mention and check?

At the time, I did not try to figure out a solution; as I said before, I was trying to identify problems. There probably are good APIs, and maybe even those are good enough. Whether we decide to implement them exactly, just be inspired by them, or completely ignore them is a different question, and one I currently don't have an opinion on.

API changes proposal

@MStoykov Should it continue to use the binFile returned from the open function (const binFile = open('/path/to/file.bin', 'b'); // line A)? If not, can you provide more details about your vision here, please?

Again, I don't know. IMO, my main point is that if working with request bodies (but arguably also response bodies, which will be even more involved, IMO) requires that they are loaded into memory in full, there will be memory problems.

Go's io.Reader is designed (with all of its additional interfaces) in a way that makes more or less this exact thing a non-problem: you don't need the whole body, you only need parts of it at a time in order to send it.

Go's http.Request specifically also has a GetBody helper function field, which is used in case the body is needed again, as is the case with redirects or HTTP/2's GOAWAY. This is something k6 currently supports, as I explained in the previous post, so our next solution will likely have to support it in some way as well.

Ideally, while we have built-in solutions for simple(r) problems such as:

  1. uploading one file
  2. uploading multipart form data
  3. and so on

it would be nice if a JS script could also use those abstractions, so we will likely need an abstraction on top of that. This is where generators came into my thinking, as they fit at least the first part. The second part will probably require a function that returns a generator, as happens in the other case.
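The "function that returns a generator" idea can be sketched as follows (hypothetical, plain Node.js, not a k6 API). It is analogous to Go's http.Request.GetBody: whenever the body must be resent (redirect, GOAWAY), the factory is simply called again for a fresh one-shot generator.

```javascript
// One-shot generator over a buffer, served in fixed-size chunks.
function* bodyChunks(bytes, chunkSize) {
  for (let pos = 0; pos < bytes.length; pos += chunkSize) {
    yield bytes.subarray(pos, pos + chunkSize);
  }
}

const payload = new Uint8Array(100);
const getBody = () => bodyChunks(payload, 32); // the replayable factory

// The first attempt consumes one generator; a redirect just asks for a new one.
const firstAttempt = [...getBody()].length;
const afterRedirect = [...getBody()].length;
console.log(firstAttempt, afterRedirect); // 4 4
```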

Also of note here is that the "we will need the body again" problem, while important, likely isn't relevant for every API where this will be needed, so we might want to require it only in some places :).

Do you mean string and ArrayBuffer? Should the open or equivalent function return directly the wrapped types?

What I meant by

2 can be fixed by having custom in-Go types that wrap immutable types in ways that will be more optimal

Here I was talking about the fact that if you have a generator, then to get the next part you need to execute JS code, and since we are in async code, we now need to queue this generator.next() call on the event loop. But for the simple(r) cases (listed above), where the "generator" is actually something k6 has created internally rather than something a user needed to write in JS code, we can probably have a version where this is not needed.

Arguably, it would be nice if a user could create this in JS code and have it still be partially optimized. In Go there are a bunch of faster interfaces than io.Reader or io.Writer that can be implemented, which make the whole process of copying easier; io.Copy, for example, will use those. At the more extreme end, through the use of those (at least on Linux) you can get to using the sendfile syscall to write directly from disk to a network socket. Is this a requirement for the potential solution? Likely not, but it would be really nice if it were possible, IMO. So if someone decides to implement (or we do) FormData in JS with the new "streamable" API, they could at least use sendfile on the files. This, though, likely means even more design choices and research on the API.

Given just the above few thoughts off the top of my head, I would argue that coming to a solution will take a while and will likely require a lot of changes. I haven't even touched on the fact that httpext will likely need to grow a lot, or that the k6/http API will likely need significant changes as well.

Arguably, we should design it with a lot of those ideas in mind, but start by implementing only a small part of it. But again, first we need to agree on the issues and which ones we think are worth fixing. To this end, it's probably better to restructure them as a list with bigger examples :thinking:. I did write this issue somewhat in a hurry :(.

p.s. The final example you are giving seems to still require a copy of the data in memory, and that is what I am mostly against. But I think we will need much bigger and more thought-out examples going forward, because I am already pretty sure the above (my writing) is hard to follow without concrete examples.

mstoykov commented 2 years ago

There seems to be a somewhat standard Streams API, which appears to do roughly what I was mentioning above. Parts(?) of it seem to be experimental, and I don't know how widely used it is. But it is basically what the fetch API uses, so probably at least parts of it are fairly well understood and stable.
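For reference, the pull-based shape of that Streams API looks like the following sketch (runnable in Node.js 18+, where ReadableStream is a global; chunk sizes are arbitrary): the consumer pulls chunks on demand, which matches the "read parts of the body at a time" model discussed above.

```javascript
// A pull-based ReadableStream serving an upload body chunk by chunk.
const source = new Uint8Array(10).fill(1);
let pos = 0;

const bodyStream = new ReadableStream({
  pull(controller) {
    if (pos >= source.length) return controller.close();
    controller.enqueue(source.subarray(pos, pos + 4)); // 4-byte chunks
    pos += 4;
  },
});

// Consume the stream the way an HTTP transport would.
async function drain(stream) {
  const reader = stream.getReader();
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return total;
    total += value.length;
  }
}

drain(bodyStream).then((total) => console.log(total)); // 10
```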

immavalls commented 1 year ago

Adding this possibly related forum thread for reference: https://community.k6.io/t/getting-error-unexpected-http-error-from-https-ingest-k6-io-v1-archive-upload-413-request-entity-too-large-when-running-test-on-k6-cloud-with-json-files-and-160-users/5983