Open mstoykov opened 2 years ago
We want to identify how to pass and process an in-memory Reader (binary or encoded) to the HTTP API for minimizing the percentage of used memory in terms of replicated copies across VUs and per single iteration and maintaining a concurrent-safe logic across VUs and EventLoop.
Even if they are complementary and should be considered from a higher level the following points are not in the scope of this issue to understand how and what is the best solution for:
@MStoykov I tried to summarise what we are trying to resolve in this issue. Do you think is it accurate enough?
As you mentioned, they are probably designed without memory sharing in mind so I don't know how much they can help. Do we have more relevant to mention and check?
The thing I mostly think of is generators. If the body is a given by a generator (or a generator like object), the body can be read once from it and send it. There is two problems (apart from no generator support in goja currently)
@MStoykov Should it continue to use the binFile
returned from the open
function (const binFile = open('/path/to/file.bin', 'b'); // line A
)? If not, can you provide more details about your vision here, please?
2 can be fixed with having custom in go types that wrap immutable types in ways that it will be more optimal
Do you mean string
and ArrayBuffer
? Should the open
or equivalent function return directly the wrapped types?
As an alternative of BodyGenerator
, I was thinking if we could have something similar to a ReadableStream
(or any better if available in JS) with an internal type of:
type StreamType struct {
chunks []struct{ArrayBuffer, RWLock}
}
It should allow us to lock on chunk-basis so we could potentially accept writes.
So, as a summary of the changes that would be required:
http.post
-like methodBodyGenerator
type or an alternativeAs I tried to preface the original post the most important part of this issue is actually discussing the problems and figuring out which are the ones we want fixed, and then we can talk about solutions :).
I would expect we want to fix all of them and I would argue it's better if we have an abstraction that fixes all of them well enough.
@MStoykov I tried to summarise what we are trying to resolve in this issue. Do you think is it accurate enough?
This seems more like a solution searching, while I am not certain, due to lack of comments on this, that we have agreed in my assessment of what the problems are and what we want fixed.
Arguably we might have 10 problems but want to prioritize fixing 7 of them now and fix the remaining 3 later (numbers are just for example). In that case we likely will take a different way of developing this than if we want to fix all of the identified problems.
The whole post was provoked by work around these issues that IMO fixes corner cases somewhat but not the underlying problem. So that we can then discuss the problems and then search for a solution.
As you mentioned, they are probably designed without memory sharing in mind so I don't know how much they can help. Do we have more relevant to mention and check?
At the time I did not try to figure out a solution as I said before - I was trying to identify problems. There probably are good APIs, maybe even those are good enough. Whether we decided to implement them exactly or just be inspired by them or completely ignore them is a different question. One that I currently don't have an opinion on.
@MStoykov Should it continue to use the binFile returned from the open function (const binFile = open('/path/to/file.bin', 'b'); // line A)? If not, can you provide more details about your vision here, please?
I don't know again. IMO my main point is that if the work with request bodies (but arguably also response bodies which will be even more involved IMO) require that they are loaded in memory in full, there will be memory problems.
Go's io.Reader is designed (with all of it's additional interfaces) in a way to make more or less this exact thing not a problem - you don't need the whole body you need only parts of it at a time to send it.
The http.Request specifically also has GetBody
helper function field which is used in case the body is needed again as is the case in redirects or HTTP/2's GoAway. This is something k6 currently supports as I explained in the previous post. So likely our next solution will have to support it in some way as well.
Ideally while we have built-in solutions for simple(r) problems such as:
Also, of note here is that the "we will need the body again" while important problem likely isn't a problem for each API where this will be needed, so we might want to only require it in some places :).
Do you mean string and ArrayBuffer? Should the open or equivalent function return directly the wrapped types?
What I meant by
2 can be fixed with having custom in go types that wrap immutable types in ways that it will be more optimal
Here I was talking about the fact that if you have a generator, to get the next part you need to execute js code and we are in an async code we now need to queue this generator.next()
call on the event loop. But for simple(r) cases (listed above) where the "generator" is actually something k6 has created internally not something a user needed to write in js code, we probably can have a version where this is not needed.
Arguably it will be nice if a user can create this in js code, and it can be optimized partially. In go there are a bunch of faster interfaces that io.Reader
or io.Writer
can implement that then make the whole process of copying that easier. For example io.Copy
will use those. On an even more extreme part through the usage of those (at least on linux) you can get to using the sendfile syscall to write directly from disk to a network socket. Is this a requirement for the potential solution? Likely not but it will be really nice if it's possible IMO. So if someone decides to implement (or we do) FormData in js with the new "streamable" API they can at least use sendfile
on the files. This though likely means even more design choices and research on the API.
Given just the above few thoughts from the top of my head, I would argue that coming to a solution will take a while and likely will require a lot of changes. I haven't even touched on the fact that httpext likely will need to grow a lot. Or that k6/http
API likely will need significant changes as well.
Arguably we should design it with a lot of those ideas in mind, but start implementing only small part of it. But again first we need to agree on the issues and which ones we think are worth fixing. To this end it's probably better to restructure them as a list with bigger examples :thinking: . I did write this issue somewhat in a hurry :(.
p.s. the final example you are giving, seems to still require a copy of the data in memory? so that is what I am kind of against mostly. But I think we will need a lot bigger and more thought out examples going forward, because I am already pretty sure the above (writing of mine) is hard to follow without concrete examples.
There seem to be somewhat standard Streams API which seems to be doing somewhat what I was mentioning above. Parts(?) of it seem to be experimental and I don't know how widely used it is. But is basically what is used in the fetch api so probably at least parts of it are fairly well understood and stable.
This will discuss deficiencies around
net/http
, but in reality, some of those apply to other APIs as well.The idea is mostly to:
But let's first get a fairly simple script
Given this very simple example currently(v0.35.0) we have
file.bin
created for each VU on line A. This can be addressed with 1931 and PR2303binFile
is a mutatable ArrayBuffer it technically can lead to a possibility to edit the underlying data afterdata
is created. And ifhttp.post
was asynchronous(and 3 is fixed) possibly even a data race. Even if it was copying and if there wasn't dynamic data this could've been done once in the init context.data
gets "marshaled" into multipart body effectively copying the body.res
also has a fieldrequest
which has a fieldbody
which is a string dump of the body that was created above.So for this particular script, there are 1,3 and 4 make copies of potentially big amount of bytes.
Let's go with a similar but slightly different script
Here we have the same exact thing except that instead of 3. happening inside k6 it happens in
FormData
's.body()
call which is in javascript. This also happens to be the currently advised way to do multipart requests due to a number of issues.Arguably just 1931 can help with Line A and
as long as the body doesn't need something dynamic that also lets us doedit: turns out that I am wrong as at least in the case of FormData you also need the boundary which then requiresNew SharedArrayBuffer(name,func() {return arrayBUffer})
can be used to make the whole body in the init context.SharedArrayBuffer
to return somehow a buffer and a string (the boundary). While this is doable, I think this now means that we make it extremely specific, so arguably now we need to make it more general. I guess (haven't checked) we can have the FormData take in the boundary and for it to be a constant for each VU 🤷♂, but this now seems like we are just fixing a hack with a hack IMO.This still means that on each request that uploads we make a string copy of the whole body. This body also is IME never used actually. In reality, the only case where it's even relevant is in the first example as it can be used to see what actually k6 sent.
This currently proposed solution is to just reference the original value in all cases but the one where we generate it. This though in practice again falls short by a lot in the case where the values are dynamic and this needs to be generated.
All of this also isn't really thought through from the perspective of having asynchronous calls as well. With asynchronous operations there is no problem to have:
In this case there are multiple race conditions points where if
someBody
gets modified not only can it race with any of the post being done, butresponse.request.body
will not actually be accurate.I really can't figure out any way for
response.request.body
to be accurate in cases where we don't want to copy it and it's modifiable. Requesting allbody
arguments tohttp.*
to be immutable will probably be way too big of an ask and a breaking change.This combined with the race conditions around asynchronous programming and also that copying big binary blobs will be problematic forever I kind of feel this needs more ... drastic change to fix.
The thing I mostly think of is generators. If the body is given by a generator (or a generator-like object), the body can be read once from it and sent it. There are two problems(apart from no generator support in goja currently):
1
can be fixed by having the custom typeBodyGenerator
with bothnext()
and areset()
methods 🤷♂2
can be fixed by having custom in-go types that wrap immutable types in ways that will be more optimalAlong the same vein (of bigger changes) we can require that the body is written with explicit
write()
calls which are blocking or combined with the above.This likely needs more investigation to see how it works in other APIs such as in nodejs and browsers, but I expect that there this might be less of a problem as there won't be hundreds if not thousands of js VMs doing uploads at the same time. So for them, something like copying might be a lot more reasonable and only have a "streaming" variant for compliance.
Things I have not touched on:
http.*
will mess up any outside fixes.In my opinion, while the changes proposed in PR 2303 will have some beneficial effect to the situation with uploading big files in k6, they will not address the underlying problem. Which is that the k6 HTTP module was not designed with uploads in mind (or so it seems to me).
edit: Due to the above problem where even for static FormData a pure
SharedArrayBuffer
will not be sufficient and that ... I would expect there are even more deficiencies with it, I feel even more strongly that just addingSharedArrayBuffer
will have negligible positive effects but will warp any future change to try to work with it instead of actually coming with a better solution.I would really prefer if instead of making some changes, which may or may not have to be advised against in the future (as we might move to something different entirely), a redesign of the HTTP module is being thought of so that uploading anything is possible in some way that is both safe and performant. Hopefully, also it will be fairly obvious how to use the API in a way that this will be true instead of requiring jumping through many hoops 🤷♂
Relevant open at the time issues:
Some of those are only tangibly about uploads(and there are more around the current in k6 multipart support for "object" bodies).