grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Better support for binary data #1020

Closed: na-- closed this issue 3 years ago

na-- commented 5 years ago

We don't currently have a good way to deal with binary data on the JS side :disappointed:. We rely on []byte from the Go code, which translates to int arrays in JS, which is far from optimal. These are the places where we currently have binary data that I can think of:

We need a native analogue of the node.js Buffer, or better yet - to properly support ES ArrayBuffer objects. We actually have ArrayBuffer, the code below compiles and runs as you'd expect:

export default function () {
    var buffer = new ArrayBuffer(16);

    // Create a couple of views
    var view1 = new DataView(buffer);
    var view2 = new DataView(buffer, 12, 4); //from byte 12 for the next 4 bytes
    view1.setInt8(12, 42); // put 42 in slot 12

    console.log(view2.getInt8(0));

    var z = new Int32Array(buffer, 0, 4);
    console.log(z);
};

but I'm not sure if this ArrayBuffer comes from the bundled core.js or from goja. It seems like the ArrayBuffer is from goja (though I wouldn't rely on it very much), but the DataView and Int32Array seem to be from core.js :confused:

So we need to investigate exactly what the situation is and what we can use to handle binary data sensibly...

Related issues: https://github.com/loadimpact/k6/issues/856, https://github.com/loadimpact/k6/issues/874, https://github.com/loadimpact/k6/issues/873

na-- commented 5 years ago

A user found out that the current websocket API lacks support for binary messages: https://community.k6.io/t/converting-audio-file-into-bytes-and-send-over-web-socket/272/

But in order to fix the websocket code, it would be best if we first standardized how we support binary data in general, i.e. this issue... So, because of that and all of the other dependent issues, I'm upping the priority of this issue.

imiric commented 4 years ago

I took a brief look at this, and the current ArrayBuffer implementation comes from core.js, as do DataView and all typed arrays. The Goja implementation is actually disabled, and doesn't seem to have ever been used. (You can now confirm this easily with --compatibility-mode base. ;)

As suggested in #420 and dop251/goja#51, the way forward seems to be having native support in Goja, which would mean resurrecting the currently unused ArrayBuffer implementation, and ensuring it works transparently with core.js polyfills.

It doesn't seem like a gargantuan amount of work, but I'm a bit out of my depths here, as Goja internals are quite complex. I can give this a shot if you agree with the approach, let me know.

caseylucas commented 4 years ago

I found this issue after needing to post some binary data. I ended up making a change to k6 that would allow me to use Uint8Array in the JS code, but it's a little hacky. Just wanted to share here in case anyone else needs it or it sparks any other ideas. Basically, I just look at the body parameter and, if it quacks like a Uint8Array, I call the goja Get func to get all the bytes and put them in a []byte.

diff --git a/js/modules/k6/http/request.go b/js/modules/k6/http/request.go
index d448c3a0..07fec21f 100644
--- a/js/modules/k6/http/request.go
+++ b/js/modules/k6/http/request.go
@@ -29,6 +29,7 @@ import (
        "net/textproto"
        "net/url"
        "reflect"
+       "strconv"
        "strings"
        "sync"
        "time"
@@ -97,7 +98,24 @@ func (h *HTTP) Request(ctx context.Context, method string, url goja.Value, args
        var params goja.Value

        if len(args) > 0 {
-               body = args[0].Export()
+               // check to see if body param looks like a Uint8Array and if so, copy bytes out of it into a []byte.
+               if obj, ok := args[0].(*goja.Object); ok {
+                       bytesPerElementVal := obj.Get("BYTES_PER_ELEMENT")
+                       byteLengthVal := obj.Get("byteLength")
+                       if bytesPerElementVal != nil &&
+                               bytesPerElementVal.ToInteger() == 1 &&
+                               byteLengthVal != nil {
+                               byteLength := byteLengthVal.ToInteger()
+                               byteBuf := make([]byte, byteLength)
+                               for i := int64(0); i < byteLength; i++ {
+                                       byteBuf[i] = (byte)(obj.Get(strconv.FormatInt(i, 10)).ToInteger() & 0xff)
+                               }
+                               body = byteBuf
+                       }
+               }
+               if body == nil {
+                       body = args[0].Export()
+               }
        }
        if len(args) > 1 {
                params = args[1]

With this change, you can use it like this:

      const binaryBuf = new Uint8Array(4);
      binaryBuf[0] = 0;
      binaryBuf[1] = 1;
      binaryBuf[2] = 2;
      binaryBuf[3] = 3;
      http.post(
        someURL,
        binaryBuf,
      );

caseylucas commented 4 years ago

@imiric Did you have any luck with the goja ArrayBuffer implementation? Is this still the preferred strategy vs something like I did above? Need any help?

imiric commented 4 years ago

@caseylucas No, I didn't make much progress with exposing the native ArrayBuffer. Our top priority right now is getting #1007 merged, so while this issue is high on the priority list, it's unlikely to be worked on for a few weeks at least.

Your approach looks interesting, though I'm curious about the performance, and it should probably be done outside of HTTP.Request, as k6 would benefit from a more generic and lower-level implementation. Pull requests are always welcome, so if you think this approach could work for the other issues mentioned here (open(), crypto, etc.), feel free to contribute. :)

sirianni commented 4 years ago

Is there a workaround that can be used currently, even if it is inefficient? I am using protobufjs, which encodes protobuf messages as a Uint8Array. When I try to pass this as a request body I get the error below. It seems that the Go http code has support for []byte, but I can't figure out how to pass that from the JavaScript side.

ERRO[0000] GoError: unknown request body type []interface {}
    at native
na-- commented 4 years ago

Prompted by this forum topic, it might not be a bad idea if we also support "generator functions" when we have binary data. So, instead of http.post(url, someBigChunkOfData), users should be able to use http.post(url, someJSFunctionThatGeneratesDataOnDemand)...
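Such a generator-based body is hypothetical (k6 has no such API), but a plain-JS sketch of producing and draining on-demand chunks could look like this:

```javascript
// Sketch only: http.post(url, generatorFn) is a hypothetical API,
// not something k6 supports today.
function* bodyChunks() {
  for (let i = 0; i < 3; i++) {
    // Each 4-byte chunk is generated only when the consumer asks for it.
    yield new Uint8Array(4).fill(i);
  }
}

// A consumer (e.g. a future http.post) could drain and concatenate the chunks:
function drain(gen) {
  const chunks = Array.from(gen);
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.byteLength, 0));
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}
```

The point is that the consumer pulls chunks only when it needs them, so a huge body never has to sit fully materialized in memory (the final concatenation here is just for illustration).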

mstoykov commented 3 years ago

Things I found out while trying to use TypedArrays to fix #1571 / #1382:

  1. TypedArrays can't extend their length... you can only make a new array with more length and copy the old one into it. This is probably okay for some use cases; the node.js Buffer seems to have the same limitation, though there is probably some way to avoid constantly copying bytes around. We will probably need to implement something that you can append (at least) byte arrays to, and at the end either use it as a body or export it once to a TypedArray.
  2. TypedArrays aren't recognized by http.request and are instead exported as an empty object (map[string]interface{}). Given the current goja exported types, this will probably require hacks such as the one in https://github.com/loadimpact/k6/issues/1020#issuecomment-558268847 to detect that we have a TypedArray and get its buffer, which is an ArrayBuffer.
  3. ArrayBuffers can be exported by goja in a fairly type-safe way, and we can get the bytes out of them. Using one currently still leads to an unknown type error from http.request, but that is fairly easy to fix.
  4. open returns []byte, which becomes an objectGoSliceReflect. This object can be modified, but because we don't export *[]byte it can't be extended :(, so it also can't be used to solve the above issues by just appending data to it. And even if you allocate one big enough to only change its contents and then slice it, the result won't stay a []byte but become []interface{}, and then you get GoError: unknown request body type []interface {}, so even that hack won't work :(.

IMO we should concentrate on fixing 3 + 2 for v0.30.0 (this will probably take more time to test than to actually write) and possibly 1 for v0.31.0. 4 should IMO be rewritten to have both a read-only and a writable response (the read-only one should not be copied for each VU) and should probably return an ArrayBuffer.

This might also need changes in goja, or at least more APIs, so I would argue that any kind of issue (e.g. detecting TypedArrays) should be reported upstream, as the maintainer has always been willing to help, if not with actual code changes then at least with better documentation of the aspects they know.

na-- commented 3 years ago

Why do we need to fix 2? :confused: I don't think we need to have a way to recognize and work with a TypedArray, i.e. Int8Array, Uint8Array, Uint8ClampedArray, Int16Array, Uint16Array, Int32Array, Uint32Array, Float32Array, Float64Array, BigInt64Array, BigUint64Array. These are all different means to work with the underlying binary data, and I think k6 should only care about that binary data, i.e. the underlying ArrayBuffer that users can easily reference with TypedArray.prototype.buffer.

So, I think everything in k6 that works with binary data should just recognize 3 things - string, Go []byte (for backwards compatibility) and ArrayBuffer, though the HTTP request body handling will probably have to also keep some of the current magic for URL-encoding data as well... Everything else should probably throw an exception - we shouldn't try to guess that something is a typed array or some other obscure data structure.

And regarding the extension of such arrays - I may be missing something, but I don't see a reason why we should do anything special in k6? Why doesn't this suffice: https://stackoverflow.com/questions/18600895/resize-arraybuffer
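The linked answer boils down to allocating a bigger buffer and copying the old bytes over, roughly:

```javascript
// ArrayBuffers are fixed-size, so "resizing" means allocating a new
// buffer and copying the old contents into it via typed-array views.
function growArrayBuffer(buf, newByteLength) {
  const dst = new Uint8Array(newByteLength);
  dst.set(new Uint8Array(buf)); // copy the old bytes to the front
  return dst.buffer;
}
```

Each call copies all existing bytes, which is the cost of building a large body by repeated appends.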

mstoykov commented 3 years ago

The problem with 2 is that

var s = new Uint8Array(20);
// do something
http.post(url, s);

will just... not work. More accurately, it will (currently) send an empty body. In order for it to work, either:

  1. the user has to use s.buffer without any feedback, possibly just because they read in the documentation that they need to, but k6 won't actually tell them it isn't doing what they intended; or
  2. k6 can know that if it gets a TypedArray it should just take its .buffer. For k6 to give any kind of feedback here we would likely need the same detection anyway, so I am for just making it work, without requiring extra documentation or a specific exception like "you should call .buffer on this TypedArray instead of giving us the TypedArray"...
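One caveat worth noting about the .buffer suggestion (my observation, not from the discussion above): for a typed array that views only part of a larger buffer, .buffer is the whole backing store, not the viewed slice:

```javascript
const backing = new ArrayBuffer(8);
const view = new Uint8Array(backing, 2, 4); // a 4-byte window starting at byte 2

// view.buffer is the full 8-byte backing store, so naively posting
// view.buffer would send more bytes than the view covers.
const whole = view.buffer;

// The slice the view actually covers has to be cut out explicitly:
const exact = view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
```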

> And regarding the extension of such arrays - I may be missing something, but I don't see a reason why we should do anything special in k6? Why doesn't this suffice: https://stackoverflow.com/questions/18600895/resize-arraybuffer

That works... but it means that if you want to make a request with form-data out of 10 files (for example), you will copy a hell of a lot of bytes around each time... for no good reason. Also, again, I propose that whatever solution we have for this comes after we fix 3 and 2, as without those it totally doesn't matter.

Also, as I have mentioned (at least internally), k6 currently does an unreasonable amount of copying when it uploads files, which is something that should probably be looked into when implementing any kind of Buffer-like structure that won't just copy bytes around while it's being constructed.

na-- commented 3 years ago
> 1. the user has to use s.buffer without any feedback, possibly just because they read in the documentation that they need to, but k6 won't actually tell them it isn't doing what they intended; or

I don't see the problem here - the error message can say that only string and ArrayBuffer types are accepted as the request body, and the error code can point to a docs page with examples of how various typed arrays can be used.

> That works... but it means that if you want to make a request with form-data out of 10 files (for example), you will copy a hell of a lot of bytes around each time... for no good reason. Also, again, I propose that whatever solution we have for this comes after we fix 3 and 2, as without those it totally doesn't matter.

I may be missing something, but even if you extend a Go slice, you still have to copy the actual old slice data, so I'm not sure how that's different :confused: And, again, at this point I don't see a need to implement 2 (recognition and direct support of TypedArray).

> Also, as I have mentioned (at least internally), k6 currently does an unreasonable amount of copying when it uploads files, which is something that should probably be looked into when implementing any kind of Buffer-like structure that won't just copy bytes around while it's being constructed.

This I agree has to be diagnosed and fixed.

na-- commented 3 years ago

Here's a request for supporting ArrayBuffers in the encoding built-in k6 module: https://community.k6.io/t/base64-encode-arraybuffer/1152

mstoykov commented 3 years ago

I don't think I will have time for anything but the most MVP of PRs for v0.30.0... and to be honest, there is probably more work in documenting and testing it than anything else.

imiric commented 3 years ago

As of 41275ab5, k6 accepts ArrayBuffer as an argument to all functions that previously accepted []byte in the k6/crypto, k6/encoding, and k6/http modules.

The leftover parts of this issue are the potentially backwards incompatible changes we should discuss:

We should make a decision on whether we want to make breaking changes to support these, or if it would be better to add ArrayBuffer support with a new argument where possible.

The open() and binary HTTP response bodies are the two most common usages of []byte, and I reckon that the vast majority of users won't notice a difference if they're just passing the value to e.g. http.post() or crypto.sha256(). They'll run into issues if they were accessing or modifying the value by index, in which case they'll have to do so via Typed Arrays instead.
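For anyone migrating such a script, indexed access just needs a typed-array view over the returned buffer. A minimal sketch (open(path, 'b') returning an ArrayBuffer is the proposal here, so the buffer below is simply simulated):

```javascript
// Simulate a value that used to be []byte-backed but is now an ArrayBuffer
// (e.g. what open(path, 'b') would return after the proposed change).
const buf = new ArrayBuffer(4);

// Index access and mutation now go through a typed-array view:
const bytes = new Uint8Array(buf);
bytes[0] = 0xde;
bytes[1] = 0xad;

// Views share the underlying buffer, so the change is visible
// through any other view over it.
const again = new Uint8Array(buf);
```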

For crypto.randomBytes() I don't think there's a backwards-compatible path we can take, save for forcing a new optional argument that we'd deprecate soon anyway. But it's likely much less widely used, and unless the result was being accessed by index, it's the same situation as above.

We could make the encoding change backwards compatible with a new argument, since it's likely that the current string-only usage was sufficient for most users, even though it should be possible to return both an encoded and decoded ArrayBuffer value.

So maybe the API should be:

encoding.b64encode(input, [encoding], [format])
encoding.b64decode(input, [encoding], [format])

Where format could be 'b' to return an ArrayBuffer, or empty/unspecified to return a string as it does now, keeping it consistent with the open() argument.
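A plain-JS model of that decode signature (the 'b' flag is the proposal above, not a shipped k6 API; Node's Buffer stands in for k6's internals, and only standard base64 is handled):

```javascript
// Hypothetical model of the proposed encoding.b64decode(input, encoding, format).
function b64decode(input, _variant, format) {
  // _variant would select the base64 variant (std, rawstd, url, ...) in the
  // real proposal; this model only handles standard base64.
  const bytes = Buffer.from(input, 'base64');
  if (format === 'b') {
    // Cut out exactly the decoded bytes as a standalone ArrayBuffer.
    return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
  }
  return bytes.toString('latin1'); // one byte per character, like a plain string result
}
```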

So, tl;dr: my vote is for making a breaking change for open(), binary HTTP response bodies and crypto.randomBytes(), and a non-breaking change for encoding.

WYT?

na-- commented 3 years ago

> So, tl;dr: my vote is for making a breaking change for open(), binary HTTP response bodies and crypto.randomBytes(), and a non-breaking change for encoding.

:+1: I think I share your preferences. I also implore anyone in the k6 community who is watching the issue to comment if they agree or disagree, since it's looking likely that we'll make a bunch of minor breaking changes in k6 v0.31.0.

na-- commented 3 years ago

Actually, do we need a new format parameter for encoding.b64encode()? Isn't the idea of base64 to transform potentially binary data to a safe string? So we can just type assert the input for ArrayBuffer or string, but we probably should always return a string.

But yeah, encoding.b64decode() probably should have a format parameter, so users can specify whether they want string or ArrayBuffer as the result. Though with string being the default, I imagine some users will probably get mangled binary data, not realizing that the conversion of binary data to JS's UTF-16 may be lossy... Always returning ArrayBuffer will be somewhat bad UX though, despite the manual conversion to string being a one-liner: String.fromCharCode.apply(null, new Uint16Array(buf)) :man_shrugging:
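Spelling that one-liner out (note it reads the buffer as UTF-16 code units, which is exactly why arbitrary binary data can come out mangled):

```javascript
// Build a 4-byte buffer holding two UTF-16 code units: 'H' (72) and 'i' (105).
const buf = new ArrayBuffer(4);
const units = new Uint16Array(buf);
units[0] = 72;
units[1] = 105;

// The manual ArrayBuffer-to-string conversion from the comment above:
const s = String.fromCharCode.apply(null, new Uint16Array(buf));
```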

mstoykov commented 3 years ago

> Where format could be 'b' to return an ArrayBuffer, or empty/unspecified to return a string as it does now, keeping it consistent with the open() argument.

I don't think it makes sense for b64decode to return a string by default. In general, base64 is used to transport binary data as a string, so I am for it returning an ArrayBuffer by default. I am fine with it having an option to return a string, or with the one-liner being well documented.

I would expect a lot more users to have the problem of mangled binary decoded as a string than to need to do anything with the decoded data other than send it or compare it.

imiric commented 3 years ago

> Actually, do we need a new format parameter for encoding.b64encode()? Isn't the idea of base64 to transform potentially binary data to a safe string?

That's the main use case, yes, but the main Encode() function in encoding/base64 works with []byte, and EncodeToString() is just a wrapper around it. So presumably someone might want to work with the binary encoded data and stringify it when done, though that's probably a very rare use case. So I'm fine with not introducing format to b64encode(), though the discrepancy of adding it only to b64decode() bothers me a bit. :smile:

I agree that the default return value from b64decode() should be binary, but breaking that might impact more users than open(..., 'b'), since they could be relying on string to concatenate the result elsewhere, or do other string manipulation, which wouldn't be common if the script was previously expecting []byte and now it receives ArrayBuffer.

But I can go ahead with the change and we can decide before merging the PR.