mstoykov opened 3 years ago
Being able to make a `SharedArray` in VU code would also be very helpful for this use case: https://github.com/k6io/k6/issues/1638#issuecomment-768107236

And I agree returning a `SharedArray` from `setup()` will be quite hard. I'm not sure if that's what we should do, but if we do it, we should probably also whitelist a few other JS types that currently get mangled by the JSON encoding and decoding, like `ArrayBuffer`, typed arrays, built-in k6 types, etc. :man_shrugging:
This forum topic is another example where the first part of this issue would have been useful. With the new JS module APIs, implementing that basically boils down to deleting these 3 lines, right? https://github.com/grafana/k6/blob/c4b88f59189705dfa4d97fec0057e202104d5db1/js/modules/k6/data/data.go#L89-L91

If we remove this (now artificial) restriction, everything else should "just work", I think... :thinking: It should allow users to create a `SharedArray` in one scenario and then use it in another scenario with a later `startTime`. Sort of a delayed init / `setup()` alternative that can make HTTP requests and save the results in a `SharedArray`, so they can be used in other scenarios on the same k6 instance.
There are two potential pitfalls that I can think of:

1. A `SharedArray` in `setup()` will probably currently fail, as mentioned above. The proper solution for this would be fully implementing the second part of the issue above. I personally prefer the proposed approach of a `SharedObject`-like implementation. However, if that is too difficult, as a first iteration I think we can simply implement the `json.Marshaler` interface on the current `SharedArray` object, which would transform `SharedArray` objects returned from `setup()` into normal JS arrays. In the `MarshalJSON` method we can also emit a warning (until we properly implement this) that doing so is probably a bad idea, since it completely negates any benefits of `SharedArray`.
2. A `SharedArray` created during an iteration won't be properly shared among multiple k6 instances (e.g. in a `k6 cloud` or a k6-operator test run). I have some ideas for how to properly solve this in the long term; in the short term we can just emit a warning that explains the problem when a `SharedArray` is created in VU code and the execution segment of the instance is not 100%.

In any case, I think it's worth it to remove the artificial restriction around `SharedArray` usage in VU code. There are certain pitfalls, true, but it will unlock a lot of use cases that k6 currently cannot cover...
> of a SharedObject-like

Does it mean a k6-wrapped object that supports safe concurrency?

> as the first iteration I think we can simply implement the json.Marshaler interface by the current SharedArray object

The `SharedArray` under the hood is a `DynamicArray`; do we have a way of implementing the interface on it, or would it require a wrapper? https://github.com/grafana/k6/blob/c4b88f59189705dfa4d97fec0057e202104d5db1/js/modules/k6/data/share.go#L43-L54

Will the `JSON.stringify` function trigger the warning? Would we have a workaround for it? :thinking:
```javascript
const data = new SharedArray('some name', function () {
    return []; // the callback must return an array
});

export function setup() {
    console.log(JSON.stringify(data)); // I guess it shouldn't emit the warning
}
```
Making the `SharedArray` creatable in the `default` function can also help with that, as only a single VU will call the initialization code... once. It seems like a more distributed-friendly solution; if it doesn't have critical problems, why not encourage users to use it until we have a complete working solution for the `setup()` function?
> Does it mean a k6-wrapped object that supports safe concurrency? The SharedArray under the hood is a DynamicArray, do we have a way for implementing the interface on it or would it require a wrapper?

Yes, something like a `Proxy` that allows for the dynamic sharing of arbitrary nested data, the bulk of which has only one copy in memory; when parts of it are requested, they are copied on demand to the VU that requested them. Basically, a more powerful version of `SharedArray`.
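For illustration only, the copy-on-access idea can be sketched in Go (all names here are hypothetical, not a k6 API): one canonical copy of the data lives in the store, and each read hands the caller its own private copy, so no consumer can mutate the shared state.

```go
package main

import "fmt"

// sharedData is a rough sketch of the "SharedObject" idea: one canonical
// copy of nested data lives in the store, and each access hands the caller
// a fresh copy, created on demand. Hypothetical names, not k6 code.
type sharedData struct {
	canonical map[string][]int // the single in-memory copy
}

// Get returns a fresh copy of the value for the caller to use freely.
func (s *sharedData) Get(key string) []int {
	orig, ok := s.canonical[key]
	if !ok {
		return nil
	}
	cp := make([]int, len(orig))
	copy(cp, orig)
	return cp
}

func main() {
	s := &sharedData{canonical: map[string][]int{"ids": {1, 2, 3}}}
	a := s.Get("ids")
	a[0] = 99 // mutating the copy...
	fmt.Println(s.Get("ids")[0]) // ...does not affect the canonical data: prints 1
}
```

The real `Proxy`-based version would do this lazily per accessed property rather than per whole value, but the memory trade-off is the same: one shared copy plus short-lived per-VU copies.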
> Will the JSON.stringify function trigger the warning? Would we have a workaround for it?

Yes, why not? `JSON.stringify()`-ing the current `SharedArray`s is also probably a mistake, so there's no reason not to print a warning :man_shrugging:
Allowing the initialization of `SharedArray` within the setup context would solve an important use case for my organization. Alternatively, we could also solve it if we were allowed to make HTTP requests within the `SharedArray` function that is run in the init context.

Our use case is that we have a large file (~100MB) that we would like to share across many distributed k8s pods running our test. We were hoping to put it on S3 and download it in the test. However, we find that we are not allowed to fetch the S3 object within the `SharedArray` initialization because HTTP requests are not allowed in the init context. Furthermore, if we download the S3 object in the setup context, we are unable to initialize a `SharedArray`, because this can only be done in the init context. If we put the file contents in the data returned by the `setup` function, our understanding is that memory usage will grow quickly as we increase the number of VUs.
Hi @Perzach! We had the same problem a few months ago and we found a workaround that may work for you as well. Let me explain it a bit:

We have a single repository where each team in my company can write their test scripts. Those scripts and the large files with real data to drive load tests are generated in a GitHub Action and copied to an EFS volume by running a pod. It's a reusable job, so you could use it directly from our repo if you have self-hosted runners. Here is the manifest for the pod that copies test scripts and large files to the volume:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: k6-pvc-copy
spec:
  volumes:
    - name: k6-pvc-copy-storage
      persistentVolumeClaim:
        claimName: k6-runner-tests
  containers:
    - name: k6-pvc-copy-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/test"
          name: k6-pvc-copy-storage
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "100m"
          memory: "100Mi"
```
Then the pods running the test can mount the EFS volume and launch the tests using the large files already there in the volume. Here is an example of what we have (in this case it's a template for Argo Workflows, but the K6 manifest is the same; you just need to change the `inputs`):
```yaml
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-{{`{{inputs.parameters.team}}`}}
spec:
  parallelism: {{`{{inputs.parameters.parallelism}}`}}
  script:
    volumeClaim:
      name: k6-runner-tests
      file: {{`{{inputs.parameters.team}}`}}/test/{{`{{inputs.parameters.scriptName}}`}}
  arguments: --out prometheus=namespace=k6
  ports:
    - containerPort: 5656
      name: metrics
  runner:
    image: ACCOUNT.dkr.ecr.REGION.amazonaws.com/REPO/IMAGE
    resources:
      requests:
        cpu: {{`{{inputs.parameters.cpuRequests}}`}}
        memory: {{`{{inputs.parameters.memoryRequests}}`}}
      limits:
        cpu: {{`{{inputs.parameters.cpuLimits}}`}}
        memory: {{`{{inputs.parameters.memoryLimits}}`}}
```
Don't hesitate to ask for more details. Hope it helps!
Thanks @guillermotti for the feedback. Are you saying that if we use the `script.volumeClaim` syntax to mount the test script, we inherently get access to all the other files that coexist in the relevant volume within the pod running the test?

If so, this was greatly helpful and we'll give it a shot.
For reference, we will be reading the file in our k6 script somewhat like this:
```javascript
const assets = new SharedArray('assets', () => {
    const assetFile = open('<path-to-file>');
    return papaparse.parse(assetFile, { header: true }).data;
});
```
Yes, if you are using the k6-operator, the `volumeClaim` allows you to mount the script file, and any other file in the volume is also available to the script. Your code should work.
Here is a piece of code from our scripts that is working:

```javascript
const QUERIES = new SharedArray("queries", function () {
    console.info(
        "Getting queries from path: " + ROOT_PATH + LOCAL_PATH + '/queries.json');
    return JSON.parse(open(ROOT_PATH + LOCAL_PATH + '/queries.json'));
})[0];
```
We are opening several files of more than 2MB each this way, so I think it will work for bigger files.
Thank you, this was very helpful!
Issues like https://github.com/grafana/k6/issues/2911 and https://github.com/grafana/k6/issues/2962 have recently been making me wonder whether we can't make a `SharedArray` improvement that might partially solve this issue for 80+% of users :thinking: What if we allow networked code in the `SharedArray` constructor callback? That is, have something like this:
```javascript
import http from 'k6/http';
import { SharedArray } from 'k6/data';

// no network code allowed here

const data = new SharedArray('some name', function () {
    // we allow networked code here
    let resp = http.get("https://some.url/that-returns-a-ton-of-data");
    return resp.json();
});

// no network code allowed here
```
Implementing this will probably be quite tricky and might require some refactoring of how we initialize the JS runtimes, but it seems doable :thinking: For example, it's probably fine to drop any metrics generated by these network calls; it might even be desirable to do so... On the other hand, the script `options` would not have been finalized yet, so some APIs might behave strangely... :disappointed: It's very far from ideal from a UX perspective, but if it can be done, it should significantly lessen the need to support `SharedArray` in `setup()` or the VU context :thinking:
There is also the problem of distributed execution and .tar archives. If we allow networking code in the `SharedArray` callbacks, are we going to execute it on every instance that runs the .tar archive, or are we going to store the result in the .tar file? :thinking: We get around this issue right now by saving the `open()`-ed files from the init context in the .tar archive, so all instances are able to reconstruct the final `SharedArray` value identically. But this might not always be the desired behavior :thinking:
On a somewhat related note, the proof-of-concept architecture I did in https://github.com/grafana/k6/pull/2816 for distributed execution (https://github.com/grafana/k6/issues/140) and test suites (https://github.com/grafana/k6/issues/1342) would also allow us to relatively easily have a `SharedArray` constructed during `setup()` or VU code, the original suggestion proposed in the OP of this issue. I didn't make a PoC for that, but it should be possible to make one quickly... :thinking: One nice thing about this approach is that the `SharedArray` constructor would be run only on a single machine, regardless of whether it was in the init context, VU context or even inside `setup()`...
It will do that by basically making `setup()` not that special, i.e. by adding a mechanism for shared blobs of binary data that any one instance can create and consume: https://github.com/grafana/k6/blob/49a2e27972a95f0ab18a4fb10fc406a36a75fa07/execution/controller.go#L5-L7

That is used for the `setup()` function in that branch to work in a distributed test run, but it can be used equally easily for `SharedArray` with a bit of simple optimization: https://github.com/grafana/k6/blob/49a2e27972a95f0ab18a4fb10fc406a36a75fa07/execution/scheduler.go#L443-L457
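As a rough, in-process illustration of that mechanism (hypothetical names; the real interface lives in `execution/controller.go` in that branch), a run-once shared-blob store might look like this: the first caller for a given id runs the callback and stores the result, and every later caller gets the cached bytes.

```go
package main

import (
	"fmt"
	"sync"
)

// dataStore is a toy, in-process sketch of the "shared blobs of binary data"
// mechanism described above. It is only an illustration, not k6's actual code.
type dataStore struct {
	mu    sync.Mutex
	blobs map[string][]byte
}

func newDataStore() *dataStore {
	return &dataStore{blobs: make(map[string][]byte)}
}

// GetOrCreateData runs create exactly once per id; concurrent and later
// callers get the cached result.
func (d *dataStore) GetOrCreateData(id string, create func() ([]byte, error)) ([]byte, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if b, ok := d.blobs[id]; ok {
		return b, nil // someone already created it
	}
	b, err := create()
	if err != nil {
		return nil, err
	}
	d.blobs[id] = b
	return b, nil
}

func main() {
	store := newDataStore()
	calls := 0
	create := func() ([]byte, error) {
		calls++ // the expensive work (e.g. a SharedArray callback) runs once
		return []byte(`[1,2,3]`), nil
	}
	store.GetOrCreateData("shared-array:data", create)
	b, _ := store.GetOrCreateData("shared-array:data", create)
	fmt.Println(string(b), calls) // [1,2,3] 1
}
```

In a distributed setting the map would of course live on a coordinator and be reached over the network, but the contract is the same: one creator, many consumers.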
Still, while I think that approach is more internally consistent and easier to reason about, the two approaches are not necessarily mutually exclusive, with a bit of work and thought :thinking: If allowing network requests in the init-context `SharedArray` callbacks is easy, maybe we can do that first. We can either save the results in the .tar archives, or maybe we can have a `PerInstanceSharedArray` that explicitly gets executed once on every instance :thinking: Or we can "canonize" that the `SharedArray` callback is called on every instance (since that's what currently happens) and add a new `GlobalSharedArray` that doesn't do that, if we want to, in the future... :man_shrugging: There are a lot of subtle issues and potential solutions that we have to consider before we make anything non-experimental... :scream:

So, yeah, to conclude this somewhat rambling stream of thoughts: more proofs-of-concept are needed here :sweat_smile: I am probably missing quite a lot of critical details, e.g. it might be completely impossible to allow init-time network calls only in `SharedArray` callbacks without rewriting half of k6... :sweat_smile:
**Initialize `SharedArray` in a non-init context**

There is nothing preventing us from just letting `SharedArray` be created outside of the init context. I think it was previously not technically possible because it used a different mechanism to share the data, but this is no longer a problem.

The only caveat is that it should probably disallow creating a `SharedArray` from `setup`, `teardown` and `handleSummary`, as... well, the last two aren't all that useful, and the first one will not work in a distributed setup (unless we decide that `SharedArray` should be shared between distributed instances, but I think that will be technically... very hard :tm:).

**`SharedArray` should be returnable from `setup`**

This will need some special way of first detecting that whatever was returned is a `SharedArray`, and then a way to save this in the JSON and reproduce the array on the other side.

This gets even more complicated because you might want to return an array of `SharedArray`s... which means this might need to be supported on many levels :sob:; we might decide not to support that at first.

Because of the caveat mentioned above, it probably will need some... specific code ?!?

Alternative: maybe, as suggested in this comment, we should just make another breaking change and make the returned setup data always a "SharedObject" or something like that. I would expect that, depending on how deeply it is "shared", it will either not save memory or be slower, at least to some degree.

Use cases:

- When a lot of data will be generated in `setup` (for example, because HTTP requests are needed to get the data), it would be nice to not have this data copied for each VU, but instead keep a single copy.
- Making the `SharedArray` creatable in the `default` function can also help with that, as only a single VU will call the initialization code... once. The problem is that `setup` might still need to do some setting up, and there is no way to return this data to `teardown` if it's needed there.

edit: This is likely going to wait for #140 in order to be doable.