grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0
25.04k stars 1.24k forks source link

Implement specialized executor for fault injection #2742

Open pablochacin opened 1 year ago

pablochacin commented 1 year ago

Edit: More context added to the proposal

The xk6-disruptor extension provides functions that allow the injection of faults during the execution of tests. Injecting faults in running tests follows a pattern that is different from the common use case for k6, which consists in generating multiple requests to simulate a workload from one or more users. Injecting faults, on the contrary, require the execution of single events. Each of these events (a fault injection) should have a defined duration.

This is accomplished by using two scenarios: one for running the test and another for injecting the faults as in the example below. The scenario that injects the fault must invoke the fault injection logic only once, therefore a shared iterations executor is used specifying one UV and one iteration.


export function disrupt() {
  // delay traffic from one random replica of the deployment
  const fault = {
      average_delay: 50,
      error_code: 500,
      error_rate: 0.1
  }
  podDisruptor.injectHttpFaults(fault, 60)  // second argument is the duration of the disruption in seconds 
}

export default function(data) {
   http.get(`http://url.to.test.application`);
}

export const options = {
   // This scenario executes the tests
    scenarios: {
        load: {
            executor: 'constant-arrival-rate',
            rate: 100,
            preAllocatedVUs: 10,
            maxVUs: 100,
            exec: "default",
            startTime: '0s',
            duration: "60s",
        },
        // This scenario injects the faults
        disrupt: {
             executor: 'shared-iterations',
             iterations: 1,
             vus: 1,
             exec: "disrupt",
             startTime: "0s",
        }
    }
}

However, this approach has some issues:

  1. It is error prone: the user must be aware that only one VU and one iteration should be specified for the disruption scenario. Specifying multiple iterations and multiple VUs may have unexpected consequences.
  2. The definition of the duration of the disruption (the time the faults will be injected in the target) may result confusing: the shared-iterations executor does not support the duration attribute. Therefore the duration is defined in the invocation to the function that injects the faults (injectHttpFaults in the example below). To make things more confusing, if this duration is longer than 10m, the maxDuration attribute of the scenario must be specified or it will be cancelled while the load scenario (the one injecting load) continues its execution, but without the fault injection, creating misleading results.

To overcome this limitations, instead of reusing existing executor for this particular use case, it would be convenient to have a specialized executor that allows only one iteration in one VU and have no default maxDuration, a duration attribute so it must be explicitly defined if required (most of the time, it should not be required). In this way, following the Poka-yoke philosophy introducing errors will be hard by design.

Following this ideas, the example above would be implemented as shown in the code below (some code omitted for brevity):


export function disrupt() {
  // delay traffic from one random replica of the deployment
  const fault = {
      average_delay: 50,
      error_code: 500,
      error_rate: 0.1
  }
  podDisruptor.injectHttpFaults(fault, 60)  // second argument is the duration of the disruption in seconds 
}

export const options = {
   // This scenario executes the tests
    scenarios: {
        // This scenario injects the faults
        disrupt: {
             executor: 'chaos',
             exec: "disrupt",
             startTime: "0s",
        }
    }
}

Additionally, a duration attribute could be defined in the scenario and made accessible by means of the k6-execution api so that it could used as a default value for the duration of the disruption function. In this way the duration would be more explicit in the definition of the scenario.


export function disrupt() {
  // delay traffic from one random replica of the deployment
  const fault = {
      average_delay: 50,
      error_code: 500,
      error_rate: 0.1
  }
  podDisruptor.injectHttpFaults(fault, exec.scenario.duration)  // second argument is the duration of the disruption in seconds 
}

export const options = {
   // This scenario executes the tests
    scenarios: {
        load: {
        }
        // This scenario injects the faults
        disrupt: {
             executor: 'chaos',
             exec: "disrupt",
             startTime: "0s",
             duration: "5m"
        }
    }
}
na-- commented 1 year ago

specialized executor that allows only one iteration in one VU and have no default maxDuration

Not having a maxDuration is not an option if we want it to run in the cloud. We need scenarios to have a maximum duration because otherwise we can't bill people or know how to allocate resources to run the tests. It is also probably a good thing to have for the upcoming native distributed execution.

When you know how long the disruption will be, why is having a maxDuration of the scenario a problem? And how will this chaos executor be different from a one-VU one-iteration executor?

Or, to put it another way, why isn't this problem solvable with JS code and needs changes in k6 fundamentals? For example, a small JS helper can generate both the scenarios config (for use in options) and return the exec callback, right?

pablochacin commented 1 year ago

When you know how long the disruption will be, why is having a maxDuration of the scenario a problem?

Maybe I did not explain myself correctly. The problem is not having maxDuration, but the confusion in having the duration defined both in the scenario and passed to method that injects the fault and what happens if those two don't match. The real objective is not have this redundancy.

Or, to put it another way, why isn't this problem solvable with JS code and needs changes in k6 fundamentals? For example, a small JS helper can generate both the scenarios config (for use in options) and return the exec callback, right?

Given that every scenario that runs a chaos will have this same issues described above, delegating it to a helper function that must be imported from a js library along with the xk6-disruptor doesn't seem the best developer experience, in my opinion. Not all what is possible is convenient.

However, if I could inject this scenario from the extension itself, that could be a good option, but I'm not sure if that is possible.

That is, if I could call the inject disruption method in the setup function init code and this method creates the necessary configuration of scenarios that would be a much better option. One issue I see is how to pass the function to be executed by the scenario, for instance.

And how will this chaos executor be different from a one-VU one-iteration executor?

This executor will basically be a restricted version of the shared iterations executor that does not allow unsafe parameter combinations by removing those parameters. I don't see it as a "change in k6 fundamentals!.

pablochacin commented 1 year ago

I updated the proposal to clarify that the requirement is not to remove the maxDuration attribute but to make the duration attribute explicit.

pablochacin commented 1 year ago

I updated the proposal for providing more context on the use case that raised this requirement.

sniku commented 1 year ago

I understand that the primary goal the proposal for adding the chaos executor is to provide better defaults for the 1 VU, 1 iteration scenario that must today be specified using the shared-iterations executor. I think the quote "Not all what is possible is convenient." captures the intention well.

A question about the assumptions:

  1. Are we sure the fault injection will always be as simple as in the example where we have 1 iteration? Won't we have some types of disruptions where more faults are injected over time? For example, 1 fault every 30 seconds for 5 minutes.

I'm not a fan of helper functions for creating scenarios because they feel a little hacky, but to illustrate it, this is how it could be implemented today without making core changes.

import { chaosScenario } from 'https://jslib.k6.io/chaosUtils/1.0.0/index.js';

export function disrupt() {
  // delay traffic from one random replica of the deployment
  const fault = {
      average_delay: 50,
      error_code: 500,
      error_rate: 0.1
  }
  podDisruptor.injectHttpFaults(fault, 60)  // second argument is the duration of the disruption in seconds 
}

export const options = {
   // This scenario executes the tests
    scenarios: {
        load: {
            executor: 'constant-arrival-rate',
            rate: 100,
            preAllocatedVUs: 10,
            maxVUs: 100,
            exec: "default",
            startTime: '0s',
            duration: "60s",
        },

        disrupt: chaosScenario({exec: "disrupt"}) // all other params are default but can be overriden
    }
}

The downside of implementing a chaos executor is that it's use-case specific. All other executors are generic and not tied to any single intended use case. Perhaps this is the right approach if we are building a generic reliability platform. I imagine that if we add a chaos executor into the core, users will start using it for unintended purposes making the code confusing. For example, when they want to run the code only once, they use the "chaos" executor. I would be more in favor of naming it only-once rather than chaos.

Another approach is implementing the chaos executor as part of the xk6-disruptor codebase. I believe we have said in the past that creating new executors from extensions is possible, but it needs to be confirmed.

pablochacin commented 1 year ago

The downside of implementing a chaos executor is that it's use-case specific...I would be more in favor of naming it only-once rather than chaos.

Totally agree. I though of using that name, but to be honest, the only use case I'm aware now that would require this only-once execution is chaos.

Are we sure the fault injection will always be as simple as in the example where we have 1 iteration? Won't we have some types of disruptions where more faults are injected over time? For example, 1 fault every 30 seconds for 5 minutes

This is possible, indeed. I can think about some failures like killing pods that may follow this pattern. But I think this can be easily implemented using a loop and a sleep instead of using the fixed rate executor, for example.

What worries me more is the possibility of running more than one fault injection concurrently from multiple VUs.

Maybe, the goal should be to find a mechanism that prevents multiple concurrent executions of the disruptor to happen, but I don't see any, considering a distributed execution environment except using some form of Kubernetes resource or using annotations in the targets, but both are complex to implement reliably.

I'm not a fan of helper functions for creating scenarios because they feel a little hacky, but to illustrate it, this is how it could be implemented today without making core changes.

I was also of this impression, but seeing your proposal, it looks interesting. I'm wondering if this function could be exported from the extension, instead of a JS library. Is there any technical restriction on this regard, like not been able to call functions from extensions from the init code?

na-- commented 1 year ago

I'm wondering if this function could be exported from the extension, instead of a JS library. Is there any technical restriction on this regard, like not been able to call functions from extensions from the init code?

Yes, of course, xk6 JS extensions (and built-in modules) primarily export JS functions that you can import and use in regular JS :sweat_smile: I am unfamiliar with the xk6-disruptor codebase, but here is how the built-in sleep() function is exported: https://github.com/grafana/k6/blob/8a74171c8b43f48fb53078e0bcfa7851828124bc/js/modules/k6/k6.go#L52-L63 https://github.com/grafana/k6/blob/8a74171c8b43f48fb53078e0bcfa7851828124bc/js/modules/k6/k6.go#L71

You can do something like that to export a chaosScenario() function from one of the modules of any extension. For completeness' sake, here is how you should be able to make an executor extension: https://github.com/grafana/k6/blob/8a74171c8b43f48fb53078e0bcfa7851828124bc/lib/executor/shared_iterations.go#L19-L30

It's the same API that is used to register built-in executor types :sweat_smile: To be clear, we make no promises that this API won't change. In fact, it's quite likely to change (https://github.com/grafana/k6/issues/1302), but even after we fix the module structure, there would be no obstacles to providing an official API for executor extensions (https://github.com/grafana/k6/issues/2254). All of that said, I still think a custom executor type would be an overkill here, and the problem should be better solved some other way.

For example, if the current way to specify scenarios and executor options seems to error-prone and unwieldy, we should probably design a better way to specify them instead of having a bunch of semi-duplicate executor types? The goal of the different executors is to express different ways of scheduling work (iterations) among a set of workers (VUs). Where there is some precedent for overlapping executors (e.g. constant-vus is a special case of ramping-vus without any ramping, same for constant-arrival-rate and ramping-arrival-rate...), having a special executor for just 1-VU, 1-iteration scenarios seems a bit too specific.

Besides, this (and other similar things) seem like a completely plausible use cases that someone might want to test for:

Are we sure the fault injection will always be as simple as in the example where we have 1 iteration? Won't we have some types of disruptions where more faults are injected over time? For example, 1 fault every 30 seconds for 5 minutes.

But I think this can be easily implemented using a loop and a sleep instead of using the fixed rate executor, for example.

Implementing it with a for-loop would potentially introduce inaccuracies and delays from the time the operation takes to complete, unless you carefully account for that. And accounting for it would be certainly more complicated than configuring an executor even with the current scenarios config.

What worries me more is the possibility of running more than one fault injection concurrently from multiple VUs.

Maybe, the goal should be to find a mechanism that prevents multiple concurrent executions of the disruptor to happen, but I don't see any, considering a distributed execution environment except using some form of Kubernetes resource or using annotations in the targets, but both are complex to implement reliably.

I don't understand the issues here, sorry. If you have a scenario with a single VU and a single iteration (or 1 iteration every 30s or whatever), with execution segments k6 will never schedule more than 1 VU, regardless of how many instances you have.

pablochacin commented 1 year ago

Hi @na-- Thanks your your extensive explanation. Really useful. Just a few comments/clarifications

Yes, of course, xk6 JS extensions (and built-in modules) primarily export JS functions that you can import and use in regular JS sweat_smile I am unfamiliar with the xk6-disruptor codebase, but here is how the built-in sleep() function is exported:

My question was more in the line of any restriction about using the functions exported by an extension in the init code. From your explanation, it seems to me that there is none, so I will explore this way.

I don't understand the issues here, sorry. If you have a scenario with a single VU and a single iteration (or 1 iteration every 30s or whatever), with execution segments k6 will never schedule more than 1 VU, regardless of how many instances you have.

That is right. If there is only one VU multiple iterations should be ok. My comment was regarding to the risk of opening the use of any scenario configuration because there is this strict restriction of only one VU at a time.

having a special executor for just 1-VU, 1-iteration scenarios seems a bit too specific.

There was also another requirement regarding the duration of each iteration. The Shared iterations executor does not allow this parameter. But I think this requirement can be also managed by generating the scenario using the helper function.

if the current way to specify scenarios and executor options seems to error-prone and unwieldy, we should probably design a better way to specify them instead of having a bunch of semi-duplicate executor types?

I don't think there's an issue with the executors from the perspective of generating load. As I mentioned before, I think the "problem" I was facing comes from the fact that faults are discrete events that occur at specific times during the execution of a test (generally, only once) and this seemed to me different from generating a workload.

But after this discussion I think the solution is more in the side of providing a different UX in the disruptor than changing the way k6 schedules work.

na-- commented 1 year ago

From your explanation, it seems to me that there is none, so I will explore this way.

Yes, there is no restriction. We have to impose that restriction artificially, when we want some function to be used only in the init context. For example: https://github.com/grafana/k6/blob/8a74171c8b43f48fb53078e0bcfa7851828124bc/js/modules/k6/data/data.go#L66-L71

Or only in the VU context: https://github.com/grafana/k6/blob/8a74171c8b43f48fb53078e0bcfa7851828124bc/js/modules/k6/k6.go#L142-L146

As you can see, the current way to ensure either is to check whether the VU State is nil or not.