grafana / xk6-disruptor

Extension for injecting faults into k6 tests
https://k6.io/docs/javascript-api/xk6-disruptor/
GNU Affero General Public License v3.0

Investigate delays in gRPC requests when no delay is added by the fault #162

Open pablochacin opened 1 year ago

pablochacin commented 1 year ago

When testing fault injection in the Online Boutique application, the requests seem to have a significant delay (around 500ms) even when the faults do not specify any delay (only an error code).

The SUT of the test is the frontend service's HTTP API, and the fault injection target was the catalog service, using gRPC fault injection. Therefore, it is not clear whether the delay actually happened in the gRPC requests, but repeating the test several times produced the same behavior.

It would be necessary to check the behavior with a test that accesses the catalog service directly.
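Such a direct check could look like the following minimal k6 script (a sketch only; it assumes the catalog.proto definition and the CATALOG_SVC address used later in this issue):

```
import grpc from 'k6/net/grpc';
import { check } from 'k6';

const client = new grpc.Client();
client.load([], 'catalog.proto');

export default function () {
  // Call the catalog service directly, bypassing the frontend, to see
  // whether the extra latency shows up on the gRPC path itself.
  client.connect(__ENV.CATALOG_SVC, { plaintext: true, timeout: '1s' });

  const response = client.invoke(
    'hipstershop.ProductCatalogService/GetProduct',
    { id: 'OLJCESPC7Z' },
    { timeout: '2s' }
  );
  client.close();

  check(response, {
    'status is OK': (r) => r && r.status === grpc.StatusOK,
  });
}
```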

pablochacin commented 1 year ago

This issue seems to be confirmed by the following test: dummy gRPC faults (0 error rate and 0 delay) were injected into the catalog service while HTTP requests were applied to the frontend service and gRPC requests were applied to the catalog service. The results show that the gRPC requests have a significant delay when compared to tests that do not inject any fault.
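For reference, the dummy fault used in this test (excerpted from the full test script further down) specifies no delay and a zero error rate, so it should effectively be a no-op:

```
// Excerpt from the test script below: a fault with a 0 error rate and no
// delay, injected into the catalog service for the duration of the scenarios.
const fault = {
  port: 3550,
  errorRate: 0.0,
  statusCode: 10,
};

const svcDisruptor = new ServiceDisruptor('productcatalogservice', 'default');
svcDisruptor.injectGrpcFaults(fault, '30s');
```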

Test with no fault injection:

grpc_req_duration..............: avg=904.77µs min=570.41µs med=849.4µs  max=2.34ms   p(90)=1.07ms   p(95)=1.14ms
http_req_duration..............: avg=52.35ms  min=10ms     med=34.15ms  max=380.53ms p(90)=134.02ms p(95)=186.8ms 

Test with dummy fault injection:

grpc_req_duration..............: avg=17.05ms  min=639.76µs med=1.5ms    max=101.2ms  p(90)=64.73ms  p(95)=100.01ms
http_req_duration..............: avg=290.56ms min=996.21µs med=319.16ms max=967.33ms p(90)=567.5ms  p(95)=666.56ms

This can be reproduced following these steps:

  1. Install a kind cluster and configure a load balancer (a sketch of one possible setup follows these steps)
  2. Install the Online Boutique application:
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
  3. Change the type of the product catalog service to LoadBalancer to expose it outside the cluster:
    kubectl patch svc productcatalogservice -p '{"spec": {"type": "LoadBalancer"}}'
  4. Set the addresses of the frontend and product catalog services in environment variables:
    FRONTEND_SVC=$(kubectl get svc frontend-external --output jsonpath='{.status.loadBalancer.ingress[0].ip}')
    CATALOG_SVC=$(kubectl get svc productcatalogservice --output jsonpath='{.status.loadBalancer.ingress[0].ip}')
  5. Run the test, using the INJECT_FAULTS environment variable to toggle fault injection on and off:
    ~/go/bin/k6 --env FRONTEND_SVC=$FRONTEND_SVC --env CATALOG_SVC=$CATALOG_SVC:3550 --env INJECT_FAULTS=1 run test-catalog.js
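For step 1, a possible setup (an assumption here, not necessarily the environment used for these results) is a kind cluster with MetalLB providing the LoadBalancer addresses:

```
# Sketch of step 1, assuming MetalLB. The address pool must fall inside the
# kind Docker network (typically 172.18.0.0/16; check with `docker network inspect kind`).
kind create cluster

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.7/config/manifests/metallb-native.yaml
kubectl wait --namespace metallb-system --for=condition=ready pod --selector=app=metallb --timeout=90s

cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-pool
  namespace: metallb-system
spec:
  addresses:
  - 172.18.255.200-172.18.255.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: kind-l2
  namespace: metallb-system
EOF
```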

Note: the test requires the protobuf definition of the catalog service (catalog.proto, included below).

catalog-test.js

```
import http from 'k6/http';
import { check } from 'k6';
import { ServiceDisruptor } from 'k6/x/disruptor';
import grpc from 'k6/net/grpc';

const client = new grpc.Client();
client.load([], 'catalog.proto');

export const options = {
  thresholds: {
    checks: ['rate>0.99'],
  },
  scenarios: {
    frontend: {
      executor: 'constant-arrival-rate',
      rate: 10,
      preAllocatedVUs: 5,
      maxVUs: 100,
      exec: 'requestFrontend',
      startTime: '10s',
      duration: '30s',
    },
    catalog: {
      executor: 'constant-arrival-rate',
      rate: 10,
      preAllocatedVUs: 5,
      maxVUs: 10,
      exec: 'requestCatalog',
      startTime: '10s',
      duration: '30s',
    },
    inject: {
      executor: 'shared-iterations',
      iterations: 1,
      vus: 1,
      exec: 'injectFaults',
      startTime: '0s',
    },
  },
};

export function requestFrontend() {
  const res = http.get(`http://${__ENV.FRONTEND_SVC}/product/OLJCESPC7Z`);
  check(res, {
    'No errors': (res) => res.status === 200,
  });
}

export function requestCatalog() {
  client.connect(__ENV.CATALOG_SVC, { plaintext: true, timeout: '1s' });

  const data = { id: 'OLJCESPC7Z' };
  const response = client.invoke(
    'hipstershop.ProductCatalogService/GetProduct',
    data,
    { timeout: '2s' }
  );
  client.close();

  check(response, {
    'status is OK': (r) => r && r.status === grpc.StatusOK,
  });
}

export function injectFaults(data) {
  if (__ENV.INJECT_FAULTS != '1') {
    return;
  }

  const fault = {
    port: 3550,
    errorRate: 0.0,
    statusCode: 10,
  };

  const svcDisruptor = new ServiceDisruptor('productcatalogservice', 'default');
  svcDisruptor.injectGrpcFaults(fault, '30s');
}
```
catalog.proto

```
syntax = "proto3";

package hipstershop;

// ---------------Product Catalog----------------

service ProductCatalogService {
  rpc ListProducts(Empty) returns (ListProductsResponse) {}
  rpc GetProduct(GetProductRequest) returns (Product) {}
  rpc SearchProducts(SearchProductsRequest) returns (SearchProductsResponse) {}
}

message Product {
  string id = 1;
  string name = 2;
  string description = 3;
  string picture = 4;
  Money price_usd = 5;

  // Categories such as "clothing" or "kitchen" that can be used to look up
  // other related products.
  repeated string categories = 6;
}

message ListProductsResponse {
  repeated Product products = 1;
}

message GetProductRequest {
  string id = 1;
}

message SearchProductsRequest {
  string query = 1;
}

message SearchProductsResponse {
  repeated Product results = 1;
}

message Empty {}

message Money {
  // The 3-letter currency code defined in ISO 4217.
  string currency_code = 1;

  // The whole units of the amount.
  // For example if `currencyCode` is `"USD"`, then 1 unit is one US dollar.
  int64 units = 2;

  // Number of nano (10^-9) units of the amount.
  // The value must be between -999,999,999 and +999,999,999 inclusive.
  // If `units` is positive, `nanos` must be positive or zero.
  // If `units` is zero, `nanos` can be positive, zero, or negative.
  // If `units` is negative, `nanos` must be negative or zero.
  // For example $-1.75 is represented as `units`=-1 and `nanos`=-750,000,000.
  int32 nanos = 3;
}
```
pablochacin commented 1 year ago

Using the agent instrumentation introduced in #166, it seems the bulk of the latency occurs in the goroutines that handle the connection to/from the upstream gRPC service, and that it is introduced by garbage collection:

image

pablochacin commented 1 year ago

Increasing the memory of the target pods seems to mitigate the GC issue (see image below), but there is still a significant variation in the latency.

Without fault injection

grpc_req_duration..............: avg=871.47µs min=457.38µs med=660.8µs  max=25.82ms  p(90)=846.88µs p(95)=1ms 

With dummy fault injection

 grpc_req_duration..............: avg=5.51ms   min=474.6µs  med=1.16ms   max=101.57ms p(90)=15.65ms  p(95)=16.46ms

image
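The issue does not record the exact resource values used; as a rough sketch, the memory of the catalog pod could be raised with a patch like the one below (it assumes the container is named server, as in the upstream Online Boutique manifests, and the amounts are only illustrative):

```
# Illustrative only: raise the memory request/limit of the product catalog pod.
kubectl patch deployment productcatalogservice -p '{"spec":{"template":{"spec":{"containers":[{"name":"server","resources":{"requests":{"memory":"128Mi"},"limits":{"memory":"256Mi"}}}]}}}}'
```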