grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

gRPC using reflect race on registering types #3961

Open jason-speck opened 2 weeks ago

jason-speck commented 2 weeks ago

Brief summary

We have tests that use the constant-arrival-rate executor against a number of gRPC applications. We are using gRPC server reflection. Most of these run fine. Against some applications, we occasionally receive a panic:

panic: proto: message proto.com.XXXXX.XXXXX.XXXXX.MyService1 is already registered

This is not completely deterministic. However, for applications where it occurs, the probability increases as we increase the iteration rate.

As a workaround we are switching these tests to ramping-arrival-rate, but ideally we would like to be able to use constant-arrival-rate.

(I've obfuscated actual host and service names in all output)

k6 version

v0.53.0

OS

Mariner 2.0.20240628

Docker version and image (if applicable)

n/a

Steps to reproduce the problem

version:

$ k6 --version
k6 v0.53.0 (go1.23.1, linux/amd64)
Extensions:
  github.com/grafana/xk6-sql v0.4.1, k6/x/sql [js]
  github.com/mostafa/xk6-kafka v0.27.0, k6/x/kafka [js]
  github.com/thotasrinath/xk6-couchbase v0.0.8, k6/x/couchbase [js]

OS:

$ cat lsb-release 
DISTRIB_ID="Mariner"
DISTRIB_RELEASE="2.0.20240628"
DISTRIB_CODENAME=Mariner
DISTRIB_DESCRIPTION="CBL-Mariner 2.0.20240628"

$ cat mariner-release 
CBL-Mariner 2.0.20240628
MARINER_BUILD_NUMBER=baf1849

$ uname -a
Linux XXXX-XX999.XXX.XXXX.com 5.15.160.1-1.cm2 #1 SMP Fri Jun 28 10:31:37 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Simple test script. See attached for the script and output from several runs. Note that the panic line may show different gRPC service names; it doesn't always show the same one (I gather this is related to reflection).

$ /path/to/k6/intel/k6 run   /tmp/XXXXX/k6.js  

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: /tmp/XXXXX/k6.js
        output: -

     scenarios: (100.00%) 1 scenario, 50 max VUs, 35s max duration (incl. graceful stop):
              * constant-qps: 20000.00 iterations/s for 5s (maxVUs: 50, gracefulStop: 30s)

WARN[0000] Insufficient VUs, reached 50 active VUs and cannot initialize more  executor=constant-arrival-rate scenario=constant-qps
panic: proto: message proto.com.XXXX.XXXX.XXXX.SomeServiceName is already registered

Expected behaviour

The test executes without panicking.

Actual behaviour

See attached output files. k6.js.txt k6.out.1.txt k6.out.2.txt k6.out.3.txt

mstoykov commented 6 days ago

Hi @jason-speck, thanks for reporting this 🙇

This doesn't really have anything to do with the executors, apart from some of them likely making the race more common.

By the looks of it, this is due to a race in the registration of types when using reflection:

https://github.com/grafana/k6/blob/fc1d5cd109cc8c93f1f77cd7f5fa0f5e7a7b26b1/js/modules/k6/grpc/client.go#L448-L454

While the underlying code

https://github.com/grafana/k6/blob/7e82c3dfc585e7652ea125456c6b0c5003ef2502/vendor/google.golang.org/protobuf/reflect/protoregistry/registry.go#L493-L511

has a mutex, it is taken after we do the "not found" check, so two different VUs doing gRPC requests can get to this point together. One will then register the type while holding the lock, while the other waits for the lock and then tries to register it again. The latter will get the error you are seeing.

I think that:

  1. we probably shouldn't be using GlobalTypes to begin with;
  2. we should hold the lock around the "not found" check as well.

cc @olegbespalov

@jason-speck you can probably try the proposal from the protobuf FAQ of running with GOLANG_PROTOBUF_REGISTRATION_CONFLICT=warn k6 ... so that the conflict is reported as a warning instead of a panic.