jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
20.26k stars 2.42k forks source link

[Bug]: race condition when reloading TLS certificates #4316

Open yurishkuro opened 1 year ago

yurishkuro commented 1 year ago

What happened?

This is a continuation of this discussion: https://github.com/jaegertracing/jaeger/pull/4260#discussion_r1141124965

Steps to reproduce

We can probably reproduce it by creating a unit test (which will run with -race) that does the following:

Expected behavior

We expect that changes to certificates in a server take effect on future client connections. Instead, the test described above should detect a race condition as we try to update the same certPools that are already being used by the server (and therefore will be read on new client connections).

Relevant log output

No response

Screenshot

No response

Additional context

The rough proposal would be as follows:

NB: this probably would not address the issue of reloading root CA for the server or the client, perhaps we should not be even supporting it. GetConfigForClient would only allow supporting reloading of ClientCA on the server, per @tsaarni 's comment.

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Stogage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

ChillOrb commented 1 year ago

@yurishkuro I am trying to reproduce the error by writing a test case.

Is this the right approach to identify the problem you are referring in the issue?

  1. Can i create self signed server certificates and keys and store them as temp files inside func generateNewCertificateFiles()
  2. Setup tls config and spin up a server
  3. In another goroutine ,in a loop, use generateNewCertificateFiles() to create new certs and reload certs for local server.
  4. In another loop try to make get calls to local server.
yurishkuro commented 1 year ago

Yes, and you can do this all in a unit test, such that if run with -race flag Go should reliably detect race condition.

For step 1 you don't need to generate new certs, we already have working pairs as text fixtures in the tlscfg package.

ChillOrb commented 1 year ago

@yurishkuro sorry got busy. I am having trouble in reproducing the race condition 1. this is the test function that i came up with to reproduce the error

  1. I am trying to create option with existing test certs in tlscfg package link
  2. Here i start tls server in background with option above link
  3. In another goroutine , generating new certificates using gen-certs.sh and generate new server config by calling cfg, err := options.Config(zap.NewNop()) then i update the server config server.TLSConfig = cfg this causes another race condition.
  4. In another go routine , crating client with options and getting new config , also making Get calls to background

Not able to reproduce race conditions at

WARNING: DATA RACE
Read at 0x00c0000105e8 by goroutine 22:
  crypto/x509.(*CertPool).addCertFunc()
      /opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:189 +0x6b8
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:227 +0x4f1
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.addCertToPool()
      /home/runner/work/jaeger/jaeger/pkg/config/tlscfg/options.go:142 +0x1d7

as mentioned in comment I want to know if this approach to test is in the right direction ?

I will try to spend some more time on this

shashank-iitbhu commented 7 months ago

Hey @yurishkuro ! I was trying to reproduce this issue. So I came up with two tests, first test TestConcurrentCertPoolAccessForDataRace link directly provoking a data race by concurrently appending certificates to an x509.CertPool while the second test TestConcurrentConfigAccess link concurrently invokes a method Options.Config that loads and update certificate pools indirectly.

shashank-iitbhu commented 7 months ago

Both the test are provoking Data Race conditions. Data Race provoked by the first test TestConcurrentCertPoolAccessForDataRace link is similar to the actual Data Race:


==================
WARNING: DATA RACE
Read at 0x00c00012f7a0 by goroutine 8:
  runtime.makeBucketArray()
      /usr/local/go/src/runtime/map.go:346 +0x21c
  crypto/x509.(*CertPool).addCertFunc()
      /usr/local/go/src/crypto/x509/cert_pool.go:189 +0x55c
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44

Previous write at 0x00c00012f7a0 by goroutine 16:
  runtime.mapaccessK()
      /usr/local/go/src/runtime/map.go:519 +0x1ec
  crypto/x509.(*CertPool).addCertFunc()
      /usr/local/go/src/crypto/x509/cert_pool.go:193 +0x594
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44
==================```
yurishkuro commented 7 months ago

Does it mean your tests reliably reproduce data race? If so, great job! Can you create a PR adding those tests? And since you're deep into this, do you see a way to fix this?