openservicemesh / osm

Open Service Mesh (OSM) is a lightweight, extensible, cloud native service mesh that allows users to uniformly manage, secure, and get out-of-the-box observability features for highly dynamic microservice environments.
https://openservicemesh.io/
Apache License 2.0

Certificate rotation broken in long running environments #5000

Closed shashankram closed 2 years ago

shashankram commented 2 years ago

Bug description: I noticed that in 2 of my setups that have been running for several days, certificate rotation is broken. Expired certificates in Envoy are never rotated, leading to complete traffic disruption between apps.

Log on the client:

{"bytes_received":0,"response_flags":"UF","upstream_service_time":null,"response_code":503,"start_time":"2022-08-15T18:02:56.864Z","authority":"fortio.demo.svc.cluster.local:8080","duration":1,"bytes_sent":195,"protocol":"HTTP/1.1","x_forwarded_for":null,"path":"/","request_id":"26aa1f49-a09b-4b13-81bd-0acfa49852b8","user_agent":"fortio.org/fortio-1.34.1","response_code_details":"upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED}","time_to_first_byte":null,"requested_server_name":null,"upstream_host":"10.244.1.5:8080","method":"GET","upstream_cluster":"demo/fortio|8080"}

Stat on the client: cluster.demo/fortio|8080.ssl.fail_verify_error: 50

The cert output at osm_bug_report_2928885602/namespaces/demo/pods/fortio-client-b9b7bbfb8-hc9wr/commands/osm_proxy_get_certs_fortio-client-b9b7bbfb8-hc9wr_-n_demo confirms the cert is not being updated: the client cert has an expiration date of 2022-08-12T19:05:52Z, which should be past 2022-08-15 (the current date).

Both the client and server are connected to the controller as per the XDS cluster stats collected in the bug-report.

osm-controller indicates the cert default.demo.cluster.local has expired but has not been rotated:

     Common Name: "default.demo.cluster.local"
     Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (16h45m27.4651171s remaining)
     Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Cert Chain (SHA256): c4b68f17387579bff64ee155ae7263687c2d3fbea929ad56b0489b6ebd074dd8
     x509.SignatureAlgorithm: SHA256-RSA
     x509.PublicKeyAlgorithm: RSA
     x509.Version: 3
     x509.SerialNumber: 980b07dedb46443624884fcfcd9b9f03
     x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
     x509.Subject: CN=default.demo.cluster.local,O=Open Service Mesh
     x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (95h30m32.0984793s ago)
     x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-71h30m32.0984815s remaining)
     x509.BasicConstraintsValid: true
     x509.IsCA: false
     x509.DNSNames: [default.demo.cluster.local]
     Cert struct expiration vs. x509.NotAfter: -39.1624ms


Expected behavior: Certificates should be rotated in long running environments.

Steps to reproduce the bug (as precisely as possible): I observed this bug twice while running the global rate limit demo over multiple days.

How was OSM installed?:

osm install --set osm.image.registry=$CTR_REGISTRY --set osm.image.tag=$CTR_TAG --set osm.image.pullPolicy=Always --set osm.enablePermissiveTrafficPolicy=true

Bug report archive:

2313848016_osm-bug-report.tar.gz


steeling commented 2 years ago

The attached files are all showing that the days_until_expiration is in the future.

I've also set up a local cluster with a 30s service validity duration and have verified that certs are both being rotated and properly propagated to the sidecar.

shashankram commented 2 years ago

The attached files are all showing that the days_until_expiration is in the future.

I've also set up a local cluster with a 30s service validity duration and have verified that certs are both being rotated and properly propagated to the sidecar.

I don't think the days_until_expiration is accurate. It may have overflowed. The expiration date is in the past as seen below.

     "days_until_expiration": "18446744073709551614",
     "valid_from": "2022-08-11T19:05:52Z",
     "expiration_time": "2022-08-12T19:05:52Z"
steeling commented 2 years ago

Oh interesting, I missed that.

Would you be able to bump the controller to debug/trace logging to get more info as well?

shashankram commented 2 years ago

I have trace logging enabled.

Strangely, none of the cert serial numbers I see in the logs below match the ones displayed in the debug server's /debug/certs output:

{"level":"trace","component":"certificate","time":"2022-08-15T19:42:07Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=29871855943509303830645595579115296839"}
{"level":"trace","component":"certificate","time":"2022-08-15T19:42:07Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=155024194875030960472719792658873311581"}
{"level":"trace","component":"certificate","time":"2022-08-15T19:42:07Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=202099930258132563591593472965020983043"}
{"level":"trace","component":"certificate","time":"2022-08-15T19:42:12Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=202099930258132563591593472965020983043"}
{"level":"trace","component":"certificate","time":"2022-08-15T19:42:12Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=29871855943509303830645595579115296839"}
{"level":"trace","component":"certificate","time":"2022-08-15T19:42:12Z","file":"manager.go:249","message":"Certificate found in cache SerialNumber=155024194875030960472719792658873311581"}
---[ 0 ]---
     Common Name: "ads.cluster.local"
     Valid Until: 2032-08-08 19:04:08.8324488 +0000 UTC m=+315360001.222718401 (87591h38m59.6801654s remaining)
     Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Cert Chain (SHA256): 9ea1f7a851a3e79137e828309fbf78eed86ab5e61a5c7c3b3a4512467fad92d3
     x509.SignatureAlgorithm: SHA256-RSA
     x509.PublicKeyAlgorithm: RSA
     x509.Version: 3
     x509.SerialNumber: 74a092eb34540b1f1a023005de73e15d
     x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
     x509.Subject: CN=ads.cluster.local,O=Open Service Mesh
     x509.NotBefore (begin): 2022-08-11 19:04:08 +0000 UTC (96h37m0.6772795s ago)
     x509.NotAfter (end): 2032-08-08 19:04:08 +0000 UTC (87503h22m59.3226904s remaining)
     x509.BasicConstraintsValid: true
     x509.IsCA: false
     x509.DNSNames: [ads.cluster.local]
     Cert struct expiration vs. x509.NotAfter: -832.4488ms

---[ 1 ]---
     Common Name: "default.demo.cluster.local"
     Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (15h40m42.8860671s remaining)
     Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Cert Chain (SHA256): c4b68f17387579bff64ee155ae7263687c2d3fbea929ad56b0489b6ebd074dd8
     x509.SignatureAlgorithm: SHA256-RSA
     x509.PublicKeyAlgorithm: RSA
     x509.Version: 3
     x509.SerialNumber: 980b07dedb46443624884fcfcd9b9f03
     x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
     x509.Subject: CN=default.demo.cluster.local,O=Open Service Mesh
     x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (96h35m16.6775508s ago)
     x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-72h35m16.6775541s remaining)
     x509.BasicConstraintsValid: true
     x509.IsCA: false
     x509.DNSNames: [default.demo.cluster.local]
     Cert struct expiration vs. x509.NotAfter: -39.1624ms

---[ 2 ]---
     Common Name: "osm-validator.osm-system.svc"
     Valid Until: 2032-08-08 19:04:09.4120288 +0000 UTC m=+315360001.802298401 (87591h39m0.2587751s remaining)
     Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
     Cert Chain (SHA256): 6091bcc712056568dee59db18f4075fd4c8b2c47e02319370f5d6e253a0f2123
     x509.SignatureAlgorithm: SHA256-RSA
     x509.PublicKeyAlgorithm: RSA
     x509.Version: 3
     x509.SerialNumber: 16791c351fa7d38f6664d772e452d847
     x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
     x509.Subject: CN=osm-validator.osm-system.svc,O=Open Service Mesh
     x509.NotBefore (begin): 2022-08-11 19:04:09 +0000 UTC (96h36m59.6777024s ago)
     x509.NotAfter (end): 2032-08-08 19:04:09 +0000 UTC (87503h23m0.3222948s remaining)
     x509.BasicConstraintsValid: true
     x509.IsCA: false
     x509.DNSNames: [osm-validator.osm-system.svc]
     Cert struct expiration vs. x509.NotAfter: -412.0288ms
steeling commented 2 years ago

And the logs in the bug report are unfortunately all INFO logs. Are you seeing any logs with the message Certificate has been updated for proxy, which would be at the debug level?

shashankram commented 2 years ago

With the current validity duration (24h), I don't see this log after enabling trace logging.

Interestingly, none of the cert serial numbers I see in the logs match the ones in the output of /debug/certs. That seems incorrect.

steeling commented 2 years ago

So it seems the certs in the cache have been rotated, but somewhere there's been a disconnect in propagating that to the proxy. Even if we lowered the validity duration, the certs won't get re-rotated until ~30s before they expire.

We could force a rolling restart, but that may make this no longer reproducible.

If for some reason there was an error in creating the SDS/ADS response, we don't have a retry built in (this would not be unique to v1.2), so we would silently rotate the cert in the cache and then never rotate it on the pod, unless another mechanism came along.

This may be another reason to migrate to the Snapshot Cache, but also another reason to implement level-based reconciliation (periodic retries vs. the current event-based approach).

steeling commented 2 years ago

FWIW, I'm unable to reproduce this locally.

shashankram commented 2 years ago

I have seen this twice on a long-running testbed. It could be a race somewhere, or an issue with an XDS response erroring with no retries in place, like you mentioned. I've also not seen this issue in short-lived testbeds.

Restarting the pods will likely fix this, but I didn't want to do that until we have analyzed all the info from the bug report.

Also, were you able to determine why there is a mismatch between the debug server output and the certs within the cache?

steeling commented 2 years ago

I can't see how that can happen, since they're pulling from the same cache. Unless your browser is caching some old results?

shashankram commented 2 years ago

I can't see how that can happen, since they're pulling from the same cache. Unless your browser is caching some old results?

Nope, I am refreshing the API request and see the timestamp change for the problematic cert:

     x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (97h27m45.3234176s ago)
     x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-73h27m45.3234191s remaining)

This seems to indicate there is some bug where the cert for CN default.demo.cluster.local is never being refreshed, or there are multiple certs corresponding to this CN in the cache and the debugger is picking the expired one, which is also incorrect.

shashankram commented 2 years ago

This could be related to a side-effect of https://github.com/openservicemesh/osm/issues/5002, where a cert that needs to be rotated is not being rotated because of a call to IssueCertificate(), which tries to return a cached copy if possible. Rotated certs must always be freshly issued, and the existing cache entry for that cert invalidated instead.

jaellio commented 2 years ago

This could be related to a side-effect of #5002, where a cert that needs to be rotated is not being rotated because of a call to IssueCertificate(), which tries to return a cached copy if possible. Rotated certs must always be freshly issued, and the existing cache entry for that cert invalidated instead.

When IssueCertificate() is called on a certificate that is in the cache and needs to be rotated, shouldRotate() is called on the cert. If the cert needs to be rotated, shouldRotate() returns true and a new cert is issued using the same prefix as the old certificate. The existing entry in the cache for the prefix is updated.

Previously, there was a RotateCertificate method which called IssueCertificate(). In IssueCertificate() we checked whether the cert was in the cache and returned it if it was there and didn't need to be rotated. So even if RotateCertificate was called, the cert might still not have been rotated.

I believe this PR captures the previous behavior of rotation: https://github.com/openservicemesh/osm/pull/4580. I just want to make sure we understand the key functional changes.

shashankram commented 2 years ago

This could be related to a side-effect of #5002, where a cert that needs to be rotated is not being rotated because of a call to IssueCertificate(), which tries to return a cached copy if possible. Rotated certs must always be freshly issued, and the existing cache entry for that cert invalidated instead.

When IssueCertificate() is called on a certificate that is in the cache and needs to be rotated, shouldRotate() is called on the cert. If the cert needs to be rotated, shouldRotate() returns true and a new cert is issued using the same prefix as the old certificate. The existing entry in the cache for the prefix is updated.

Previously, there was a RotateCertificate method which called IssueCertificate(). In IssueCertificate() we checked whether the cert was in the cache and returned it if it was there and didn't need to be rotated. So even if RotateCertificate was called, the cert might still not have been rotated.

I believe this PR captures the previous behavior of rotation: #4580. I just want to make sure we understand the key functional changes.

There's also the addition of singleflight.Group.Do(), which is another layer of caching that can result in unintended behavior if there are multiple calls to IssueCertificate() at the same instant, in which case the result from the first call is shared as the response to subsequent calls:

    // a singleflight group is used here to ensure that only one issueCertificate is in
    // flight at a time for a given certificate prefix. Helps avoid a race condition if
    // issueCertificate is called multiple times in a row for the same certificate prefix.
    cert, err, shared := m.group.Do(prefix, func() (interface{}, error) {
        return m.issueCertificate(prefix, ct, opts...)
    })
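
To illustrate the sharing behavior being described, here is a standalone sketch using golang.org/x/sync/singleflight (not OSM code): concurrent callers that use the same key all receive the result of the single call that actually ran, with the shared flag reporting that the value was reused.

    package main

    import (
        "fmt"
        "sync"
        "time"

        "golang.org/x/sync/singleflight"
    )

    func main() {
        var g singleflight.Group
        var wg sync.WaitGroup

        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                // All three goroutines use the same key, so only one function body
                // executes; the other callers block and reuse its result.
                v, _, shared := g.Do("default.demo.cluster.local", func() (interface{}, error) {
                    time.Sleep(100 * time.Millisecond) // stand-in for issuing a cert
                    return "cert issued at " + time.Now().Format(time.RFC3339Nano), nil
                })
                fmt.Printf("caller %d got %q (shared=%v)\n", i, v, shared)
            }(i)
        }
        wg.Wait()
    }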

I'd rather have the rotation be an atomic operation that checks whether a cert needs to be rotated, then issues a new certificate for the CN and invalidates the previous cache entry. Further, this would mean not checking whether a certificate should be rotated inside IssueCertificate(), and instead letting the rotation path handle issuing the new cert, with clients performing a subsequent Get on the cert in response to an event.
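
A rough sketch of what that atomic check-and-rotate could look like (all types and method names here are hypothetical stand-ins, not OSM's actual certificate manager):

    package certrotation

    import (
        "sync"
        "time"
    )

    // Certificate is a trimmed-down stand-in for OSM's cert type.
    type Certificate struct {
        CommonName string
        NotAfter   time.Time
    }

    // Manager caches issued certs keyed by prefix. issueFn and notifyFn are
    // placeholders for the real issuer and the "cert rotated" notification.
    type Manager struct {
        mu             sync.Mutex
        cache          map[string]*Certificate
        renewBeforeExp time.Duration
        issueFn        func(prefix string) (*Certificate, error)
        notifyFn       func(prefix string, old, fresh *Certificate)
    }

    // RotateIfNeeded atomically checks the cached cert and, if it is expired or
    // close to expiring, issues a replacement and invalidates the old entry, so
    // callers can never observe a stale entry that "should" have been rotated.
    func (m *Manager) RotateIfNeeded(prefix string) (*Certificate, error) {
        m.mu.Lock()
        defer m.mu.Unlock()

        old, ok := m.cache[prefix]
        if ok && time.Until(old.NotAfter) > m.renewBeforeExp {
            return old, nil // still valid long enough; nothing to do
        }

        fresh, err := m.issueFn(prefix)
        if err != nil {
            return nil, err // keep the old entry rather than caching a failure
        }
        m.cache[prefix] = fresh // invalidate the previous entry by replacing it

        if m.notifyFn != nil {
            m.notifyFn(prefix, old, fresh) // let clients Get the new cert
        }
        return fresh, nil
    }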

steeling commented 2 years ago

So, agreed the above can be better, but I don't think it's what is causing the issues we are seeing. There's nothing to make me think a race condition could occur here that would persist after a cert has expired (assuming a new IssueCertificate request is generated, which we have seen in our debugging session).

It also doesn't explain the debugger mismatch.

Anyways, I have some thoughts on a potential fix:

  1. Remove the concept of rotating certs based on what's in the cache. This is flawed for a couple of reasons: a) a cert that lives outside the osm-controller's (or injector's) memory still needs rotation, but if the pod restarts it will no longer be in the cache, which requires tracking these certs externally; b) the cache is virtually unbounded, since we never remove any service certs.

  2. Abstract the cache behind a new interface with Get/Store methods. The implementation would only return non-expired certificates (or certificates with some amount of time left on them), and could have a goroutine that removes expired certs, fixing the current memory leak.
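
A minimal sketch of that Get/Store abstraction (hypothetical interface and field names, not OSM code; the Certificate struct is the same trimmed-down stand-in used in the earlier sketch):

    package certcache

    import (
        "sync"
        "time"
    )

    // Certificate is a trimmed-down stand-in for OSM's cert type.
    type Certificate struct {
        CommonName string
        NotAfter   time.Time
    }

    // CertCache hides the map behind Get/Store; Get treats certs that are
    // expired (or nearly expired) as cache misses.
    type CertCache interface {
        Get(key string) (*Certificate, bool)
        Store(key string, cert *Certificate)
    }

    type expiringCache struct {
        mu           sync.Mutex
        certs        map[string]*Certificate
        minRemaining time.Duration
    }

    func (c *expiringCache) Get(key string) (*Certificate, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        cert, ok := c.certs[key]
        if !ok || time.Until(cert.NotAfter) < c.minRemaining {
            return nil, false // missing, expired, or about to expire
        }
        return cert, true
    }

    func (c *expiringCache) Store(key string, cert *Certificate) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.certs[key] = cert
    }

    // janitor periodically drops expired entries so the cache stays bounded.
    func (c *expiringCache) janitor(interval time.Duration) {
        for range time.Tick(interval) {
            c.mu.Lock()
            for key, cert := range c.certs {
                if time.Now().After(cert.NotAfter) {
                    delete(c.certs, key)
                }
            }
            c.mu.Unlock()
        }
    }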

This poses the following problems:

  1. In the new flow, a request to CertManager won't know the difference between a brand new issue and a rotation, meaning we need to remove our reliance on the service cert pubsub.

  2. A service cert may expire during a period with no new events, in which case nothing would cause the cert to be rotated.

We can solve both problems through the use of something like the Ticker. However, more concretely, I recommend we remove the ticker, but add a new timer to the message broker's runProxyUpdateDispatcher that triggers after some duration D in which we have not sent out a proxy update. This ensures the CertManager's IssueCertificate is called to reinstantiate any expired certs, removes the reliance on the cert manager's service pubsub instance, and removes our memory leak in the cert cache.
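
A rough sketch of that fallback timer in the dispatcher loop (the channel, duration, and function names here are illustrative assumptions; runProxyUpdateDispatcher's real shape in the message broker may differ):

    package dispatcher

    import "time"

    // runProxyUpdateDispatcher forwards proxy-update events, but if no event has
    // arrived for duration d it fires a broadcast anyway. The forced resync
    // re-runs IssueCertificate for every proxy, so anything that expired while
    // the cluster was quiet still gets rotated.
    func runProxyUpdateDispatcher(updates <-chan struct{}, d time.Duration, broadcast func()) {
        timer := time.NewTimer(d)
        defer timer.Stop()
        for {
            select {
            case _, ok := <-updates:
                if !ok {
                    return // broker shut down
                }
                broadcast()
            case <-timer.C:
                broadcast() // idle for d: force a periodic resync
            }
            // Reset the idle timer after every dispatch so it measures the time
            // since the last proxy update was sent.
            if !timer.Stop() {
                select {
                case <-timer.C:
                default:
                }
            }
            timer.Reset(d)
        }
    }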

steeling commented 2 years ago

If anybody is digging into this tomorrow, I think a good line of investigation would be if the pubsub is getting backed up and with singleflight, preventing new IssueCertificate calls from actually doing anything https://github.com/openservicemesh/osm/issues/4847.

shashankram commented 2 years ago

If anybody is digging into this tomorrow, I think a good line of investigation would be if the pubsub is getting backed up and with singleflight, preventing new IssueCertificate calls from actually doing anything #4847.

I am curious to know if we still need to rely on the pub-sub model for generating Envoy configs with the snapshot cache. I am hoping using the snapshot cache is simple enough such that on an event (or batch of events), simply writing to the cache suffices, vs waking up individual gRPC stream goroutines per proxy via events.

steeling commented 2 years ago

That's a great question. I'd be curious to know how "expensive" it is to build out the configs each time, or if it would be deemed acceptable to continuously reconstruct all of the configs on a loop without needing to know that a k8s event is triggering a new event.

Right now we have the message broker acting as a notification mechanism that something changed and needs to be reconstructed and that immediately gets sent out.

With the snapshot cache we abstract away the "sending out" part (and it would be trivial to switch to the DeltaStream with that as well), but we're still missing the "when to construct configs". We can either do it on a loop, or keep the event based system, or use some combination of the 2.

I don't have too much opinion here. The loop would be much simpler, but might use up a lot more CPU if there are long periods of no k8s events.

The loop also solves our retry mechanism issue in certain scenarios, but we could still run a loop on a less frequent basis (every 2-5 min) in addition to the notification. Whereas with just the loop, I'd say we need to run it virtually continuously.

steeling commented 2 years ago

@shashankram so whether we move to the snapshot cache (and remove pubsub, fully leveraging a loop-based approach), or keep the snapshot cache (but keep pubsub plus a loop-based approach), I think the solution I propose in https://github.com/openservicemesh/osm/issues/5000#issuecomment-1217256850 would suffice, in that by making rotation implicit in calls to IssueCertificate, we remove the pubsub altogether (the cause of the issue).

Any objections before moving forward with that?

jaellio commented 2 years ago

Has the root cause been identified for this issue? If so, could it be reiterated here? Does the solution need to be included in a patch release of v1.2?

shashankram commented 2 years ago

@shashankram so whether we move to the snapshot cache (and remove pubsub, fully leveraging a loop-based approach), or keep the snapshot cache (but keep pubsub plus a loop-based approach), I think the solution I propose in #5000 (comment) would suffice, in that by making rotation implicit in calls to IssueCertificate, we remove the pubsub altogether (the cause of the issue).

Any objections before moving forward with that?

I think most of your suggestion seems reasonable. The part that bothers me is the need to rely on something like the Ticker to periodically update proxy certs when no event appears "for a while", because this could result in traffic loss until the ticker updates the proxies with the new certs. This is why I think we should continue to immediately update certs that are about to expire but haven't expired yet.

steeling commented 2 years ago

Yeah, that makes sense. We could mitigate that by making the ticker interval shorter than the cert check time, but in general I do agree with your point. I think having a ticker makes sense for this and a few other cases as well; solely relying on the ticker has its own problems, though, as you mention.

@jaellio I want to see if I can repro this issue to confirm the cause first.

keithmattix commented 2 years ago

@steeling Can you please update as soon as you have a repro? This is a high priority issue that I think is worth the time.

steeling commented 2 years ago

OK, so I have this happening right now, and I've confirmed it's none of the above theories. Here's my new one:

  1. Why the serial numbers are different:

The logs print the SerialNumber from the cert object, which is an aliased string; it is set via c.SerialNumber = SerialNumber(x509Cert.SerialNumber.String()), i.e., the decimal representation.

The debugger prints x509Cert.SerialNumber, which is a *big.Int, via fmt.Sprintf("... %x", x509Cert.SerialNumber), i.e., in hexadecimal. So the two outputs are the same serial number in different bases. This is confirmed with values taken from my environment and demoed here: https://go.dev/play/p/YueNgmGiGmw

  2. Why the certs are expired. The big red flag here is that the Valid Until: line in the debug output shows a positive time remaining, Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (16h45m27.4651171s remaining), while the x509.NotAfter line shows a negative time remaining, x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-71h30m32.0984815s remaining). If you look at the timestamps, they are the same, except for the m=+.

What's that? Apparently operating systems maintain both a monotonic clock and a wall clock; the difference may account for things like time changes. The m=+ shows an adjustment to the monotonic clock (my guess is this happens when the kind cluster goes to sleep).

Go keeps track of both the monotonic and the wall clock. From the time package: "[...] later time-measuring operations, specifically comparisons and subtractions, use the monotonic clock reading." This means that when we check the time we get the correct time, but when we measure, we add the m=+ (I would expect the number of seconds, 86504, to equal the delta, 71h + 16, or roughly 55 hours; however, it's ~24 hours, so that's currently an unknown).

The cert rotation uses a measurement check, which will use the monotonic clock. Again, I'm guessing that when the kind cluster shuts down, so does the monotonic clock, and running on a laptop will create large skews. My guess is the certs will only be rotated when the monotonic clock's notion of time deems them expired. Because of this, the flaw is most likely to be seen in dev environments.

Still, we should maintain parity with the x509 cert. We can fix this by truncating the monotonic reading on checks with cert.GetExpiration().Round(0) (taken from the time package's doc on how to ignore the monotonic clock for measurements).
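
A small self-contained illustration of the wall vs. monotonic distinction and of stripping the monotonic reading with Round(0) (the large skew in this bug needs a suspended VM/laptop to reproduce, so this only shows the mechanics):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // expiry carries a monotonic clock reading because it was derived from
        // time.Now(); printing it shows the "m=+..." suffix seen in the debugger.
        expiry := time.Now().Add(24 * time.Hour)
        fmt.Println(expiry)

        // time.Until(expiry) subtracts using the monotonic reading. If the
        // monotonic clock stops advancing (e.g. the VM sleeps) while the wall
        // clock keeps going, this measurement drifts from what x509.NotAfter
        // would suggest.
        fmt.Println(time.Until(expiry))

        // Round(0) strips the monotonic reading, so later comparisons and
        // subtractions fall back to the wall clock and stay in sync with the
        // certificate's actual NotAfter timestamp.
        wallOnly := expiry.Round(0)
        fmt.Println(wallOnly)             // no m=+ suffix
        fmt.Println(time.Until(wallOnly)) // wall-clock-based measurement
    }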

allenlsy commented 2 years ago

What's that? Apparently operating systems maintain both a monotonic clock and a wall clock; the difference may account for things like time changes. The m=+ shows an adjustment to the monotonic clock (my guess is this happens when the kind cluster goes to sleep).

The debug output above was all printed in a single call to getCertHandler(). Looking at these log lines:

     Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (15h40m42.8860671s remaining)
     x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (96h35m16.6775508s ago)
     x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-72h35m16.6775541s remaining)

From the log, I think the time at which it was printed is 2022-08-15.

Valid Until and x509.NotAfter have the same timestamp (2022-08-12 19:05:52), and both are computed with time.Until(<time object>) in the debugger, yet they produce two very different results.

In the source code of time.Until:

// Until returns the duration until t.
// It is shorthand for t.Sub(time.Now()).
func Until(t Time) Duration {
    var now Time
    if t.wall&hasMonotonic != 0 {
        // Common case optimization: if t has monotonic time, then Sub will use only it.
        now = Time{hasMonotonic, runtimeNano() - startNano, nil}
    } else {
        now = Now()
    }
    return t.Sub(now)
}

Did these two .Until() calls use two different now values? This is mysterious to me.

With the wrong cert.GetExpiration() value, shouldRotate() will return false. https://github.com/openservicemesh/osm/blob/de4a401c663795a6b3619666f0d45239dfa54270/pkg/certificate/manager.go#L178

steeling commented 2 years ago

@allenlsy that's due to the monotonic time, but you're on the right track!

There's a lot of text in this thread so it's easy to miss, but the last comment goes over it in more detail and calls out the issue.