
mTLS certificate rotation resulting in outages #3155

Closed Stono closed 6 years ago

Stono commented 6 years ago

Hi, we're having issues with mTLS certificate rotation causing outages between our services. As you can see here, one of our services is reporting that a response was in an unexpected format; that is because the response was "upstream connect error or disconnect/reset before headers" from Envoy, rather than the JSON it was expecting. As you can see, this happened over multiple attempts in the space of 20 seconds (quite a while, really, when this is happening every hour).

2018-02-04 15:47:12,255 ERROR s.c.d.HttpResponseErrorHandler:30 - SearchOne returns error response in unexpected format.
upstream connect error or disconnect/reset before headers

2018-02-04 15:47:18,836 ERROR s.c.d.HttpResponseErrorHandler:30 - SearchOne returns error response in unexpected format.
upstream connect error or disconnect/reset before headers

2018-02-04 15:47:22,163 ERROR s.c.d.HttpResponseErrorHandler:30 - SearchOne returns error response in unexpected format.
upstream connect error or disconnect/reset before headers

2018-02-04 15:47:32,290 ERROR s.c.d.HttpResponseErrorHandler:30 - SearchOne returns error response in unexpected format.
upstream connect error or disconnect/reset before headers

I noticed the correlation with the certificate rotations on the service it was connecting to:

2018-02-04T15:47:30.109823Z     info    watchFileEvents: "/etc/certs//..2982_04_02_15_47_30.137088602": CREATE
2018-02-04T15:47:30.109934Z     info    watchFileEvents: "/etc/certs//..2982_04_02_15_47_30.137088602": MODIFY|ATTRIB
2018-02-04T15:47:30.109969Z     info    watchFileEvents: "/etc/certs//..data_tmp": CREATE
2018-02-04T15:47:30.110021Z     info    watchFileEvents: "/etc/certs//..data_tmp": RENAME
2018-02-04T15:47:30.110044Z     info    watchFileEvents: "/etc/certs//..data": CREATE
2018-02-04T15:47:30.110074Z     info    watchFileEvents: "/etc/certs//..2982_04_02_14_47_15.251940806": DELETE
2018-02-04T15:47:40.110091Z     info    watchFileEvents: notifying
2018-02-04T15:47:40.110452Z     info    Received new config, resetting budget
2018-02-04T15:47:40.110471Z     info    Reconciling configuration (budget 10)
2018-02-04T15:47:40.110578Z     info    Epoch 10 starting

This causes us real problems; we can't have instances of our service going down hourly because of this, so we would love some support in fixing the issue.

Also, the hour-long feedback loop for changes here makes debugging and testing this issue quite difficult. Is there any way to force an epoch rotation on a service?

sakshigoel12 commented 6 years ago

@nmittler @wattli Could you please take a look at this issue?

wattli commented 6 years ago

There is a way to configure the rotation period: https://github.com/istio/istio/blob/master/security/cmd/istio_ca/main.go#L135

And you can change it at https://github.com/istio/istio/blob/master/install/kubernetes/templates/istio-ca.yaml.tmpl#L39

Just add something like:
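
(The exact snippet depends on your install; as a minimal sketch, a flag along these lines in the istio-ca container's arguments would do it. The one-minute value is purely illustrative for debugging, matching what Stono uses later in this thread.)

--workload-cert-ttl=1m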

wattli commented 6 years ago

And I would follow these debugging steps:

1) SSH into the caller pod and examine the cert at /etc/certs/cert-chain.pem with this command:

openssl x509 -noout -text -in /etc/certs/cert-chain.pem

[note that you need openssl installed; if not, you can copy this cert out to your local machine]

Usually I check the timestamps of the decoded cert to make sure it's still valid (the lifetime is usually one hour). If the cert has already expired, then the CA must be down or something.

2) Do the same thing for the callee pod.
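
For example (pod name is a placeholder; the second variant copies the cert out if openssl isn't installed in the container):

# Print just the validity window of the cert the sidecar has mounted
kubectl exec <caller-pod> -c istio-proxy -- openssl x509 -noout -dates -in /etc/certs/cert-chain.pem

# If openssl isn't available inside the container, copy the cert out and inspect it locally
kubectl exec <caller-pod> -c istio-proxy -- cat /etc/certs/cert-chain.pem > cert-chain.pem
openssl x509 -noout -dates -in cert-chain.pem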

Stono commented 6 years ago

@wattli thank you for the tip about CA rotation. I've lowered it to a minute while I do some debugging, and it has allowed me to get closer to the problem:

Executing curl on the istio-proxy of service a to service b works until the certificate is close to expiry, at which point you see a "Certificate is expired" message:

istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
*   Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: Certificate is expired
* Closing connection 0
curl: (35) gnutls_handshake() failed: Certificate is expired

Then, sometimes even after certificate rotation we get:

istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
*   Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: Error in the push function.
* Closing connection 0
curl: (35) gnutls_handshake() failed: Error in the push function.

And then eventually it will work (before breaking again at the next rotation):

istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
*   Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification SKIPPED
*        server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
*        common name:  (does not match 'at-search-one-service')
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: O=#1300
*        start date: Tue, 06 Feb 2018 08:48:42 GMT
*        expire date: Tue, 06 Feb 2018 08:49:42 GMT
*        issuer: O=k8s.cluster.local
*        compression: NULL
* ALPN, server did not agree to a protocol
> GET / HTTP/1.1
> Host: at-search-one-service:80
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Expires: 0
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< X-XSS-Protection: 1; mode=block
< Server: SearchOne (c) 2014-2018, AutoTrader PLC
< Pragma: no-cache
< X-Frame-Options: DENY
< Date: Tue, 06 Feb 2018 08:49:38 GMT
< Connection: keep-alive
< WWW-Authenticate: Basic realm="Realm"
< X-Content-Type-Options: nosniff
< Transfer-Encoding: chunked
< Content-Type: application/json;charset=UTF-8
<
{
  "timestamp" : 1517906978181,
  "status" : 401,
  "error" : "Unauthorized",
  "message" : "Full authentication is required to access this resource",
  "path" : "/"
* Connection #0 to host at-search-one-service left intact

Stono commented 6 years ago

OK, so watching this a little more: the certificates aren't rotating at even intervals either.

So this was one epoch:

        Validity
            Not Before: Feb  6 09:05:40 2018 GMT
            Not After : Feb  6 09:06:40 2018 GMT

And then the next epoch (whose validity would have needed to run 09:06:40 to 09:07:40 in order to guarantee a continuously valid cert):

        Validity
            Not Before: Feb  6 09:06:52 2018 GMT
            Not After : Feb  6 09:07:52 2018 GMT

As you can see, there is a 12-second window (between the old cert's Not After at 09:06:40 and the new cert's Not Before at 09:06:52) in which service a would have been presenting an invalid certificate!

Stono commented 6 years ago

Just to confirm this, here's another epoch rotation, this time with a 26-second window:

        Validity
            Not Before: Feb  6 09:08:20 2018 GMT
            Not After : Feb  6 09:09:20 2018 GMT
        Validity
            Not Before: Feb  6 09:09:46 2018 GMT
            Not After : Feb  6 09:10:46 2018 GMT

Stono commented 6 years ago

It feels to me like you have a "black spot" in certificate validity; if you created your certificates with a buffer on either side of the --workload-cert-ttl, you might not have this problem.

That buffer should be at least a multiple of the drainDuration.

Stono commented 6 years ago

Right, I've got an even more specific example of the problem now, which I can demonstrate on the call later.

You will see the certificate rotate on one of the services but not the other, and you'll end up with both services in play but one of them using an out-of-date certificate!

Stono commented 6 years ago

Do the istio-ca and the Envoy epoch run on their own intervals? Because what I'm observing is istio-ca correctly updating the certificate secret within the time window, but the proxy not updating its copy of the mounted secrets until some arbitrary epoch time passes, so:

  1. 00:00: secret default/istio.default already exists and has, say, 30s of TTL left
  2. 00:01: a new instance of the pod spawns, mounts the secret, and all works
  3. 00:20: istio-ca correctly detects secret default/istio.default is near expiry and updates the secret
  4. 00:30: the pod stops working as its certificate has expired
  5. 02:00: an epoch restart triggers, updating the cert, and the pod starts working again
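
A rough way to compare the two (secret name per the timeline above, key names as in /etc/certs; the pod name is a placeholder):

# What the CA last wrote to the secret:
kubectl get secret istio.default -o jsonpath='{.data.cert-chain\.pem}' | base64 -d | openssl x509 -noout -enddate
# What the proxy actually has mounted on disk:
kubectl exec <pod> -c istio-proxy -- openssl x509 -noout -enddate -in /etc/certs/cert-chain.pem

If the first shows a fresh expiry while the proxy is still serving the old cert, the delay is on the mount/reload side rather than the CA.
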
wattli commented 6 years ago

@myidpt is looking into this.

myidpt commented 6 years ago

@Stono from the latter error you described, I think it is due to the proxy agent not restarting Envoy to reload the certs in a timely manner. In the two-service test that you did, the services share the default service account, and in K8s there is only one key/cert per service account, which is mounted into all pods that use that service account. So if one service sees the new key/cert, the other one should also have it, unless the K8s secret mount has an issue, which is unlikely to happen. On the proxy agent, the file watcher interval is 10 seconds, which should be good enough: https://github.com/istio/istio/blob/release-0.5.1/pilot/pkg/proxy/envoy/watcher.go#L79 Maybe there are other bugs, or restarting Envoy takes too much time. I'm still investigating this.
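
One way to check whether the agent actually restarted Envoy after the cert files changed is to grep the sidecar logs for the watcher and epoch lines quoted earlier in this issue (pod name is a placeholder):

kubectl logs <pod> -c istio-proxy | grep -E 'watchFileEvents|Epoch [0-9]+ starting'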

Stono commented 6 years ago

Hey @myidpt, I just had a call with a couple of the devs. I actually think the issue stems from https://github.com/istio/istio/blob/release-0.5.1/security/pkg/pki/ca/controller/secret.go#L51 (the CA will only renew certificates when they're less than 1 minute from expiry).

This is cutting it too close, because the Envoys still need to tick over and pull the update.

In some scenarios I've actually seen istio-ca renewing certs AFTER they've expired.
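
For anyone reproducing this, a quick sketch to see how much margin is left on the mounted cert at any moment (run inside the istio-proxy container; assumes GNU date is available):

# Seconds until the mounted cert expires; negative means we're already in the outage window
exp=$(openssl x509 -noout -enddate -in /etc/certs/cert-chain.pem | cut -d= -f2)
echo "seconds until expiry: $(( $(date -d "$exp" +%s) - $(date +%s) ))"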

myidpt commented 6 years ago

Thanks for investigating. That's definitely too close. I'm still trying to see why istio-ca could renew the certs AFTER they are expired. To get more context: you are using 0.5, right? And have you tested on Istio 0.4?

Stono commented 6 years ago

@myidpt: We didn't, but we also haven't done as extensive testing on 0.4.0 as we have been doing on 0.5.0, so it's quite possible it was there and slipped under the radar! I can test on 0.4.0 quite easily now, as we've bootstrapped an Istio installation. Give me 10 minutes.

Stono commented 6 years ago

@myidpt actually I can't test it easily, because --workload-cert-ttl isn't available on 0.4.0 and I don't have an hour to wait :D Put it this way: we were running 0.4.0 for about 3 weeks and can't say we noticed this; however, we weren't running constant 24/7 tests like we are now...

myidpt commented 6 years ago

I see. Thanks @Stono. FYI, workload-cert-ttl was called cert-ttl in 0.4.0: https://github.com/istio/istio/blob/release-0.4.0/security/cmd/istio_ca/main.go#L109. But that's OK, we can do the test :)

myidpt commented 6 years ago

This is the fix for V0.5.1: https://github.com/istio/istio/pull/3242