Closed: Stono closed this issue 6 years ago.
@nmittler @wattli, please could you take a look at this issue?
There is a way to configure the rotation period: https://github.com/istio/istio/blob/master/security/cmd/istio_ca/main.go#L135
You can change it in https://github.com/istio/istio/blob/master/install/kubernetes/templates/istio-ca.yaml.tmpl#L39
Just add something like the snippet below to the istio-ca container's args.
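A sketch only; the flag name matches the --workload-cert-ttl flag discussed later in this thread, and the one-minute value is purely illustrative:
  args:
    - --workload-cert-ttl
    - 1m    # issue workload certs with a 1 minute lifetime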
And I would follow these debugging steps:
1) SSH into the caller pod and examine the cert at /etc/certs/cert-chain.pem (the decode command is sketched just after these steps).
[Note that this needs openssl installed in the pod; if it isn't, you can copy the cert out to your local machine.]
I usually check the timestamps of the decoded cert to make sure it's still valid (the lifetime is usually one hour). If the cert has already expired, then the CA must be down or otherwise failing to rotate it.
2) Do the same thing for the callee pod.
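The decode command referenced in step 1 is presumably something along these lines (run inside the pod, or locally against a copied file):
openssl x509 -in /etc/certs/cert-chain.pem -noout -text
# or, if you only want the validity window:
openssl x509 -in /etc/certs/cert-chain.pem -noout -dates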
@wattli thank you for the tip around CA rotation. I've lowered it to a minute whilst I do some debugging, and it's allowed me to get closer to the problem:
Executing curl on the istio-proxy of service a, to service b, works until the certificate is close to expiry, and then you see a "certificate has expired" message:
istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
* Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: Certificate is expired
* Closing connection 0
curl: (35) gnutls_handshake() failed: Certificate is expired
Then, sometimes even after certificate rotation we get:
istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
* Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* gnutls_handshake() failed: Error in the push function.
* Closing connection 0
curl: (35) gnutls_handshake() failed: Error in the push function.
And then eventually it will work (before breaking again at the next rotation):
istio-proxy@at-sauron-web-6c9648f448-9kp7x:/$ curl https://at-search-one-service:80 -v --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem -k
* Rebuilt URL to: https://at-search-one-service:80/
* Trying 10.192.12.38...
* Connected to at-search-one-service (10.192.12.38) port 80 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 0 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'at-search-one-service')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=#1300
* start date: Tue, 06 Feb 2018 08:48:42 GMT
* expire date: Tue, 06 Feb 2018 08:49:42 GMT
* issuer: O=k8s.cluster.local
* compression: NULL
* ALPN, server did not agree to a protocol
> GET / HTTP/1.1
> Host: at-search-one-service:80
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Expires: 0
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< X-XSS-Protection: 1; mode=block
< Server: SearchOne (c) 2014-2018, AutoTrader PLC
< Pragma: no-cache
< X-Frame-Options: DENY
< Date: Tue, 06 Feb 2018 08:49:38 GMT
< Connection: keep-alive
< WWW-Authenticate: Basic realm="Realm"
< X-Content-Type-Options: nosniff
< Transfer-Encoding: chunked
< Content-Type: application/json;charset=UTF-8
<
{
"timestamp" : 1517906978181,
"status" : 401,
"error" : "Unauthorized",
"message" : "Full authentication is required to access this resource",
"path" : "/"
}
* Connection #0 to host at-search-one-service left intact
OK, so watching this a little more: the certificates aren't rotating at even intervals either.
So this was one epoch:
Validity
Not Before: Feb 6 09:05:40 2018 GMT
Not After : Feb 6 09:06:40 2018 GMT
And then the next epoch (which would have needed to be 09:07:40 in order to guarantee a valid cert):
Validity
Not Before: Feb 6 09:06:52 2018 GMT
Not After : Feb 6 09:07:52 2018 GMT
As you can see, there is a 12-second window in which service a would have been using an invalid certificate!
Just to confirm this, here's another epoch rotation, this time with a 26-second window:
Validity
Not Before: Feb 6 09:08:20 2018 GMT
Not After : Feb 6 09:09:20 2018 GMT
Validity
Not Before: Feb 6 09:09:46 2018 GMT
Not After : Feb 6 09:10:46 2018 GMT
It feels to me like you have a "black spot" in certificate validity; if you created your certificates with a buffer either side of the --workload-cert-ttl, you may not have this problem. That buffer should be at least a multiple of the drainDuration.
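To make the "buffer" idea concrete, here is a rough Go sketch (function names and the numbers are mine, not Istio's actual API): the renewal threshold is derived from the propagation costs rather than being a fixed small constant, so a replacement cert is always in place before the old one expires.

package main

import (
	"fmt"
	"time"
)

// renewBefore returns how long before expiry a workload cert should be
// reissued, so the replacement has time to propagate: kubelet syncing the
// secret volume, the proxy agent's file-watcher poll, and the Envoy
// drain/restart all have to finish while the old cert is still valid.
func renewBefore(drainDuration, watcherPoll, secretSync time.Duration) time.Duration {
	return 2*drainDuration + watcherPoll + secretSync
}

// shouldRenew is the check a CA-side secret controller would apply to each
// existing cert on every resync.
func shouldRenew(notAfter time.Time, buffer time.Duration) bool {
	return time.Until(notAfter) < buffer
}

func main() {
	// Illustrative numbers only: 45s drain, 10s watcher poll, 60s secret sync.
	buffer := renewBefore(45*time.Second, 10*time.Second, 60*time.Second)
	fmt.Println("renew when less than", buffer, "of validity remains") // 2m40s
	// A cert with only 30s of validity left would be renewed immediately:
	fmt.Println(shouldRenew(time.Now().Add(30*time.Second), buffer)) // true
}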
Right, I've got an even more specific example of the problem now, which I can demonstrate on the call later.
Set the --workload-cert-ttl duration to 2 minutes and watch the certs in the istio-proxy container (-c istio-proxy) on both service 1 and service 2.
You will see the certificate rotate on one of the services but not the other, and you'll end up with both services in play but one of them using an out-of-date certificate!
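A simple way to watch the rotation on both sides (pod names are placeholders; openssl runs locally against the piped-out cert) might be:
watch -n 5 'kubectl exec <service-1-pod> -c istio-proxy -- cat /etc/certs/cert-chain.pem | openssl x509 -noout -dates'
watch -n 5 'kubectl exec <service-2-pod> -c istio-proxy -- cat /etc/certs/cert-chain.pem | openssl x509 -noout -dates'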
Do the istio-ca and the Envoy epoch run on their own intervals? Because what I'm observing is istio-ca correctly updating the certificate secret within the time window, but then the proxy not updating its copy of the mounted secrets until some arbitrary epoch time passes. So:
default/istio.default already exists and has, say, ttl 30s left
istio-ca sees default/istio.default near expiry and updates the secret

@myidpt is looking into this.
@Stono from the latter error you described, I think it is due to the proxy agent not restarting Envoy to reload the certs in time. In the two-service test that you did, the services share the default service account, and in K8s there is only one key/cert per service account, which is mounted into all pods that use that service account. So if one service sees the new key/cert, the other one should also have it, unless the K8s secret mount has an issue, which is unlikely. On the proxy agent, the file watcher interval is 10 sec, which should be good enough: https://github.com/istio/istio/blob/release-0.5.1/pilot/pkg/proxy/envoy/watcher.go#L79 Maybe there are other bugs, or restarting Envoy takes too much time. I'm still investigating this.
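One way to check whether the CA-side copy is fresh, independent of what is mounted in the pod, is to read the secret directly, for example (this assumes the default service account, hence the istio.default secret name and its cert-chain.pem data key):
kubectl get secret istio.default -o jsonpath='{.data.cert-chain\.pem}' | base64 -d | openssl x509 -noout -dates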
Hey @myidpt - I just had a call with a couple of the devs. I actually think the issue stems from https://github.com/istio/istio/blob/release-0.5.1/security/pkg/pki/ca/controller/secret.go#L51 (the CA will only renew certificates if they're less than 1 minute from expiry).
This is too close, because the Envoys still need to tick over and pull the update.
In some scenarios I've actually seen istio-ca renewing certs AFTER they've expired.
Thanks for investigating. That's definitely too close. I'm still trying to see why istio-ca could renew the certs AFTER they are expired. To get more context: you are using 0.5, right? And have you tested on Istio 0.4?
@myidpt: We didn't, but we also haven't done as extensive testing on 0.4.0 as we have been doing on 0.5.0, so it's quite possible it was there and slipped under the radar! I can test on 0.4.0 quite easily now as we've bootstrapped an istio installation. Give me 10 minutes.
@myidpt actually I can't test it easily because --workload-cert-ttl isn't available on 0.4.0 and I don't have an hour to wait :D Put it this way: we were running 0.4.0 for about 3 weeks and can't say we noticed this; however, we weren't running 24/7 constant tests like we are now...
I see. Thanks @Stono. FYI, workload-cert-ttl was called cert-ttl in 0.4.0: https://github.com/istio/istio/blob/release-0.4.0/security/cmd/istio_ca/main.go#L109 But that's OK, we can do the test :)
This is the fix for V0.5.1: https://github.com/istio/istio/pull/3242
Hi, we're having issues with the mTLS certificate rotation causing outages between our services. As you can see here, one of our services is saying the response was in an unexpected format; that is because the response was
upstream connect error or disconnect/reset before headers
from Envoy, rather than the JSON it was expecting. As you can see, this was multiple attempts over the space of 20 seconds (quite a while really, when this is happening every hour). I noticed the correlation with the certificate rotations on the service it was connecting to:
This causes us some issues; we can't have instances of our service going down hourly because of this, so we would love some support on fixing the issue.
Also, the feedback loop of an hour for changes here makes debugging and testing this issue quite difficult. Is there any way to force an epoch rotation on a service?
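(Perhaps deleting the per-service-account secret and letting istio-ca recreate it would do it? Assuming the CA reissues secrets it manages when they go missing, something like:)
kubectl delete secret istio.default -n default
# istio-ca should notice the missing secret, issue a fresh key/cert pair,
# and the updated secret then propagates to the pods using that service account.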