sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Gnmi Client Access Fails with Server after Rotating Telemetry Certificates #19649

Closed wumiaont closed 3 months ago

wumiaont commented 3 months ago

Description

The issue was found while running test_telemetry_cert_rotation. Two cases fail: test_telemetry_cert_rotate and test_telemetry_cert_add. Both are related to the certificate rotation used in the test.

After the certificates are rotated under /etc/sonic/telemetry, the gnmi service is restarted automatically to pick up the new certs. After that restart, the gnmi client connection to the server fails.

    root@6b065efa96a8:~# python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m get -x proc/uptime -xt OTHERS -o ndastreamingservertest
    Performing GetRequest, encoding=JSON_IETF to 10.250.6.231 with the following gNMI Path
    [elem { name: "proc" } elem { name: "uptime" } ]
    E0722 18:00:31.336695621 30083 ssl_transport_security.cc:1469] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.
    Client receives an exception 'failed to connect to all addresses' indicating gNMI server is shut down and Exiting ...

From the error, it seems the telemetry process is not honoring --allow_no_client_auth even though the flag is set.

This is the telemetry process inside the gnmi server:

    root 21 1 1 17:59 pts/0 00:00:00 /usr/sbin/telemetry -logtostderr --server_crt /etc/sonic/telemetry/streamingtelemetryserver.cer --server_key /etc/sonic/telemetry/streamingtelemetryserver.key --ca_crt /etc/sonic/telemetry/dsmsroot.cer --port 8080 --allow_no_client_auth -v=2 --threshold 100 --idle_conn_duration 5

Steps to reproduce the issue:

  1. Follow the instructions in test test_telemetry_cert_rotate. You can manually generate a server cert and a CA cert using openssl to replace the existing certs.
  2. Watch the gnmi container restart after the certs change.
  3. Run from the ptf server:
        python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m get -x proc/uptime -xt OTHERS -o ndastreamingservertest
     It will fail with CERTIFICATE_VERIFY_FAILED. Restarting the gnmi process may make the gnmi client work, but in my testing it could either succeed or still fail after restarting gnmi.

Describe the results you received:

telemetry cert rotation test failed

Describe the results you expected:

telemetry cert rotation should work with no issue.

Output of show version:

202405 and master

wumiaont commented 3 months ago

It was found that when certs are created through duthost.cmd, the cert's start time (Not Before) is about 6 minutes later than the current time on the DUT. This causes the cert verification failure and the test failure.

    openssl x509 -in streamingtelemetryserver.cer -text -noout
    Certificate:
        Data:
            Version: 3 (0x2)
            Serial Number:
                2b:a3:2c:5b:53:d3:7e:81:f6:7d:2d:4d:68:0d:cd:79:34:40:ab:d3
            Signature Algorithm: sha256WithRSAEncryption
            Issuer: CN = ndastreamingservertest
            Validity
                Not Before: Jul 25 16:41:19 2024 GMT
                Not After : Aug 24 16:41:19 2024 GMT
            Subject: CN = ndastreamingservertest
            Subject Public Key Info:
                Public Key Algorithm: rsaEncryption
                    Public-Key: (2048 bit)
                    Modulus:
                        00:be:9a:b5:c5:b1:62:3d:c4:e9:8a:9b:05:bb:98:
                        8e:cc:08:dc:88:c8:07:81:34:8f:f3:b3:a3:ff:c7:
                        86:85:34:9c:a6:67:5d:3a:74:86:88:b6:f6:e1:78:
                        ec:56:7e:1a:d7:a8:5e:de:3c:3f:97:df:73:84:13:
                        9f:6a:e2:08:f9:7a:77:59:e0:8b:47:e0:4e:70:82:
                        a4:da:99:de:da:57:38:0b:62:c4:6a:b8:fb:cf:b1:
                        d2:ea:21:9e:74:0f:9e:55:c7:56:b0:97:a5:00:f1:
                        46:97:a5:f6:49:71:35:09:ee:9b:b0:f9:83:99:bc:
                        bd:02:b6:f9:06:ce:ed:ad:6e:16:db:99:f0:8e:a8:
                        37:7c:4c:62:e9:36:ff:58:bd:28:42:5f:af:ea:c7:
                        62:33:91:f8:1f:6f:83:65:8a:fc:4d:e5:5e:19:27:
                        9c:35:fb:4c:c9:08:b8:47:a9:c4:ab:4f:cc:b9:2e:
                        bf:5a:4f:5f:37:07:b7:54:28:cf:98:b7:7f:42:da:
                        22:22:73:b0:87:2f:c4:43:90:d2:9d:9c:54:c4:cb:
                        d7:24:67:d1:ba:a3:35:25:10:26:0c:a1:ee:f8:3b:
                        fb:d4:ca:a8:fa:53:50:bf:4e:2a:54:be:36:a3:fa:
                        fb:2f:e4:78:b8:60:42:a0:ce:5a:db:fe:f8:a8:9d:
                        3e:95
                    Exponent: 65537 (0x10001)
            X509v3 extensions:
                X509v3 Subject Key Identifier:
                    DF:15:F6:D3:DA:C2:09:EB:8D:3E:69:C2:A4:94:F8:7D:E2:05:74:C2
                X509v3 Authority Key Identifier:
                    DF:15:F6:D3:DA:C2:09:EB:8D:3E:69:C2:A4:94:F8:7D:E2:05:74:C2
                X509v3 Basic Constraints: critical
                    CA:TRUE
        Signature Algorithm: sha256WithRSAEncryption
        Signature Value:
            26:88:71:22:da:4d:c6:9b:a3:99:67:cf:4b:44:79:c5:b9:e9:
            a6:13:f5:1d:45:c8:ce:a7:a7:d1:52:9c:ca:51:33:84:fb:a4:
            79:40:a0:e4:34:d4:7f:bc:39:21:09:e2:3d:2a:d1:15:ad:a0:
            a3:88:b3:a8:e5:3b:f0:1f:6d:a9:66:65:ac:68:96:b1:3c:fa:
            1c:a5:ae:a1:76:bc:48:72:39:5a:42:fa:bd:2b:c5:d2:60:3a:
            f0:a2:5b:de:4a:62:f6:21:01:22:44:1f:90:be:de:c3:4d:b0:
            a2:fd:11:c6:96:66:28:74:2d:6b:10:05:b6:4d:e3:a4:72:81:
            89:01:5e:56:f7:1e:fb:fe:61:48:9e:c0:9d:09:ea:42:d7:3c:
            af:7c:36:22:b4:f8:48:b1:b8:e0:f6:5c:37:63:ed:78:df:b5:
            b4:72:a9:04:e3:7a:10:23:b1:cf:2f:f7:d5:c0:31:12:1a:14:
            49:64:5b:9b:1d:c2:f7:35:fb:e9:fe:be:ee:f1:bd:c6:13:20:
            4c:f7:1a:cf:a1:db:05:c8:b6:79:cc:2d:fd:b8:91:cd:2b:17:
            22:1d:46:c0:ab:bc:28:ed:76:2c:c2:c8:53:56:64:5a:67:b0:
            1b:8f:36:f6:45:94:d6:10:aa:f7:9e:48:c3:9e:82:a7:03:67:
            97:cc:c4:f9
    admin@ixre-egl-board40:/etc/sonic/telemetry$
    admin@ixre-egl-board40:/etc/sonic/telemetry$ date
    Thu Jul 25 04:35:43 PM UTC 2024
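The gap between the Not Before time and the DUT clock in the output above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `parse_openssl_time` and `seconds_until_valid` are hypothetical and not part of the test suite, and the timestamps are copied by hand from the output rather than read from the certificate file.

```python
from datetime import datetime, timezone

# Time layout printed by `openssl x509 -text`, e.g. "Jul 25 16:41:19 2024 GMT".
# Note: %b depends on the process locale; this sketch assumes an English/C locale.
OPENSSL_TIME_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_openssl_time(text):
    """Parse an openssl validity timestamp into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(text.strip(), OPENSSL_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def seconds_until_valid(not_before_text, now):
    """Return how many seconds remain until the cert becomes valid (0 if already valid)."""
    not_before = parse_openssl_time(not_before_text)
    return max(0, int((not_before - now).total_seconds()))


# Values taken from the output above: the `date` result on the DUT and the
# certificate's Not Before field.
dut_now = datetime(2024, 7, 25, 16, 35, 43, tzinfo=timezone.utc)
wait = seconds_until_valid("Jul 25 16:41:19 2024 GMT", dut_now)
print(wait)  # 336 seconds, i.e. 5 minutes 36 seconds in the future
```

A non-zero result means a TLS peer using the DUT's clock would treat the certificate as not yet valid, which is consistent with the CERTIFICATE_VERIFY_FAILED error reported above.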

wumiaont commented 3 months ago

I have tested that if I create the certificate from the host console, its validity start time is the current time. There is no delay.

wumiaont commented 3 months ago

def rotate_telemetry_certs(duthost, localhost):
    path = "/etc/sonic/telemetry/"

    # Create new certs to rotate
    cmd = "sudo openssl req \
              -x509 \
              -sha256 \
              -nodes \
              -newkey rsa:2048 \
              -keyout streamingtelemetryserver.key \
              -subj '/CN=ndastreamingservertest' \
              -out streamingtelemetryserver.cer"
    localhost.shell(cmd)
    cmd = "sudo openssl req \
              -x509 \
              -sha256 \
              -nodes \
              -newkey rsa:2048 \
              -keyout dsmsroot.key \
              -subj '/CN=ndastreamingclienttest' \
              -out dsmsroot.cer"
    localhost.command(cmd)

wumiaont commented 3 months ago

Found that the issue is related to the mgmt container time. The container time is 7 minutes late. It is still a mystery why that causes the cert generated on the duthost to have a later start time. After syncing the time on the mgmt container and running OC again, the issue is no longer seen.
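One way to catch this kind of clock mismatch before rotating certs is to compare the two clocks up front. The sketch below is an assumption-laden illustration, not code from the test suite: `clock_skew_seconds` and `check_clock_skew` are hypothetical helpers, the two arguments are hypothetical callables that run a shell command on the mgmt container and on the DUT and return its stdout, and the 5-second tolerance is an arbitrary choice.

```python
def clock_skew_seconds(run_on_controller, run_on_dut):
    """Return controller time minus DUT time, in whole seconds.

    Both arguments are hypothetical callables: each takes a shell command
    string, runs it on its host, and returns the command's stdout as text.
    """
    controller_now = int(run_on_controller("date +%s").strip())
    dut_now = int(run_on_dut("date +%s").strip())
    return controller_now - dut_now


def check_clock_skew(run_on_controller, run_on_dut, tolerance=5):
    """Raise RuntimeError if the two clocks differ by more than `tolerance` seconds."""
    skew = clock_skew_seconds(run_on_controller, run_on_dut)
    if abs(skew) > tolerance:
        raise RuntimeError(
            "clock skew of {} s between mgmt container and DUT exceeds {} s".format(
                skew, tolerance
            )
        )
    return skew


# Stand-in runners for illustration only; real ones would execute the command
# on each host. Here the controller clock is 336 s ahead of the DUT.
fake_controller = lambda cmd: "1000336\n"
fake_dut = lambda cmd: "1000000\n"
print(clock_skew_seconds(fake_controller, fake_dut))
```

A positive skew means the host generating the certs is ahead of the DUT, so a freshly created cert would start its validity in the DUT's future.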