juju-solutions / layer-tls-client

A Juju layer for tls-client code, to add client code to your charms.
1 stars 8 forks source link

"unsupported operand type" error during unit teardown caused by stale certificates.server.cert.available flag #22

Open Vultaire opened 3 years ago

Vultaire commented 3 years ago

If, during removal of a unit using layer:tls-client and interface:tls-certificate, if layer:tls-client's store_server handler gets queued to run before interface:tls-certificate's TlsRequires.joined() method, it can result in errors during teardown under certain circumstances.

This can trigger the following failure:

  File "/var/lib/juju/agents/unit-kubernetes-worker-26/charm/reactive/tls_client.py", line 51, in store_server
    server_cert = server_cert + '\n' + chain
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I have a few units which have hit this issue, which I was attempting to remove from the model when this occurred. They're related to vault, and were formerly related to 3 units, but only one unit remains related. This unit does not have the \.server.cert nor \.server.key values for the unit in question, nor are there any \.processed_requests fields present in the relation either for that unit, and thus TlsRequires.get_server_cert() will return (None, None).

This wouldn't be a problem if layer:tls-client's store_server handler wasn't called. And it shouldn't be called... except that it fires on certificates.server.cert.available, which should be cleared when TlsRequires.joined() is called, except that it won't be called until after store_server is.

Thus, store_server is called, thinking it has a cert to store, when in fact it does not. It pulls get_server_cert, gets a bunch of nothing back, and things blow up as described in the above traceback.

Vultaire commented 3 years ago

To give a little more precise context, here is a snippet of juju unit logs during a dispatch loop just before the error started to occur:

tracer: main dispatch loop, 19 handlers queued
tracer: ++   queue handler hooks/relations/aws-integration/requires.py:106:remove_ready:aws
tracer: ++   queue handler hooks/relations/azure-integration/requires.py:114:remove_ready:azure
tracer: ++   queue handler hooks/relations/gcp-integration/requires.py:116:remove_ready:gcp
tracer: ++   queue handler hooks/relations/http/provides.py:15:broken:ingress-proxy
tracer: ++   queue handler hooks/relations/http/requires.py:10:changed:kube-api-endpoint
tracer: ++   queue handler hooks/relations/mount/requires.py:18:broken:nfs
tracer: ++   queue handler hooks/relations/openstack-integration/requires.py:84:remove_ready:openstack
tracer: ++   queue handler hooks/relations/tls-certificates/requires.py:79:joined:certificates
tracer: ++   queue handler hooks/relations/vsphere-integration/requires.py:81:remove_ready:vsphere
tracer: ++   queue handler reactive/kubernetes_worker.py:1198:notify_master_gpu_not_enabled
tracer: ++   queue handler reactive/kubernetes_worker.py:1207:request_kubelet_and_proxy_credentials
tracer: ++   queue handler reactive/kubernetes_worker.py:1219:catch_change_in_creds
tracer: ++   queue handler reactive/kubernetes_worker.py:1464:update_registry_location
tracer: ++   queue handler reactive/kubernetes_worker.py:279:safely_join_cohort
tracer: ++   queue handler reactive/kubernetes_worker.py:400:set_app_version
tracer: ++   queue handler reactive/kubernetes_worker.py:546:send_data
tracer: ++   queue handler reactive/tls_client.py:18:store_ca
tracer: ++   queue handler reactive/tls_client.py:44:store_server
tracer: ++   queue handler reactive/tls_client.py:71:store_client
2021-02-26 00:33:22 INFO juju-log kube-control:21: Invoking reactive handler: reactive/tls_client.py:18:store_ca
2021-02-26 00:33:22 INFO juju-log kube-control:21: Invoking reactive handler: reactive/tls_client.py:44:store_server
2021-02-26 00:33:23 INFO juju-log kube-control:21: Invoking reactive handler: reactive/tls_client.py:71:store_client
2021-02-26 00:33:23 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:279:safely_join_cohort
2021-02-26 00:33:23 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:400:set_app_version
2021-02-26 00:33:23 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:546:send_data
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:1198:notify_master_gpu_not_enabled
2021-02-26 00:33:24 INFO juju-log kube-control:21: Setting gpu=False on kube-control relation
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:1207:request_kubelet_and_proxy_credentials
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:1219:catch_change_in_creds
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: reactive/kubernetes_worker.py:1464:update_registry_location
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/aws-integration/requires.py:106:remove_ready:aws
2021-02-26 00:33:24 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/http/requires.py:10:changed:kube-api-endpoint
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/http/provides.py:15:broken:ingress-proxy
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/gcp-integration/requires.py:116:remove_ready:gcp
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/mount/requires.py:18:broken:nfs
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/azure-integration/requires.py:114:remove_ready:azure
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/tls-certificates/requires.py:79:joined:certificates
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/openstack-integration/requires.py:84:remove_ready:openstack
2021-02-26 00:33:25 INFO juju-log kube-control:21: Invoking reactive handler: hooks/relations/vsphere-integration/requires.py:81:remove_ready:vsphere

And once 2 of the 3 vault units went away, including the one holding the cert info for the related unit where this error occurred, here's what I saw instead:

tracer: main dispatch loop, 19 handlers queued
tracer: ++   queue handler hooks/relations/aws-integration/requires.py:106:remove_ready:aws
tracer: ++   queue handler hooks/relations/azure-integration/requires.py:114:remove_ready:azure
tracer: ++   queue handler hooks/relations/gcp-integration/requires.py:116:remove_ready:gcp
tracer: ++   queue handler hooks/relations/http/provides.py:15:broken:ingress-proxy
tracer: ++   queue handler hooks/relations/http/requires.py:10:changed:kube-api-endpoint
tracer: ++   queue handler hooks/relations/mount/requires.py:18:broken:nfs
tracer: ++   queue handler hooks/relations/openstack-integration/requires.py:84:remove_ready:openstack
tracer: ++   queue handler hooks/relations/tls-certificates/requires.py:79:joined:certificates
tracer: ++   queue handler hooks/relations/vsphere-integration/requires.py:81:remove_ready:vsphere
tracer: ++   queue handler reactive/kubernetes_worker.py:1198:notify_master_gpu_not_enabled
tracer: ++   queue handler reactive/kubernetes_worker.py:1207:request_kubelet_and_proxy_credentials
tracer: ++   queue handler reactive/kubernetes_worker.py:1219:catch_change_in_creds
tracer: ++   queue handler reactive/kubernetes_worker.py:1464:update_registry_location
tracer: ++   queue handler reactive/kubernetes_worker.py:279:safely_join_cohort
tracer: ++   queue handler reactive/kubernetes_worker.py:400:set_app_version
tracer: ++   queue handler reactive/kubernetes_worker.py:546:send_data
tracer: ++   queue handler reactive/tls_client.py:18:store_ca
tracer: ++   queue handler reactive/tls_client.py:44:store_server
tracer: ++   queue handler reactive/tls_client.py:71:store_client
2021-02-26 00:33:38 INFO juju-log certificates:27: Invoking reactive handler: reactive/tls_client.py:18:store_ca
2021-02-26 00:33:38 INFO juju-log certificates:27: Invoking reactive handler: reactive/tls_client.py:44:store_server
2021-02-26 00:33:38 ERROR juju-log certificates:27: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-worker-26/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-kubernetes-worker-26/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-kubernetes-worker-26/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-kubernetes-worker-26/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-kubernetes-worker-26/charm/reactive/tls_client.py", line 51, in store_server
    server_cert = server_cert + '\n' + chain
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Vultaire commented 3 years ago

While this may not work for everyone, in my particular case I was able to work around this issue and unblock unit removal by running the following SQL against the /.unit-state.db sqlite3 DB: delete from kv where key='reactive.states.certificates.server.cert.available';