skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0
14 stars 18 forks source link

TLS certificate reload during connection setup can block pending PN_PROACTOR_TIMEOUT processing #1572

Open kgiusti opened 1 month ago

kgiusti commented 1 month ago

This issue is causing this failure https://github.com/skupperproject/skupper-router/issues/1551

The "fix" for #1551 was to remove a heartbeat. That simply removed one possible case where delayed timer processing can cause timeout errors. The issue still exists.

Root cause: with the new tcp adaptor we've modified how TLS is configured. The desired behavior is to have changes to the related TLS certificate files be adopted when a new connection is established. This means that the management plane can update TLS cert files and the changes will take effect on the next new connection.

The router implementation of this is flawed. Currently there is no mechanism to alert the router that the cert files have been updated. Instead I've implemented the brain-dead approach of simply reloading the cert files unconditionally every time a new TLS connection is established.

This is bad: loading a cert file is a time consuming (thread blocking) operation. On my bare metal laptop I've observed this operation taking nearly a second to complete. Since the reload is taking place on the I/O thread (new connection event) that I/O thread can be blocked "outside" of proactor for a long time.

In the case of the test_80_balancing failure what is happening is the test is bringing up nearly 100 short-lived TCP client connections simultaneously. This is not unreasonable behavior as high connection rates will occur in the field. The problem is that this results in blocking the few (default 4) I/O threads in the cert reload call. This means there are no available threads for proactor to schedule other work - specifically expiration of the proactor timer.

IIUC proactor does not implement a thread scheduling priority, which means timer events won't take precedence over I/O events. Therefore during a heavily loaded I/O event it's possible the timer handler won't run in a timely manner.

TL:DR - under a heavy connection load TLS processing can delay servicing the system timer long enough to fail timed operations.