openziti / ziti

The parent project for OpenZiti. Here you will find the executables for a fully zero trust, application embedded, programmable network @OpenZiti
https://openziti.io
Apache License 2.0
2.69k stars 153 forks source link

ziti router run error with new release 0.27.3 #993

Closed qrkourier closed 1 year ago

qrkourier commented 1 year ago

I started seeing an error from ziti router run with the upgrade to 0.27.3. With the same config I can roll back the binary to 0.27.2 as a workaround.

It appears that in the new release ziti router run has a new behavior that attempts to open a file named endpoints in the same directory as the router's config YAML file. I couldn't find any notes about this in the changelog, and I didn't find any interesting occurrences of "open" or "endpoints" in ziti/ziti/router.go.

[   0.413]   DEBUG ziti/ziti/util.LogReleaseVersionCheck: this build of ziti-router is the latest release v0.27.3
[   0.415]    INFO ziti/ziti/router.run: {configFile=[/etc/ziti/ziti-router.yaml] go-version=[go1.19.5] os=[linux] arch=[amd64] version=[v0.27.3] build-date=[2023-02-03T19:53:13Z] routerId=[r-iy3in.bE] revision=[dd3fee35c889]} starting ziti-router
[   0.415]    INFO fabric/router/forwarder.(*Faulter).run: started
[   0.415]    INFO fabric/router/forwarder.(*Scanner).run: started
[   0.416]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {maxWorkers=[32] minWorkers=[0] poolType=[pool.link.dialer] idleTime=[30s] maxQueueSize=[1000]} starting goroutine pool
[   0.416]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {idleTime=[30s] maxQueueSize=[1000] maxWorkers=[128] poolType=[pool.route.handler] minWorkers=[0]} starting goroutine pool
[   0.416] WARNING edge/router/internal/edgerouter.(*Config).LoadConfigFromMap: Invalid heartbeat interval [0] (min: 60, max: 10), setting to default [60]
[   0.416]   DEBUG edge/router/internal/apiproxy.Start: API Proxy disabled
[   0.416]   FATAL ziti/ziti/router.run: {error=[open /etc/ziti/endpoints: permission denied]} error starting

router config

v: 3
identity:
  cert: ${ZITI_ROUTER_IDENTITY_DIR}/client.crt
  server_cert: ${ZITI_ROUTER_IDENTITY_DIR}/tls.crt
  key: ${ZITI_ROUTER_IDENTITY_DIR}/tls.key
  ca: ${ZITI_ROUTER_IDENTITY_DIR}/ca.crt
ctrl:
  endpoint: tls:ctrl1.k3s.mira.qrk.us:30262
link:
  dialers:
    - binding: transport
listeners:
  - binding: edge
    address: tls:0.0.0.0:1290
    options:
      advertise: edge1.k3s.mira.qrk.us:1290
      connectTimeoutMs: 1000
      getSessionTimeout: 60
  - binding: tunnel
    options:
      mode: host
edge:
  csr:
    sans:
      dns:
        - localhost
        - edge1.k3s.mira.qrk.us
        - edge1.k3s.mira.qrk.us
      ip:
        - 127.0.0.1
forwarder:
  latencyProbeInterval: 10
  xgressDialQueueLength: 1000
  xgressDialWorkerCount: 128
  linkDialQueueLength: 1000
  linkDialWorkerCount: 32
qrkourier commented 1 year ago

The value of env var ZITI_ROUTER_IDENTITY_DIR is /etc/ziti/identity.

qrkourier commented 1 year ago

The full command line was ziti router run /etc/ziti/ziti-router.yaml.

qrkourier commented 1 year ago

Providing a writeable directory seems to have appeased ziti router run.

bash-4.4$ ls -ld /etc/ziti/endpoints/
drwxrwxrwx 2 root root 4096 Feb  3 18:08 /etc/ziti/endpoints/
plorenz commented 1 year ago

This file is how routers will keep track changing nodes in an HA cluster. We probably don't need to create it in a non-HA mode. Let me look into it

plorenz commented 1 year ago

Is the container read-only? The config file itself can be updated if the controller is advertising a new address even in non-HA mode

qrkourier commented 1 year ago

Right, the config and identities are mounted read-only from Kubernetes resources. This allows their data to be declared by Helm. I can add a storage dependency, but it would be good to characterize the storage in terms of capacity, durability, IO. It sounds like {config dir}/endpoints is a single, regular file that stores transient state.

As for the writeable router config file, I'm not sure how to handle that. It doesn't fit into the MO of declarative config mgmt. We can brainstorm on it if you like. There's a Kubernetes way but maybe it needs to be a configurable behavior. Will the router FATAL error if it can't clobber its own config file?

qrkourier commented 1 year ago

I would like to understand the purpose of the address update event you mentioned. This allows the router control plane to declare a new router advertisement, or is this to allow the router control plane to communicate a change of its endpoint address to the router? I strongly suspect it is the latter and is intended to facilitate a change of the global domain name of the router control plane endpoint.

Is this event that needs to be persisted as a router configuration change representative of a broader category of events, or is it more of an anomaly? If it's a one-off, then it might make more sense for the Kubernetes users to think about this problem in terms of tightly-coupled Ziti controller+Ziti router configs instead of handling Ziti control plane events. An example of such a coupled configuration is Helm umbrella chart that would deploy both controller and router at the same time with the same values. In that scenario, the only way to change the controller's advertised address is to change the value that is consumed by both controller and router, and allow Kubernetes to propagate the change.

Here's a sketch of a design to minimally change the router while adding a Kubernetes Controller process to handle the router's events that need to be persisted as configuration changes.

ziti-kontroller (2)

sequenceDiagram
    loop Ziti Router ConfigMap Controller control loop
        router->>router.yml: read file at startup
        ziti ctrl plane->>k8s-ctrl: event: new Ziti controller control plane endpoint
        Note right of ziti ctrl plane: ignored by the read-only router
        k8s-ctrl->>router.yml: update configmap
    end

The ConfigMap Controller could also delete the router pod to begin a new process life cycle to load the updated config, but a frequent reloading or signaling behavior would be less intrusive.

plorenz commented 1 year ago

There's a fix for openziti/fabric#601 which will prevent the endpoints file from being written as long as there are no changes. However, for proper functioning in an HA environment where controllers can come and go, the image will need to allow the endpoints file be created and writable.

qrkourier commented 1 year ago

@plorenz That's still doable, and I'd still like to specify the runtime storage needs in terms of durability and capacity, at least. Is a small ephemeral volume sufficient or will the router expect to find state in {config dir}/endpoints file in a subsequent process life cycle, and therefore require persistence across restarts?

plorenz commented 1 year ago

@plorenz That's still doable, and I'd still like to specify the runtime storage needs in terms of durability and capacity, at least. Is a small ephemeral volume sufficient or will the router expect to find state in {config dir}/endpoints file in a subsequent process life cycle, and therefore require persistence across restarts?

Ideally it would be persistent. The file is only used to persist the information across restarts.

qrkourier commented 1 year ago

What is a reasonable capacity for the storage volume where the endpoints file is written? I've written the Kubernetes deployments to default to a 50MiB persistent volume.

plorenz commented 1 year ago

What is a reasonable capacity for the storage volume where the endpoints file is written? I've written the Kubernetes deployments to default to a 50MiB persistent volume.

Endpoints file should generally be tiny (under 1k in size). It's only listing controller endpoints. You'd need a crazy number of controllers (or very long dns names) to make that file large.