openziti / helm-charts

Various Helm charts for openziti-test-kitchen projects
https://openziti.io/helm-charts/
Apache License 2.0

ziti-router: introduced health checks fail #226

Open marvkis opened 3 months ago

marvkis commented 3 months ago

Hi,

I tried to upgrade to the latest Helm charts. My router deployment never became ready and kept restarting. After some investigation I noticed that the health checks introduced in commit https://github.com/openziti/helm-charts/commit/3c2afc53c49f8a3c7d24ceb6d199c88d2d9474aa never succeeded. After manually removing them from the deployment, it became ready and everything worked as expected.
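
For context, the probes added there exec ziti agent stats inside the container, roughly along these lines (a sketch, not the exact chart template; the timing values are illustrative):

  livenessProbe:
    exec:
      command:
        - ziti
        - agent
        - stats
    initialDelaySeconds: 10
    periodSeconds: 10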

I tried to execute them manually in the pod and it seems ziti agent stats fails to find the ziti router process:

[ziggy@core-router-858f76f8fb-fvtqm ~]$ ziti agent stats
Error: no processes found matching filter, use 'ziti agent list' to list candidates
Usage:
  ziti agent stats [flags]

Flags:
  -a, --app-alias string      Alias of host application to talk to (specified in host application)
  -i, --app-id string         Id of host application to talk to (like controller or router id)
  -t, --app-type string       Type of host application to talk to (like controller or router)
  -h, --help                  help for stats
  -p, --pid uint32            Process ID of host application to talk to
  -n, --process-name string   Process name of host application to talk to
      --tcp-addr string       Type of host application to talk to (like controller or router)
      --timeout duration      Operation timeout (default 5s)

no processes found matching filter, use 'ziti agent list' to list candidates
[ziggy@core-router-858f76f8fb-fvtqm ~]$ ziti agent list
╭─────┬────────────┬────────┬─────────────┬──────────┬─────────────┬───────────╮
│ PID │ EXECUTABLE │ APP ID │ UNIX SOCKET │ APP TYPE │ APP VERSION │ APP ALIAS │
├─────┼────────────┼────────┼─────────────┼──────────┼─────────────┼───────────┤
╰─────┴────────────┴────────┴─────────────┴──────────┴─────────────┴───────────╯

Hint: I'm using security contexts as an additional security layer on my containers:

  podSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    privileged: false
    readOnlyRootFilesystem: true
    runAsGroup: 2171
    runAsNonRoot: true
    runAsUser: 2171
    seccompProfile:
      type: RuntimeDefault
  securityContext:
    fsGroup: 2171
    runAsNonRoot: true

Might this prevent ziti agent stats from finding its process?

bye, Chris

qrkourier commented 3 months ago

Yes, I think the read-only root filesystem probably prevented the agent from creating the IPC socket used by the probe. It would be more elegant to use the controller and router HTTP health checks, but a little templating work is needed to place those on a separate web listener binding. That way they're published to the kubelet but not on all IP addresses, and in particular not alongside the client API, because they could represent a DoS vector.
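
For reference, a sketch of the kind of listener that implies, assuming the router's standard web (xweb) config format; the listener name, bind address, and port are illustrative, and scoping it so it's reachable by the kubelet but not published more widely is exactly the templating work mentioned above:

  web:
    - name: health-check
      bindPoints:
        - interface: 0.0.0.0:8081
          address: 0.0.0.0:8081
      apis:
        - binding: health-checks

A kubelet probe could then be an httpGet against that port instead of an exec into the agent's IPC socket.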

marvkis commented 3 months ago

What is the path to the IPC socket? We could mount a tmpfs for the IPC socket as an interim solution before migrating to the HTTP checks.

qrkourier commented 3 months ago

That might work. I've only observed the IPC socket in paths like this: /tmp/gops-agent.{pid}.sock.

marvkis commented 3 months ago

That was it. I just added

additionalVolumes: 
- name: tmp
  volumeType: emptyDir
  mountPath: /tmp

and the health checks started working. What do you think? Should we add an emptyDir at /tmp by default, or keep it this way? Either way I would extend it to cap the emptyDir with sizeLimit: 1Mi. Call me paranoid, but in case something 'unexpected' happens and someone finds a way to write data into /tmp, the kubelet would evict the pod once the limit is exceeded and we'd be alerted that something is going on here...
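
Something like this, assuming the chart's additionalVolumes template passes a sizeLimit through to the emptyDir spec (not verified against the current template):

additionalVolumes:
- name: tmp
  volumeType: emptyDir
  sizeLimit: 1Mi
  mountPath: /tmp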

qrkourier commented 3 months ago

It seems reasonable for us chart maintainers to define a non-root emptyDir volume for temp by default, or to find a way to influence the socket path and ensure a writable volume is available there. I'll support using /tmp until we have a reason not to.
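
For example, the deployment template could always mount a small emptyDir at /tmp, roughly like this (a sketch of one option; the volume name, container name, and size limit are placeholders):

  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 1Mi
  containers:
    - name: ziti-router
      volumeMounts:
        - name: tmp
          mountPath: /tmp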