spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
1.8k stars 476 forks source link

Spire-Agent is failing to rotate its SVID due to Missing KeyManager directory #5488

Open Krannthi opened 2 months ago

Krannthi commented 2 months ago

We noticed that due to missing KeyManager directory on the host, spire-agent is failing to rotate its keys.

To avoid this we could add a check in the disk keymanager 'writeEntries' method: https://github.com/spiffe/spire/blob/main/pkg/agent/plugin/keymanager/disk/disk.go#L108

amartinezfayo commented 2 months ago

We can have the check in the Configure function, so we can fail the agent startup if the directory does not exist. This is really a pre-requisite for the plugin to work.

Krannthi commented 1 month ago

That's true but we saw the issue at runtime when the directory path gets deleted for some reason. Our agents were healthy but couldn't rotate key due to missing directory.

Krannthi commented 1 month ago

@amartinezfayo Do we want to address both checks in this issue.

zmt commented 3 weeks ago

@amartinezfayo Do we want to address both checks in this issue.

  • Check directory exists in plugin Configure function, do we wanna fail in this case or just create ? I think we should attempt to create dir if missing and fail if unable to create the configured data dir during Configure. Reasoning about Validate API complicates it a bit:
  • would we want the same stat check and create attempt or fail on Validate?
  • might be unexpected that Validate could have a side effect on the file system
  • might be just as unexpected for Validate to succeed, but Configure fail b/c lack of perms to create and/or write to data dir
  • would we want to fail Validate if the dir is missing?
  • Create directory if doesn't exist on key writes
  • what if the dir exists, but perms changed or filesystem is read-only at write?
  • what if we can't create the dir?

I think the answer to these 2 questions is log error (also emit metric?) and crash agent. I would argue that is better than quietly failing to rotate keys, but it is a different trade-off. Maybe we do want to keep failure to rotate keys behavior, but make more noise about it?

amartinezfayo commented 3 weeks ago

I personally don't think that the key manager should be creating directories. If the key manager is unable to rotate the key (due to an error writing the key to disk or other reason), I think that we should keep the current logic of retrying with backoff, and not just crash, because it could be recoverable. I agree that emitting a metric will help to better observe what's happening. We can add that.