moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

FIPS support design #2544

Closed cyli closed 6 years ago

cyli commented 6 years ago

https://github.com/docker/swarmkit/pull/2535 starts to add fernet encryption to support a FIPS-compliant raft encryption algorithm. https://github.com/docker/swarmkit/pull/2246 has already added the FIPS-compliant PKCS8 key encryption format. Keys will be stored in the new format, which stays backwards compatible with older versions of swarm unless FIPS mode is enabled via an environment variable, in which case all keys MUST use the new encryption format.

However, while working on this, I realized that a mixed cluster of FIPS and non-FIPS nodes may not make sense if what you want is FIPS compliance. It may therefore make sense to specify that an entire cluster requires FIPS compliance, and this is the design document for how that will work. The assumption is that the docker engine can be run in FIPS mode or not, based on an env variable or some other toggle. We want to prevent someone from accidentally restarting a cluster node in non-FIPS mode if the cluster requires FIPS compliance.

The proposal is to:

  1. Add a boolean to the Cluster object, FIPS, that specifies whether the cluster is FIPS enabled or not. This is set when the cluster is first created and can't be changed afterward: it definitely does not make sense to go from non-FIPS to FIPS, since you may have old raft data lying around that is not FIPS compliant. For simplicity, I suggest we don't allow migration from FIPS->non-FIPS either, although loosening compliance requirements is easier than tightening them.

  2. Add a boolean to the NodeDescription object, FIPS, that specifies whether a given node is FIPS enabled or not. Agents will have to self-report their FIPSness - we have no way of enforcing that they are FIPS compliant. But the dispatcher will terminate connections from an agent that is not FIPS compliant.

3. When a manager starts up, as soon as it loads the cluster object, it checks to see if the cluster requires FIPS and ensures that it is running in FIPS mode. If not, the manager will refuse to complete starting up.

4. The swarm token version is bumped to indicate the FIPSness of the cluster. If a node is not running in FIPS mode, and it is told to join a FIPS cluster, it will refuse to join. So the swarm token will now look like SWMTKN-2-<0/1 FIPS>-<root digest>-<secret>.

  1. Add a field to the TLS certificate (maybe an invalid DNS SAN? Maybe overload one of the subject fields?) to indicate FIPSness. When a node starts up, it matches this field against its configured FIPSness, and errors if there's a mismatch.

  2. Add an extra check in all TLS connections for FIPSness
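The token format proposed in the list above (SWMTKN-2-<0/1 FIPS>-<root digest>-<secret>) can be sketched roughly as follows. This is only an illustration, not swarmkit code: `joinToken`, `parseJoinToken` and `checkJoin` are hypothetical names, and the sketch assumes the digest and secret fields contain no `-` characters. A later comment in this thread suggests the token may not need to change at all, so treat this purely as a picture of the format as proposed here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// joinToken is a hypothetical parsed form of the proposed v2 token:
// SWMTKN-2-<0/1 FIPS>-<root digest>-<secret>.
type joinToken struct {
	FIPS       bool
	RootDigest string
	Secret     string
}

// parseJoinToken splits a proposed v2 token into its fields.
// Assumption: the digest and secret never contain '-'.
func parseJoinToken(tok string) (joinToken, error) {
	parts := strings.Split(tok, "-")
	if len(parts) != 5 || parts[0] != "SWMTKN" || parts[1] != "2" {
		return joinToken{}, errors.New("not a v2 swarm join token")
	}
	var fips bool
	switch parts[2] {
	case "0":
		// non-FIPS cluster
	case "1":
		fips = true
	default:
		return joinToken{}, fmt.Errorf("invalid FIPS flag %q", parts[2])
	}
	return joinToken{FIPS: fips, RootDigest: parts[3], Secret: parts[4]}, nil
}

// checkJoin refuses the join when the token is for a FIPS cluster
// but the local node is not running in FIPS mode.
func checkJoin(tok joinToken, nodeFIPS bool) error {
	if tok.FIPS && !nodeFIPS {
		return errors.New("cluster requires FIPS mode; refusing to join")
	}
	return nil
}

func main() {
	tok, err := parseJoinToken("SWMTKN-2-1-exampledigest-examplesecret")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("non-FIPS node:", checkJoin(tok, false))
	fmt.Println("FIPS node:", checkJoin(tok, true))
}
```

Putting the flag in the token lets a non-FIPS node refuse before it contacts any manager, which is the behaviour the list describes.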

The massive branch that does all of this is at https://github.com/cyli/swarmkit/tree/fernet-encryption-inprocess.

  1. I also plan on changing the existing FIPS code to not be based on an env var, but to take a config value that is propagated through the necessary components of node. This makes it easier to test mixed clusters. This is a large-ish change, though: https://github.com/docker/swarmkit/pull/2562

Question: FIPS seems to be one possible axis of compliance. Are we going to want to enforce compliance along some other axis, and prevent nodes from being able to join the cluster? If so, would it make sense to have a compliance object, which includes FIPS, as opposed to just a bool?
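To make the question concrete, here is a minimal sketch of what a compliance object might look like compared with a bare bool. The `Compliance` type and its `Unmet` method are hypothetical and exist only for illustration; FIPS is the only axis taken from this proposal.

```go
package main

import "fmt"

// Compliance is a hypothetical set of cluster-wide requirements.
// Only the FIPS field comes from the proposal; further axes would be
// added as extra fields.
type Compliance struct {
	FIPS bool
}

// Unmet lists the requirements of the cluster (the receiver) that the
// given node does not satisfy. An empty result means the node may join.
func (required Compliance) Unmet(node Compliance) []string {
	var unmet []string
	if required.FIPS && !node.FIPS {
		unmet = append(unmet, "FIPS")
	}
	return unmet
}

func main() {
	cluster := Compliance{FIPS: true}
	fmt.Println(cluster.Unmet(Compliance{FIPS: false}))
	fmt.Println(cluster.Unmet(Compliance{FIPS: true}))
}
```

With this shape, adding another axis means adding a field and one more check in `Unmet`, rather than threading a second bool through every component.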

Thoughts? @docker/core-swarmkit-maintainers @docker/security-team @stevvooe

Additional TODOS:

cyli commented 6 years ago

Ok, I'm dumb. For #3/4, we can probably just add an extra invalid SAN domain to the TLS cert, and then terminate the connection if the other side's cert doesn't have the same SAN field. That would be better than waiting for the agent to connect, and it would make sure the agent and manager terminate as soon as they load the TLS cert if they're not running in FIPS mode. I'll have to look into where to add it though.

This means that the swarm token probably doesn't have to change, either.
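A rough sketch of the SAN-based idea, under stated assumptions: the marker name `fips.swarm.invalid` is made up for illustration (the comment only says "an extra invalid SAN domain"), and `certWithSANs`, `hasFIPSMarker`, `checkOwnCert` and `checkPeerCert` are hypothetical helpers, not swarmkit functions. Only the standard library `crypto/x509` `Certificate.DNSNames` field is used.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
)

// fipsMarkerSAN is a hypothetical marker name; the ".invalid" TLD is
// reserved, so it can never collide with a real host name.
const fipsMarkerSAN = "fips.swarm.invalid"

// certWithSANs builds a bare certificate value for illustration only;
// real certificates would be issued by the swarm CA.
func certWithSANs(names ...string) *x509.Certificate {
	return &x509.Certificate{DNSNames: names}
}

// hasFIPSMarker reports whether the certificate carries the marker SAN.
func hasFIPSMarker(cert *x509.Certificate) bool {
	for _, name := range cert.DNSNames {
		if name == fipsMarkerSAN {
			return true
		}
	}
	return false
}

// checkOwnCert runs at startup: the node errors out if its certificate
// and its configured FIPS mode disagree.
func checkOwnCert(cert *x509.Certificate, nodeFIPS bool) error {
	if marked := hasFIPSMarker(cert); marked != nodeFIPS {
		return fmt.Errorf("certificate FIPS marker is %t but node FIPS mode is %t", marked, nodeFIPS)
	}
	return nil
}

// checkPeerCert runs per connection: the connection is terminated if
// the peer's certificate does not carry the same marker as ours.
func checkPeerCert(own, peer *x509.Certificate) error {
	if hasFIPSMarker(own) != hasFIPSMarker(peer) {
		return errors.New("peer certificate FIPS marker does not match; terminating connection")
	}
	return nil
}

func main() {
	fipsCert := certWithSANs("node-1", fipsMarkerSAN)
	plainCert := certWithSANs("node-2")
	fmt.Println(checkOwnCert(fipsCert, false))
	fmt.Println(checkPeerCert(fipsCert, plainCert))
	fmt.Println(checkPeerCert(fipsCert, fipsCert))
}
```

In a real node the per-connection check would probably hang off the TLS verification path, but where exactly to add it is the open question noted above.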

cyli commented 6 years ago

Note to self: also handle the case where a user gives us a PKCS1 key for root rotation - we should immediately convert it to PKCS8 if FIPS is enabled.
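A minimal sketch of that conversion using only the Go standard library. It covers just the unencrypted container format: `toPKCS8`, `exampleLegacyPEM` and `pemType` are hypothetical helpers, passphrase-protected keys are out of scope, and treating SEC1 "EC PRIVATE KEY" blocks as legacy input alongside PKCS1 "RSA PRIVATE KEY" blocks is an assumption on my part.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// toPKCS8 re-encodes an unencrypted legacy private key PEM as PKCS8.
// Input that is already PKCS8 is returned unchanged.
func toPKCS8(pemBytes []byte) ([]byte, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	var key interface{}
	var err error
	switch block.Type {
	case "PRIVATE KEY": // already PKCS8
		return pemBytes, nil
	case "RSA PRIVATE KEY": // PKCS1
		key, err = x509.ParsePKCS1PrivateKey(block.Bytes)
	case "EC PRIVATE KEY": // SEC1 (assumed to count as legacy here)
		key, err = x509.ParseECPrivateKey(block.Bytes)
	default:
		return nil, fmt.Errorf("unsupported PEM type %q", block.Type)
	}
	if err != nil {
		return nil, err
	}
	der, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der}), nil
}

// exampleLegacyPEM generates a throwaway EC key in the legacy format.
func exampleLegacyPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	der, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: der}), nil
}

// pemType returns the type of the first PEM block, or "" if none.
func pemType(b []byte) string {
	block, _ := pem.Decode(b)
	if block == nil {
		return ""
	}
	return block.Type
}

func main() {
	legacy, err := exampleLegacyPEM()
	if err != nil {
		fmt.Println("keygen error:", err)
		return
	}
	converted, err := toPKCS8(legacy)
	if err != nil {
		fmt.Println("convert error:", err)
		return
	}
	fmt.Println(pemType(legacy), "->", pemType(converted))
}
```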

cyli commented 6 years ago

This issue has been resolved by: