openshift / enhancements

Enhancements tracking repository for OKD
Apache License 2.0

CoreOS Encrypted Disks By Default doc is not clear enough for installer changes #98

Closed abhinavdahiya closed 3 years ago

abhinavdahiya commented 4 years ago

The installer team was looking at implementing the https://github.com/openshift/enhancements/blob/a3411e6f3458743ee2f84b013101d584fc272dc8/enhancements/automated-policy-based-disencryption.md#installer-support section, but the section is too brief on the details someone would need to implement the requested feature.

Here are some of the high-level questions that probably should be answered:

A) The installer can only provide configuration for nodes in the form of MachineConfig objects. It would therefore be very useful to have example MachineConfig objects that define each encryption setting:

i) default (disable: false, enforce: true)
ii) tpm2 based
iii) tang based, and multiple tang servers based
iv) custom user based

B) The spec allows tpm2, tang, etc. as sources for encryption setup, but there are no links to, or definitions of, the valid values for these options.

C) The spec says the default is disable: false, enforce: true. Isn't that a backward-incompatible change for install-config.yaml users, since users today expect to have no encryption?

D) Lack of clarity on the default for cloud platforms.

https://github.com/openshift/enhancements/blob/b5e77b5a99dc19de9acfa27fb0758ca42d74f3ee/enhancements/automated-policy-based-disencryption.md#policies

is also not clear on the defaults for clouds like AWS, Azure, and GCP.

sdodson commented 4 years ago

/assign @ashcrow

ashcrow commented 4 years ago

/assign @darkmuggle

darkmuggle commented 4 years ago

@abhinavdahiya I think that we need to step back.

Reading between the lines, I think the confusion comes from you looking for what hasn't yet been defined (cloud KMS) or for declarative APIs (when it's commandline + CLI).

Let's have a higher-bandwidth discussion to ensure that you have what you need.

cgwalters commented 4 years ago

I think probably we should remove the encryption bits from install config and have users supply MachineConfig instead.

(Which then pushes this discussion to what one can do via MCs)

cgwalters commented 4 years ago

Putting this here for lack of a better place: It turns out that at least some vSphere installs use the "metal" image: https://blog.openshift.com/openshift-4-2-vsphere-install-with-static-ips/

Do we:

I lean a little bit towards the former. It will be a bit rough for <= 4.2 users seeking to create new clusters using the same method, but we will ensure consistency.

cgwalters commented 4 years ago

@patrickdillon :arrow_up:

patrickdillon commented 4 years ago

@jcpowermac, for vSphere expertise

jcpowermac commented 4 years ago

A couple things concern me about enabling it on vSphere:

  1. The customer must have a KMS set up to use a vTPM. We have no CI infrastructure with a KMS (at least in Packet).
  2. The prerequisite text states: "The guest OS you use must be either Windows Server 2016 (64 bit) or Windows 10 (64 bit)." We need clarification from VMware if other OSes are supported.
  3. ESXi 6.7 and later only; 6.5 is supported until 2021.

With those requirements I think that default disabled is more appropriate.

cc: @dav1x

cgwalters commented 4 years ago

OK, we will change the encryption policy to require that no known virtualization is detected.

One tangentially related question...so I was trying to test out forcing on encryption in qemu with cosa run and this ignition config:

jq . ~/src/github/cgwalters/playground/ignition/rhcos-encrypt-tpm.json
{
  "ignition": {
    "version": "2.3.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/clevis.json",
        "filesystem": "root",
        "mode": 420,
        "contents": {
          "source": "data:,%7B%7D%0A"
        }
      }
    ]
  }
}

But no luck... I think there's some other bug in the code?

darkmuggle commented 4 years ago

OK, we will change the encryption policy to require that no known virtualization is detected.

One tangentially related question...so I was trying to test out forcing on encryption in qemu with cosa run and this ignition config:

jq . ~/src/github/cgwalters/playground/ignition/rhcos-encrypt-tpm.json
{
  "ignition": {
    "version": "2.3.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/clevis.json",
        "filesystem": "root",
        "mode": 420,
        "contents": {
          "source": "data:,%7B%7D%0A"
        }
      }
    ]
  }
}

But no luck... I think there's some other bug in the code?

The default was to use base64 encoding, which you fixed.

cgwalters commented 4 years ago

One other detail I've been thinking about here - should we disable encryption for the bootstrap node by default?

Or (in addition, and better) the bootstrap node should switch to running from RAM? We could drop a writable overlayfs on /etc and make /var a tmpfs.

darkmuggle commented 4 years ago

One other detail I've been thinking about here - should we disable encryption for the bootstrap node by default?

Or (in addition, and better) the bootstrap node should switch to running from RAM? We could drop a writable overlayfs on /etc and make /var a tmpfs.

I like this option better. I would assume the goal is to make the bootstrap node start a bit faster?

cgwalters commented 4 years ago

I would assume the goal is to make the bootstrap node start a bit faster?

Yes, but also clarify things better in the case where e.g. one wants to use a Tang server. In that case, it might be surprising to an admin that the bootstrap node was TPM bound (if on metal) or not encrypted at all (other cases).

Basically the goal here is to avoid needing to worry about encryption for the bootstrap.

cgwalters commented 4 years ago

Here's an example MachineConfig to skip encryption: https://github.com/cgwalters/playground/blob/master/machineconfigs/no-metal-encrypt/no-encrypt.yaml

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-no-encrypt-master
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,Cg==
        filesystem: root
        mode: 0644
        path: /etc/rhcos-no-clevis

(And you'll likely want the same for workers)
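
A minimal sketch of generating that same MachineConfig for both roles, so the master and worker copies cannot drift apart. The function name and the `50-no-encrypt-worker` object name are assumptions made for illustration (the worker name simply mirrors the master one above); the stamp path and payload are taken from the example.

```python
import base64
import json


def no_encrypt_machineconfig(role: str) -> dict:
    """Build a MachineConfig that drops the /etc/rhcos-no-clevis stamp for one role."""
    # A single newline, base64-encoded, reproduces the "Cg==" payload used above.
    payload = base64.b64encode(b"\n").decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            # Hypothetical naming scheme: mirrors "50-no-encrypt-master".
            "name": f"50-no-encrypt-{role}",
        },
        "spec": {
            "config": {
                "ignition": {"version": "2.2.0"},
                "storage": {
                    "files": [
                        {
                            "contents": {
                                "source": "data:text/plain;charset=utf-8;base64," + payload
                            },
                            "filesystem": "root",
                            "mode": 0o644,  # 420 in decimal
                            "path": "/etc/rhcos-no-clevis",
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    for role in ("master", "worker"):
        print(json.dumps(no_encrypt_machineconfig(role), indent=2))
```

The output is JSON rather than YAML only to keep the sketch free of third-party dependencies.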

cgwalters commented 4 years ago

Basically the goal here is to avoid needing to worry about encryption for the bootstrap.

OK yeah this turns out to be a real problem; we were testing out latest RHCOS in Packet.net and the host just had a TPM 1.2 device, and we bombed out. This should have been a lot more obvious to debug.

If we change the installer to avoid encrypting the bootstrap, we can at least detect the case where the bootstrap doesn't meet requirements and provide the user with a much clearer error. (This doesn't help if the bootstrap and target hosts aren't homogeneous, but one step at a time.)

cgwalters commented 4 years ago

And yeah, today to avoid the encryption on the bootstrap one would need to hand-edit the output of openshift-install create ignition-configs for the bootstrap and append the /etc/rhcos-no-clevis stamp.
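
That hand edit could be scripted roughly as follows. This is only a sketch: it assumes the bootstrap config is an Ignition spec 2.x JSON document (for example the `bootstrap.ign` file that command writes), and the helper names are made up for illustration. The stamp contents reuse the single-newline payload from the MachineConfig example earlier in this thread.

```python
import json

STAMP_PATH = "/etc/rhcos-no-clevis"


def add_no_clevis_stamp(ignition: dict) -> dict:
    """Append the no-encryption stamp file to an Ignition 2.x config dict."""
    files = ignition.setdefault("storage", {}).setdefault("files", [])
    # Skip the append if the stamp already exists, so re-running is harmless.
    if not any(entry.get("path") == STAMP_PATH for entry in files):
        files.append(
            {
                "filesystem": "root",
                "path": STAMP_PATH,
                "mode": 420,  # decimal form of 0644
                "contents": {"source": "data:text/plain;charset=utf-8;base64,Cg=="},
            }
        )
    return ignition


def patch_file(path: str) -> None:
    """Rewrite an Ignition JSON file in place with the stamp added."""
    with open(path) as fh:
        config = json.load(fh)
    patched = add_no_clevis_stamp(config)
    with open(path, "w") as fh:
        json.dump(patched, fh)
```

Calling `patch_file("bootstrap.ign")` after generating the ignition configs would apply the edit.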

cgwalters commented 4 years ago

Here's my proposal for a new revision to the encryption enhancement's install-config.yaml proposal:

osEncryption: platform|tpm2|disabled

platform is the default, which today means "on bare metal, require a tpm2".

One reason to support this in the installer is (as noted above) to allow ergonomic control over the bootstrap node as well, particularly to disable encryption there.

This also punts Tang out into a separate case - I think it's actually OK if people wanting that provide MachineConfig objects directly. But I do want to make it very easy and obvious to do tpm2 binding.

Thoughts?
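
One possible reading of this proposal, sketched as a lookup from the `osEncryption` value to the stamp file the installer would need to lay down. The function name is hypothetical and the mapping is an assumption pieced together from the two stamp files used elsewhere in this thread (`/etc/clevis.json` containing `{}` to force TPM2, `/etc/rhcos-no-clevis` to skip encryption); nothing here is an agreed installer design.

```python
from typing import Optional, Tuple


def stamp_for(os_encryption: str) -> Optional[Tuple[str, str]]:
    """Return (path, contents) of the stamp file for a setting, or None if no file is needed."""
    if os_encryption == "platform":
        # Default: leave the decision to the OS policy, so write nothing.
        return None
    if os_encryption == "tpm2":
        # An empty JSON object in clevis.json forces the TPM2 requirement.
        return ("/etc/clevis.json", "{}\n")
    if os_encryption == "disabled":
        # The presence of this stamp skips encryption.
        return ("/etc/rhcos-no-clevis", "\n")
    raise ValueError(f"unknown osEncryption value: {os_encryption!r}")
```

Under this reading the bootstrap node could be given `disabled` regardless of what the cluster nodes use.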

cgwalters commented 4 years ago

Just linking for xref: https://github.com/coreos/ignition/issues/585 would be useful for encryption in particular.

cgwalters commented 4 years ago

Forcing on TPM2

Here's an example MachineConfig to force on a TPM2 requirement (likely only useful if you're e.g. using libvirt to test out a bare metal install and are using a software test TPM):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-force-master-tpm2
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;base64,e30K
        filesystem: root
        mode: 0644
        path: /etc/clevis.json
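
For anyone checking what these `source` values actually contain, here is a small sketch that decodes the data URLs used in this thread. The helper name is an assumption; it handles only the two forms seen here (base64 and percent-encoded) and is not a full RFC 2397 parser.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(source: str) -> bytes:
    """Decode the payload of a data: URL in base64 or percent-encoded form."""
    if not source.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the media type and encoding flag.
    header, _, payload = source[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


# Both encodings of the TPM2 clevis.json carry the same bytes: "{}" plus a newline.
assert decode_data_url("data:text/plain;base64,e30K") == b"{}\n"
assert decode_data_url("data:,%7B%7D%0A") == b"{}\n"
# The no-encrypt stamp is a single newline.
assert decode_data_url("data:text/plain;charset=utf-8;base64,Cg==") == b"\n"
```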

Using NBDE/Tang

To use network bound disk encryption/Tang - create a clevis.json that looks like this:

{
 "url": "https://tang.example.com",
 "thp": "<insert signing key>"
}

And provide it as a MachineConfig as above.

You must also enable networking on the kernel commandline. If you're using the default of DHCP, in that same MachineConfig, set e.g.:

kernelArguments:
  - rd.neednet
  - ip=dhcp

More information about the dracut networking options can be found in man dracut.cmdline.
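
Putting the two pieces together, a sketch of building the whole Tang MachineConfig. The function name and the `50-tang-<role>` object name are hypothetical, and placing `kernelArguments` directly under `spec` is an assumption, since the snippet above does not show where it nests. The URL and thumbprint are placeholders.

```python
import base64
import json


def tang_machineconfig(role: str, url: str, thumbprint: str) -> dict:
    """Build a MachineConfig carrying a Tang clevis.json plus dracut networking arguments."""
    clevis = json.dumps({"url": url, "thp": thumbprint}) + "\n"
    source = "data:text/plain;base64," + base64.b64encode(clevis.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            "name": f"50-tang-{role}",  # hypothetical name
        },
        "spec": {
            # Assumed location: sibling of "config" under "spec".
            "kernelArguments": ["rd.neednet", "ip=dhcp"],
            "config": {
                "ignition": {"version": "2.2.0"},
                "storage": {
                    "files": [
                        {
                            "contents": {"source": source},
                            "filesystem": "root",
                            "mode": 0o644,
                            "path": "/etc/clevis.json",
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    mc = tang_machineconfig("master", "https://tang.example.com", "<insert signing key>")
    print(json.dumps(mc, indent=2))
```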

abhinavdahiya commented 4 years ago

So we have examples for

a) no encryption: https://github.com/openshift/enhancements/issues/98#issuecomment-557319361

is the file name required to be rhcos-no-clevis ??

b) tpm2 https://github.com/openshift/enhancements/issues/98#issuecomment-558808162 Forcing on TPM2

c) tang https://github.com/openshift/enhancements/issues/98#issuecomment-558808162 Using NBDE/Tang

https://github.com/openshift/enhancements/blob/a3411e6f3458743ee2f84b013101d584fc272dc8/enhancements/automated-policy-based-disencryption.md#installer-support talks about multiple tang URLs; how should that be rendered as a MachineConfig?

Also do we have information about expectations regarding thp's content ?

cgwalters commented 4 years ago

is the file name required to be rhcos-no-clevis ??

Yes. (I debated having /etc/clevis.json as an empty file instead, but that felt strange since it wouldn't be valid JSON)

talks about multiple tang URLs; how should that be rendered as a MachineConfig?

See the RHEL doc:

Clients should be configured with the ability to bind to multiple Tang servers. In this setup, each Tang server has its own keys and clients are able to decrypt by contacting a subset of these servers. Clevis already supports this workflow through its sss plug-in.

I think my proposal for now is per above - punt on trying to include Tang in installconfig; users who want it can provide MachineConfig directly. We'll circle back to how/whether we support multiple Tang.

(Actually the way this would work is via sss - but yeah, we need to write an example of this)

Also do we have information about expectations regarding thp's content ?

Per the doc it's: " thp: The thumbprint of a trusted signing key"

Note all of the docs around setting up a Tang server apply equally well to traditional RHEL and OpenShift/RHCOS.
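
Until a proper example exists, here is a sketch of the configuration shape that upstream Clevis uses for its sss pin with several Tang servers: `t` is the threshold and `pins.tang` lists the servers. The function name and server values are placeholders, and whether RHCOS accepts this document as /etc/clevis.json is an assumption that nothing in this thread confirms.

```python
import json
from typing import List, Tuple


def sss_tang_config(servers: List[Tuple[str, str]], threshold: int) -> str:
    """Render a Clevis sss pin config for (url, thumbprint) pairs with the given threshold."""
    if not 1 <= threshold <= len(servers):
        raise ValueError("threshold must be between 1 and the number of Tang servers")
    config = {
        "t": threshold,
        "pins": {"tang": [{"url": url, "thp": thp} for url, thp in servers]},
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(
        sss_tang_config(
            [
                ("https://tang1.example.com", "<thumbprint 1>"),
                ("https://tang2.example.com", "<thumbprint 2>"),
            ],
            threshold=1,
        )
    )
```

With a threshold of 1 either server alone can unlock the disk; raising it requires that many servers to be reachable at boot.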

cgwalters commented 4 years ago

That all said...one thing to consider at least is whether openshift-install should support a high level opinionated workflow that e.g. sets up a "recovery/vault server" separate from the cluster (and supports re-using an existing recovery server). This way we could easily automate doing Tang using either the recovery server or the cluster itself, and not require admins to learn about Tang.

(You could imagine that a recovery server could extend to backing up the kubeadmin credentials, etc.)

jstuever commented 4 years ago

There were a few different directions discussed here. Can we get this all solidified and added to the enhancement doc for clarity?

abhinavdahiya commented 4 years ago

There were a few different directions discussed here. Can we get this all solidified and added to the enhancement doc for clarity?

@jstuever I'm not sure which topic you are talking about. Sadly there are 3 different ones: examples, install-config fields, handling vaults

jstuever commented 4 years ago

All of the above, a single source of truth would be considerably helpful as opposed to continually digesting these comments.

abhinavdahiya commented 4 years ago

All of the above, a single source of truth would be considerably helpful as opposed to continually digesting these comments.

a) i think the enhancement should probably have the examples and the examples are pretty clear https://github.com/openshift/enhancements/issues/98#issuecomment-558812971

b) As for install-config, those probably don't belong in the enhancement. So i think we should move that to installer github issue or PR directly.

c) For the handling of vault, let's just skip that as that's not required as of now.

@jstuever if you can open a PR updating the enhancement to cover a) and b), I can help review and get it merged as the source of truth

jstuever commented 4 years ago

Draft, open to feedback.... https://github.com/openshift/installer/pull/2809

steffencircle commented 4 years ago

I came here as I was researching options for doing an SSS-based "clevis luks bind" setup for our OpenShift 4.4 deployments.

We have a requirement to bind against multiple Tang servers for high-availability reasons, with a key threshold of 2.

Is this possible at all at the moment (we are actively using it on our RHEL deployments)? If so, can you please provide an example clevis.json?

Thx

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/enhancements/issues/98#issuecomment-749072850):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.