knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0
5.57k stars 1.16k forks source link

Add DataPlan-Trust implementation for Activator #13968

Closed davidhadas closed 11 months ago

davidhadas commented 1 year ago

See serving #11906 See https://docs.google.com/document/d/1XE7UzgQlVVtAb7ULSqOyKCaIHtm8zMF35ainp1JmwyY/

This issue focuses on adding DataPlan-Trust support for Activator and Queue including options for: dataplane-trust = "minimal" (common names for all namespaces) dataplane-trust = "enabled" (per namespace) dataplane-trust = "mutual" mTLS

It includes the necessary changes needed for:

  1. QP Server will present the DataPlane User Certificate with names "data-plane.knative.dev" and "kn-user-\<namespace>"
  2. Activator Client will always present the data plane certificate with the name "kn-routing-0"
  3. If dataplane-trust = "minimal", Activator Client will verify server certificate has the name "data-plane.knative.dev" otherwise, Activator Client will verify server certificate has the name "kn-user-\<namespace>"
  4. Activator Server will present the DataPlane Routing Certificate with the name "kn-routing-0"
  5. If dataplane-trust = "mutual", Activator Server will verify the Client certificate having the name "kn-routing-0" Until such time that all ingresses use the new DataPlane Routing certificate, we should also accept "data-plane.knative.dev"
dprotaso commented 1 year ago

As an operator why would I choose minimal over enabled?

davidhadas commented 1 year ago

As an operator why would I choose minimal over enabled?

Minimal equals what we have had as "internal-encryption" (which we deprecated). This is the first TLS support we will be able to release. Once we complete the work for Enabled (all ingress types etc.), enabled will be recommended to replace it.

nak3 commented 1 year ago

You will add dataplane-trust = "identity" as well, won't you?

davidhadas commented 1 year ago

You will add dataplane-trust = "identity" as well, won't you?

The code also seems to support what we need for a future dataplane-trust = "identity" since identity adds on top of mutual trust. The main difference between the two appears at the ingress. However, the focus now is on TLS and verifying identities at each hop.

The suggested wip #13969 treats:

So identity and mutual are the same at this point when we make changes to the activator and queue

dprotaso commented 1 year ago

Minimal equals what we have had as "internal-encryption" (which we deprecated).

Given internal encryption is 'alpha' I wonder if we skip supporting Minimal in our redux. It might be better to just remove it now.

@nak3 do you have input on this feature's usage

nak3 commented 1 year ago

Yeah, I think it is alright to replace current internal-encryption with enabled if the new configuration will be implemented smoothly.

davidhadas commented 1 year ago

@nak3

Yeah, I think it is alright to replace current internal-encryption with enabled if the new configuration will be implemented smoothly.

I can't see a way to ensure that the new configuration will be implemented smoothly at this point. Minimal is our only option for enabling support without mandating an activator on-route for all requests.

Enabled without mandating an activator on-route is not something I expect to be available any time soon.

Moving the original internal-encryption forward (now renamed as Minimal) will allow us to introduce end-to-end encryption and move it from Alpha to Beta as soon as the PRs for activator and queue are approved. The other trust levels will evolve at a later time.

nak3 commented 1 year ago

If the new configuration will not be implemented smoothly, I think we should keep Minimal.

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per https://github.com/knative/serving/issues/11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

davidhadas commented 1 year ago

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per #11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

11906 was initially planned to be implemented using internal-encryption which is now called Minimal.

11906 specifically requires support for "Ingress -> Queue-Proxy" path, which we can only achieve using Minimal

Support for Minimal is therefore in line with the defined criteria, removing "Minimal" means we no longer meet the criteria indicated in #11906.

nak3 commented 1 year ago

Sorry I referred to this specific comment https://github.com/knative/serving/issues/11906#issuecomment-1415654516

evankanderson commented 1 year ago

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per #11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

When you say Enabled level configuration, do you mean enabling encryption by default? I think that's a good target for Beta, though some KIngress implementations (gateway API) may need to disable for now.

nak3 commented 1 year ago

Sorry I misunderstood about Implement SNI in activator to avoid activator always in path. David explained about it and it is clear now.

evankanderson commented 1 year ago

We talked a bit about this in the security WG meeting. To summarize, we believe that all the following are true:

  1. The current implementation is the minimal version, which uses the same certificate across all namespacse. We may want to pick a different name that's slighly less confusing, something like shared or bootstrap.
    • This is weak against an attacker which has read access to Secrets in any Kubernetes namespace AND network-level write access.
  2. There are two feature additions we need to reach the "GA" security bar:
    • high-volume: Removing the activator from the request path when space capacity is greater than target-burst-capacity. Activator bypass is currently disabled when TLS is enabled.
    • enabled: Changing the expected SAN for connections between components, so that requests for a service in a given namespace expect that namespace in the subject of the next-hop certificate. These two options interplay with each other -- for enabled and high-volume, we either need ingresses to support requesting multiple subjects, or to implement SNI on the activator (or to generate a huge number of subjects on the activator cert).

Given that we're currently publishing support for the minimal level, we need a path forward with compatibility for any existing adopters. The security WG is recommending that we do the following upgrade path, probably in separate releases:

minimal -> enabled -> high-volume

Once enabled has been rolled out, we should be able to retire the minimal feature in the following release, possibly coordinated with the release of the high-volume feature.

Does that make sense? Is there other data that we're missing?

davidhadas commented 1 year ago

To clarify, the recommendation for 2023, coming out of the recent Security WG meeting is:

  1. We should ship Trust=Minimal while supporting high-volume as our default GA - making this the default for Serving is a huge step forward in the security offered to the Knative community. Trust=Minimal is only slightly less secure than Trust=Enabled and on the other hand, offers a significant advantage over the current Non-TLS support.
  2. For community users that are not using high-volume, we recommend to also support Trust=Enabled as a configuration option. When used, this option should disable high-volume support. Both the recommended default (Trust=Minimal with high-volume), and the more secure option (Trust=Enabled without high-volume) are within reach in the near future and can be GAed in 2023.

As a future plan, we recommend to continue making an effort to support Trust=Enabled with high-volume support. Such work may depend on support from Ingress GW API and/or other external dependencies which are unlikely to be resolved at any near future. Hence such support may or may not be available in 2024/5/6/... Even without such work, the move to end to end encryption with Trust=Minimal is a significant step forward in the security offered to the community.

dprotaso commented 1 year ago

Given that we're currently publishing support for the minimal level, we need a path forward with compatibility for any existing adopters.

This is where I disagree. Only Kourier supports the internal encryption feature. Contour has code but we have zero test coverage. This feature is alpha and we have affordances to forgo the current implementation and skip this complex migration. Additionally, there is no documentation on the website about this feature.

high-volume

I'm confused is this a configuration option the user sets explicitly?

This seems unnecessary for Istio and Kourier because they can support multiple SANs for a backend the activator should just move in and out of the data path when Trust=enabled

For GatewayAPI & Contour we should default the activator to always be in the data path when trust=Enabled, document this clearly on the website, and then enable the activator movement transparently for the user when those implementations support multiple SAN in the future.

I'm not convinced that we should support trust=Minimal in order to have 'high-volume' support - we have two ingress implementations end-users can use to achieve their security & volume requirements.

davidhadas commented 1 year ago

As indicated above the Security WG opted to make TLS our default sooner rather than later (and for very good reasons). From our analysis, the path for getting to TLS as the Serving default is using Minimal. Also, we see the need to support Minimal for many years to come and do not see an option to have any other TLS default for Serving.

@dprotaso (and please correct me if I am misinterpreted your thoughts) clearly prefers to drop Minimal and have TLS supported with Enabled. @dprotaso sees Minimal as a transient phase that is confusing and need to be dropped.

I think we all agree on the facts (I listed everything I could collect below), and that we have two options:

Option A

Option B


I believe we all agree on the many items below (please speak out if not):

  1. Minimal is dramatically more secure than non-TLS option. It offers data encryption in-transit and prevents MITM for an attacker that can manipulate the cluster network but have no access to any of the cluster secrets.
  2. Enabled is more secure than Minimal as it prevents a certain attack vector where the attacker has access to one secret in one namespace and can manipulate the the cluster network to perform MITM attack. (With Enabled the attacker needs access to the specific secret on the namespace being attacked).
  3. Closing the gaps to reach a GA with Minimal is doable in a relatively short time, and is a manageable effort - (seems as if we could have been there already if it was not for the disagreement as for the path forward).
  4. Minimal supports high-volume for all ingresses - The term high-volume refers to the case where we do not mandate activator in path for all requests, and instead support direct communication between the ingress and queue.
  5. We do not identify anything preventing us to move Minimal to GA and make it the default for Serving.
  6. Support for Enabled with high-volume is apparently doable at present in Kourier and Istio but we did not POC it and it represent another gap we need to close to reach GA with Enabled
  7. Support for Enabled with high-volume is apparently not doable in Contour, Ingress GW any time soon. Instead what we will be required to do is to enforce all requests to go via the activator for these ingresses.
  8. It will probably not make sense to use Enabled as our default before we can solve how we can support high-volume for all ingresses - (not an option I expect we will have in the next 1,2,3... years). This means we will continue to use non-TLS as our default.
  9. We started with the implementation of a number of trust levels that include Minimal, Enabled, Mutual, ... and merged a number of PRs in a couple of packages. We also created #14063 and #13969 for Serving for that purpose.

Still, although we seem to all agree on the above list, we disagree on the question of what should we do.

I am opting for a decision (one way or another) such that I can clean my desk form this effort as I am gradually focusing more and more on Zero-Trust aspects of K8s and Knative.

@evankanderson, @dprotaso, @nainaz @ReToCode @skonto @nak3, @rhuss, @psschwei

dprotaso commented 1 year ago

We do not identify anything preventing us to move Minimal to GA and make it the default for Serving. ... I am opting for a decision (one way or another) such that I can clean my desk form this effort as I am gradually focusing more and more on Zero-Trust aspects of K8s and Knative.

I've been pretty clear that we should drop Minimal for the past few months now. I don't understand why we're ignoring my feedback and revisiting this discussion every two weeks.

Closing the gaps to reach a GA with Minimal is doable in a relatively short time, and is a manageable effort - (seems as if we could have been there already if it was not for the disagreement as for the path forward).

Like I stated above - Enabled should be fairly easy to support today since multiple upstream SAN are supported by Kourier & Istio. Users would then have two options to run Knative securely and where the activator doesn't need to be in the data plane (aka. high-volume).

For Contour/Gateway API we can wait until their API can support such a change. It would be great if folks could engage/discuss/contribute in those respective projects and communities.

davidhadas commented 1 year ago

I've been pretty clear that we should drop Minimal for the past few months now. I don't understand why we're ignoring my feedback and revisiting this discussion every two weeks.

@dprotaso - No-one ignores your feedback. In parallel, no-one seem to be able to make sense as for why Option A is your choice over Option B - I have talked to almost anyone related to this effort to try to figure this out.

Since this is going nowhere, lets close the 2 respective PRs, consider it a POC that others can use as reference future Serving TLS support.

dprotaso commented 1 year ago

no-one seem to be able to make sense as for why Option A is your choice over Option B

What's not clear about my position?

I've stated previously:

  1. Supporting 'Enabled' using multiple SANs should be straight forward to do in Kourier and Istio
  2. Given 1) The extra work/testing/maintenance to support 'Minimal' seems unnecessary
  3. Given 1) Why introduce and default to 'Minimal' but have our docs recommend using 'Enabled' or higher? It's going to be confusing for end-users/admins.
  4. Please work upstream in the respective projects Contour/Gateway API to provide a path for 'Enabled' to be supported (via multiple SAN support)
davidhadas commented 1 year ago

Enabled is not suitable to be used as Serving default before all ingresses can support Enabled with high-volume. It is very possible it will never be suitable as the default (e.g. if Gateway API will not support it). This is why Security WG chose after much deliberation to keep the Minimal option.

Minimal in many ways was designed as a subset of Enabled, same as Enabled being a subset of Mutual and Mutual being subset of Identity etc. - the claim about the extra work/testing/maintenance to support Minimal is not accurate and at the same time one can make similar arguments to remove Enabled and make Mutual the default, or maybe remove Mutual also and make Identity (which is not yet designed) to be the default.

It boils down to 'A Bird in the Hand is Worth Two in the Bush'.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.