knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

It is possible to bypass the queue-proxy #6520

Open mikehelmick opened 4 years ago

mikehelmick commented 4 years ago

/area networking

What version of Knative?

v0.11.0-125-gb8f7090cc

Expected Behavior

From within the cluster, it shouldn't be possible to connect directly to the user container and bypass the queue proxy for a revision.

I would expect that this would be prohibited, possibly via network policy.

Actual Behavior

Port 8080 of the user container is exposed and available

Note: this only works if the revision is scaled to 1 or more instances already

Steps to Reproduce the Problem

Deploy a knative service ("webapp" in my example)

get the PodIP

% kubectl get pods/webapp-mrpn8-deployment-6559dcff9b-c2pxx -oyaml  | grep "podIP:"
  podIP: 10.36.0.28

From on-cluster, I am able to connect directly to port 8080 on that pod:

# curl http://10.36.0.28:8080        
<html>
<head>
  <title>Hello there...</title>
</head>
evankanderson commented 4 years ago

Something like the following NetworkPolicy might work (untested):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchExpressions:
        - key: serving.knative.dev/release
          operator: Exists
      podSelector:
        matchLabels:
          app: activator
    ports:
    - protocol: TCP
      port: queue-port

(and a similar NetworkPolicy to allow traffic from the HTTP load-balancers, which would be network stack specific)
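
As a concrete (untested) sketch of what that might look like for the stock Istio ingress, assuming the operator labels the istio-system namespace with name=istio-system (the default installation does not ship a distinctive namespace label):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy-ingress
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio-system   # assumes this label was added by the operator
      podSelector:
        matchLabels:
          app: istio-ingressgateway
    ports:
    - protocol: TCP
      port: queue-port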

/kind good-first-issue

evankanderson commented 4 years ago

Note that NetworkPolicies are additive, so it should be possible to apply the above NetworkPolicy alongside another NetworkPolicy that allows additional traffic (for example, from the ingress).

Note that there are no good namespace labels to select on, at least for the default suggested Istio installation...

evankanderson commented 4 years ago

Ref: https://kubernetes.io/docs/concepts/services-networking/network-policies/

JRBANCEL commented 4 years ago

Is this something we want to enforce? We could simply document it: tell customers not to open a socket on all interfaces, but only on loopback. It could go in the spec alongside the requirement to listen on the port provided by the PORT env variable.

i.e.

http.ListenAndServe("localhost:8080", nil)

instead of

http.ListenAndServe(":8080", nil)

And we should start with our samples: https://github.com/knative/docs/blob/master/docs/serving/samples/hello-world/helloworld-go/helloworld.go#L30.

Code I used to validate what I described above: https://github.com/JRBANCEL/Experimental/tree/master/IsolateContainerInsidePod

evankanderson commented 4 years ago

Requiring all containers to be conscious of network security seems like a problem, particularly if users end up reusing common containers (e.g. hugo for serving a website with the website layer added to a base hugo image).

mpetason commented 4 years ago

/assign mpetason

mpetason commented 4 years ago

@evankanderson I'm going to take a shot at this issue and see if I can resolve it. Would the first step be to get this working for default, then discuss the best way to select the namespaces that should be covered? If we weren't using additional namespaces after installation, we could require it in the config, but users will probably create additional namespaces besides default. I'm getting up to speed on how some of this works, and then I'll see if I can get it to block in default first (should be easy with the examples), then maybe push a PR to discuss additional namespaces.

markusthoemmes commented 4 years ago

Is this only surfaced because the user-port is part of the containerPorts list in K8s? What happens if we remove it from that list during deployment? Is the port still surfaced?

markusthoemmes commented 4 years ago

Tried it, even if we don't surface it, 8080 is curlable.

JRBANCEL commented 4 years ago

Tried it, even if we don't surface it, 8080 is curlable.

Yes, it doesn't matter. Only listening on loopback prevents external connections from being established. See my comment above.

vagababov commented 4 years ago

well, or a network rule, as Evan mentioned (definitely for meshes).

mpetason commented 4 years ago

I did some testing but wasn't able to find where the traffic comes from after it gets cut over from the activator. I tested out a few different sources, but the main one I tested was istio-ingressgateway.

Here's the Network Policy that I thought would work:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy-lb
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio-system
    - podSelector:
        matchLabels:
          app: istio-ingressgateway
    ports:
    - protocol: TCP
      port: queue-port

Activator Traffic

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchExpressions:
        - key: serving.knative.dev/release
          operator: Exists
      podSelector:
        matchLabels:
          app: activator
    ports:
    - protocol: TCP
      port: queue-port

Test steps: changed target-burst-capacity to 0 so the activator would be removed from the path once there was a revision (Victor recommended setting minScale=1 instead), then applied both network policies. The first hit works as it should because of the activator NetworkPolicy; then the pod scales to 1 and traffic fails. I was expecting the Istio NetworkPolicy to allow the traffic, but it fails with "upstream connect error or disconnect/reset before headers. reset reason: connection failure".

I checked a few different logs but didn't see anything useful. I might not be looking in the right place.

Any ideas on where traffic is coming from after the revisions scale?

vagababov commented 4 years ago

To be precise about "Victor recommended changing minScale=1 instead": in addition to tbc=0, also apply minScale=1.
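
For reference, a minimal sketch of how those settings are typically applied as per-revision annotations (untested; annotation spellings vary between releases, e.g. newer releases use autoscaling.knative.dev/min-scale and autoscaling.knative.dev/target-burst-capacity, so verify the names below against your version's docs):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: webapp
spec:
  template:
    metadata:
      annotations:
        # Keep one instance warm and take the activator out of the data
        # path once that instance is up (tbc=0 plus minScale=1).
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/targetBurstCapacity: "0"
    spec:
      containers:
      - image: example.com/webapp   # placeholder image for illustration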

mpetason commented 4 years ago

I asked internally and had a response that might work out as well -

"for this use case wouldn't just allow-from-all to queuePort work - i.e. no need for a from: .. labels at all? that'd block all the other ports, I think, and let queuePort only through - iiuc we're ok with direct access to queuePort from anywhere so that seems to do what we want?"

mpetason commented 4 years ago

Tested this out using a policy that only specifies allowable ports, but allows traffic from anywhere.

Looks like it works when hitting the endpoint from a browser.

It fails when trying to curl from another pod in the environment using the podIP and 8080.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: queue-port

tcnghia commented 4 years ago

Can we try adding this and run our integration tests? If things work fine, we can add an additional test to hit the queue-proxy directly and confirm failure.

mpetason commented 4 years ago

Since this allows all traffic to the queue-port and only selects Knative revision pods, should this be something that is set up during install (config file) instead of for each revision?

tcnghia commented 4 years ago

Yes, this looks like a one-time setup during install, so no need to do it per Revision. Which is also great because there is no code change, and users that dislike the restriction could easily remove it themselves (and potentially replace it with something else to their liking).

mpetason commented 4 years ago

The NetworkPolicy will be applied for default, but it doesn't cover other namespaces. We have a couple of options that I can think of; please feel free to add any others I might not have thought of.

  1. Use this during the original setup and cover the default namespace. Add documentation explaining how this works and recommend that the user applies it to other namespaces.

  2. Apply the networkpolicy per revision so that it will cover all namespaces. Probably have to turn this into an option eventually in case end users want to turn it off.

Looking for feedback/ideas about how we want this to work.

mpetason commented 4 years ago

Another option - one NetworkPolicy per namespace. When creating a new revision, check if a NetworkPolicy is set up for the namespace; if there isn't one, create a new one.

mpetason commented 4 years ago

I'll be working on the one network policy per namespace idea for now. The PR is WIP currently.

bbrowning commented 4 years ago

A big -1 on adding default NetworkPolicy objects out of the box.

In my experience, it's quite hard to come up with an appropriate NetworkPolicy out of the box that won't have unwanted side-effects. Today, with us shipping no NetworkPolicy objects, users can install their own if they want a Knative Service to be isolated from specific other namespaces, specific other pods, or so on.

In the linked PR #7700, for example, it adds a NetworkPolicy that allows any traffic from any pod or any namespace to the queue-proxy port of every Knative Revision pod. So, with that in place, how would a user now decide that certain Knative Revision pods should only be accessible from specific other pods? Specific other namespaces? By trying to lock things down so you can only hit the queue-proxy (a relatively minor problem), we've opened up a much bigger can of worms by impacting the ability for users to actually restrict traffic to their Knative Services as needed.

mpetason commented 4 years ago

@bbrowning I agree, it will be more difficult to control traffic. Maybe we can get more feedback on how we want to approach this. It was suggested that we document this, but also that maybe we should be proactive and enforce it.

If we document it then we could have a sample network policy that will block all traffic except for what is destined for queue-proxy.

I'm open to anything. I'm looking for issues to work on to help :P

bbrowning commented 4 years ago

Perhaps @mikehelmick or someone else can elaborate on why a user shouldn't be able to bypass the queue-proxy. Is it a concern around bad actors? Accidental misuse of Serving? Conformance and exposed API of Serving? Is there a legitimate reason someone may want to bypass the queue-proxy? That kind of information would help determine whether this is something we can ignore, something we should document and tell users why they may want to configure their cluster in this way, or something we should enforce and need to find a solution with fewer side-effects than a NetworkPolicy like this.

vagababov commented 4 years ago

The main problem is that we can't properly scale if there's traffic going past QP.

duglin commented 4 years ago

FWIW, knowing that everything goes thru the QP is helpful to ensure that we can do pre & post message processing if needed w/o asking the user to do anything special in their code - or even be aware of it. This gives us a place to add hooks to do some additional infrastructure logic (ie. magic) for the user in our attempt to make their life easier.

bbrowning commented 4 years ago

I agree that for scaling purposes, the primary traffic needs to flow through the queue-proxy. And, it does today. A user has to intentionally try to bypass the internal or external URL of their Knative Service and directly hit a Knative Revision pod to bypass the queue-proxy.

The open question is whether we should prohibit users from being able to do that. You have to specifically go out of your way to bypass the queue-proxy and hit a Revision pod directly, so is it safe to assume someone doing that has a specific reason they want/need to in a specific circumstance? Or do we have a gap somewhere where users are accidentally sending traffic directly to a Revision pod?

mpetason commented 4 years ago

So what do we want to do about this? Should I add this to a serving meeting agenda to go over? It's not an urgent issue but it seems like there has been more discussion around it recently.

mpetason commented 4 years ago

I went ahead and closed the PR I had open to address this issue as it doesn't seem like we have an agreement on what we want to do.

tcnghia commented 4 years ago

cc @igorbelianski

markusthoemmes commented 3 years ago

@bbrowning's standpoint seems reasonable to me. This is only an "attack vector" from inside the cluster itself. If you don't trust the network inside your cluster, you'd have to go set up appropriate NetworkPolicies, I suppose, but I tend to agree that it's not generically an issue :thinking:

mpetason commented 3 years ago

/unassign

Removing myself from the issue for now as I haven't been as active after the initial work. If we decide to move forward maybe someone more active can pick it up. (I've been focusing on docs/cli)

markusthoemmes commented 3 years ago

My 2c here: @bbrowning's comment is correct and @JRBANCEL's "workaround" is imo a good thing to call out in documentation if this is important to users.

dprotaso commented 3 years ago

Moved to icebox for now - I think there's a class of related issues (ie. the queue-proxy's Prometheus endpoints) and it's probably important to at least offer guidance (via documentation). I think if the Security WG could offer recommendations on best practices that would be great (cc @julz, @evankanderson)

evankanderson commented 3 years ago

/assign

/triage accepted

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

viveksahu26 commented 1 year ago

Hey @evankanderson, I am familiar with the terminology used in this issue, like network policies, Kubernetes networking, and Go. I would love to get started with it. Can you guide me on how to get started?

evankanderson commented 1 year ago

Hello!

There are a few parts to this problem:

  1. We will want to create a single set of NetworkPolicies for each namespace that has Routes or Revisions (not sure which, either seems reasonable), rather than many policies per namespace. There may be some similar code for managing secrets for the activator -> queue-proxy encrypted path.

  2. Different clusters may use different ingress plugins (Istio, Contour, Kourier, Gateway API). We'll need some way to accommodate the ingress -> activator data path. It may be sufficient to have a config value for this which defaults to "all the likely namespaces" (see the sketch after this list).

  3. We need a way to roll this out to existing clusters in a hitless way -- we want the upgrade for existing users to have zero impact. (The bar here for Knative is higher than the k8s upgrade bar.)

  4. In service of the above, you may want to put the code behind a feature flag to start with. This can also help you defer figuring out which of our integration test environments use a CNI that enforces NetworkPolicies.
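
Purely as a hypothetical illustration of the config value mentioned in point 2 (the key below does not exist in Knative; it only sketches the shape such a setting could take):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
data:
  # HYPOTHETICAL key, not an actual Knative setting: namespaces whose
  # ingress pods should be allowed to reach revision pods directly.
  ingress-provider-namespaces: "istio-system,kourier-system,contour-external"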

viveksahu26 commented 1 year ago

(quoting evankanderson's earlier NetworkPolicy suggestion in full)

Hey @evankanderson, I am trying to understand this policy. Do correct me if my understanding is wrong. Basically it says: "Allow pods with the label app: activator, from namespaces having the label serving.knative.dev/release, to communicate with pods having the label serving.knative.dev/revision in the default namespace, on port queue-port."

evankanderson commented 1 year ago

Yes. That comment may have been written before the kubernetes.io/metadata.name label was set on namespaces, so there might be a better way now.
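
For illustration, a sketch of the activator rule using that label (untested; it assumes the Knative Serving control plane, including the activator, runs in the knative-serving namespace):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: knative-serving-user-container-policy
  namespace: default
spec:
  podSelector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: knative-serving   # label set automatically on Kubernetes 1.21+
      podSelector:
        matchLabels:
          app: activator
    ports:
    - protocol: TCP
      port: queue-port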

viveksahu26 commented 1 year ago

Hey @evankanderson, finally, after setting up my lab, I am trying to reproduce this issue as follows:

  1. Deploy this service:

     apiVersion: serving.knative.dev/v1
     kind: Service
     metadata:
       name: helloworld-go
       namespace: default
     spec:
       template:
         spec:
           containers:
           - image: docker.io/viveksahu26/helloworld-go
             ports:
             - containerPort: 8080
             env:
             - name: TARGET
               value: "Go Sample v1"
  2. It is accessible through: curl http://helloworld-go.default.192.168.59.113.nip.io/

  3. Extracted the pod IP:

     kubectl get pods helloworld-go-00002-deployment-6c5f7cc446-7htc8 -o yaml | grep "podIP"
           fieldPath: status.podIP
     podIP: 172.17.0.17

  4. Finally, when trying to curl port 8080 using the pod IP, I get:

     curl http://172.17.0.17:8080
     curl: (7) Failed to connect to 172.17.0.17 port 8080 after 3064 ms: No route to host

Hmm, I am not sure whether I am proceeding the right way or not?

viveksahu26 commented 1 year ago

Hey @evankanderson, to understand this issue I was trying to understand the traffic from the Ingress to the container for different cases:

  1. When instances are not sufficient or not available to handle the request:
  2. When there are enough instances to handle the request:
    • In this case, traffic hits the Ingress and is forwarded to the Activator. Since the instances are sufficient to handle that traffic, the Activator sends the traffic directly to the container, bypassing the queue-proxy. The requests don't go via (or through) the queue-proxy because the incoming traffic can easily be accommodated by the available instances, so there is no need to buffer the requests in the queue-proxy. So finally, traffic is sent directly from the Activator to the container, bypassing the queue-proxy.

If I am correct up to here, then I will move forward with the issue and its solution.

evankanderson commented 1 year ago

This is not quite right; there are three cases, but the queue-proxy, which runs as a per-pod sidecar, is part of the data path in all cases. We rely on the queue-proxy for a number of capabilities.

The three cases are:

  1. A request comes in, and there are no pods running. In this case, we route the request to the activator, which stalls the request and starts scaling up the pods from zero.

  2. A request comes in when there are many pods. In this case, the ingress routes directly to the queue-proxy, bypassing the activator.

  3. Knative scales targeting a per-instance utilization of 70%. For large services, it's easy to reach state 2. For small services, if current capacity - current usage is less than target burst capacity, the activator remains in the request path, as it can often make better balancing decisions with a small number of activators (using subsetting) than a large number of ingresses.

The data path for 1 and 3 is the same; it's only for large numbers of concurrent requests that state 2 occurs.

vagababov commented 1 year ago

To be precise: you can also force (2) by setting TBC to 0; then, as soon as there is one pod, all requests will be routed directly to the service. Or you can exclude (2) altogether by setting TBC=-1; then, regardless of scale, the activators will always be in the request path.

viveksahu26 commented 1 year ago

Thanks @evankanderson for the correction. For the time being, I was trying to understand the flow:

AFAIK,

  1. A client sends an HTTP request to the Knative Serving application using its external address. The request is received by the Knative Serving Ingress.
  2. The Ingress extracts info such as the path and, according to its rules, routes the request to the revision, i.e. the service endpoint.
  3. But unfortunately there were no pods, so the Ingress will instead route the request to the Activator. The Activator then queues the request as a buffer in the queue-proxy, and in the meantime the Activator, with the help of the Autoscaler and Knative Serving, will scale up the number of pods as needed to handle the request. Once the Activator determines that there are available pods to handle the request, it forwards the request to the Queue Proxy, which then routes the request to one of the available pods for processing.
  4. As subsequent requests are made by the end user to the Ingress, they will be routed directly to the Queue Proxy. Question: who updates the Ingress routing here, since just previously it was set to the Activator?

Bisht13 commented 1 year ago

Hey @evankanderson 👋 , I'm here for GSoC'23 and saw this idea posted. I'm really intrigued by the problem and possible solution. I'd want to work on it ✨

davidhadas commented 1 year ago

How about Guard (or the queue) checking whether the user-container listens on POD-IP:USER-CONTAINER-PORT and raising an alert? (I.e. not raising the alert when it listens on 127.0.0.1:USER-CONTAINER-PORT.)

davidhadas commented 1 year ago

From where I stand, if we require the user to use a specific port, we may as well require the use of localhost. But we do need to monitor and alert when this is not done correctly.

We could/should also monitor for other ports the container listens on... they are as problematic as the one we use to communicate with the user-container.

We might go all the way and decide not to accept a service (or block it, or whatever) if it has an illegal open port. However, that seems too drastic and I see no reason to go there, so alerting seems appropriate.