kubeflow / dashboard

Kubeflow Central Dashboard is the web interface for Kubeflow
Apache License 2.0

profiles: Work with InferenceServices #13

Open yanniszark opened 3 years ago

yanniszark commented 3 years ago

/kind bug

What steps did you take and what happened:

Deploying an InferenceService in a Profile namespace does not work by default, because:

- The InferenceService Pod is injected with an Istio sidecar.
- The Istio sidecar does not accept traffic from Knative, which needs to ping the Pod in order to work correctly.

What did you expect to happen:

Profiles should work with InferenceServices.

I would like to discuss how we should solve this problem and get input from KFServing (cc @kubeflow/wg-serving-leads). For example:

- Should we extend KFServing examples to disable sidecar injection by default?
- Should we apply some sort of policy to the Profile namespaces that will make them work well with Knative? If yes, what should this policy be?

Anything else you would like to add:

@Subreptivus has created a PR extending the profile-controller to apply an Istio policy that allows all requests to /healthz, /metrics paths for KFServing (https://github.com/kubeflow/kubeflow/pull/5848). Is that an ideal solution, given that it affects all Pods and not just KFServing Pods? Can we take a more fine-grained approach and allow only the traffic needed?

cc @kubeflow/wg-notebooks-leads

davidspek commented 3 years ago

I'd just like to point out that the PR by @Subreptivus matches what I found to be necessary for Knative; the Knative docs also state that this authorization policy is required.

https://knative.dev/v0.22-docs/serving/istio-authorization/#allowing-access-from-system-pods-by-paths

A bit higher on that page it also states that it might be necessary to add the knative-serving namespace to the allow policy for the user namespace.
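For illustration, a namespace-based rule along those lines could use Istio's source.namespaces match; the policy name and target namespace below are placeholders, not the actual implementation:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-knative-serving   # hypothetical name
  namespace: user-namespace     # the Profile namespace
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["knative-serving"]

Note that this allows all traffic from the knative-serving namespace, which is broader than the path-based approach.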

yuzisun commented 3 years ago

There has been discussion on this issue. I think the profile controller is the right place to create the authorization policy; the policy can also use the workload label for fine-grained control.

yanniszark commented 3 years ago

Thanks @yuzisun! From issue https://github.com/kubeflow/kfserving/issues/1558, I understand that, as @pvaneck mentioned, there is a recommendation from Knative:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allowlist-by-paths
  namespace: serving-tests
spec:
  action: ALLOW
  rules:
  - to:
    - operation:
        paths:
        - /metrics   # The path to collect metrics by system pod.
        - /healthz   # The path to probe by system pod.

I think we can make this even better by taking advantage of Knative's default labels: https://knative.tips/pod-config/pod-metadata/

Instead of allowing access to all Pods, we can only allow access to Knative Pods. Aka, Pods with the serving.knative.dev/service and serving.knative.dev/revision labels. However, I'm not sure if Istio AuthorizationPolicy can express that right now 🤔 For that reason, I commented in https://github.com/istio/istio/issues/30598 to let upstream istio know that we want this feature.
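As a rough sketch of the workload-scoped alternative: Istio's AuthorizationPolicy selector can restrict a policy to Pods carrying certain labels, but matchLabels requires exact values, so a concrete Knative service name would have to be filled in per service (the name, namespace, and service value below are placeholders). This is exactly why label-presence matching would be needed upstream:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-probes-knative-pods   # hypothetical name
  namespace: user-namespace
spec:
  selector:
    matchLabels:
      serving.knative.dev/service: my-model   # must name a concrete service
  action: ALLOW
  rules:
  - to:
    - operation:
        paths:
        - /metrics
        - /healthz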

google-oss-prow[bot] commented 3 weeks ago

@yanniszark: The label(s) kind/bug cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubeflow/dashboard/issues/13):

> /kind bug
>
> **What steps did you take and what happened:**
>
> Deploying an InferenceService in a Profile namespace does not work by default, because:
> - The InferenceService Pod is injected with an Istio sidecar.
> - The Istio sidecar does not accept traffic from Knative, which needs to ping the Pod in order to work correctly.
>
> **What did you expect to happen:**
>
> Profiles should work with InferenceServices.
>
> I would like to discuss how we should solve this problem and get input from KFServing (cc @kubeflow/wg-serving-leads). For example:
> - Should we extend KFServing examples to disable sidecar injection by default?
> - Should we apply some sort of policy to the Profile namespaces that will make them work well with Knative? If yes, what should this policy be?
>
> **Anything else you would like to add:**
>
> @Subreptivus has created a PR extending the profile-controller to apply an Istio policy that allows all requests to `/healthz`, `/metrics` paths for KFServing (https://github.com/kubeflow/kubeflow/pull/5848). Is that an ideal solution, given that it affects all Pods and not just KFServing Pods? Can we take a more fine-grained approach and allow only the traffic needed?
>
> cc @kubeflow/wg-notebooks-leads

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
yanniszark commented 3 years ago

Testing Knative's proposed policy

After applying Knative's proposed policy, as mentioned by @pvaneck, I confirmed that:

I have also confirmed something that does NOT work:

- Users can NOT get KFServing predictions through the cluster-local ingress (cluster-local-gateway). There is no rule allowing traffic FROM the cluster-local-gateway TO Pods in user namespaces.

Cluster-Internal Traffic

In KFServing's current architecture, all cluster-internal traffic must pass through the cluster-local gateway. However, this breaks isolation between tenants, i.e. different namespaces. If I'm not mistaken, KFServing wants to support a mesh-only mode (as Knative supports) and this effort is tracked by: https://github.com/kubeflow/kfserving/issues/1371

@yuzisun could you let me know what is KFServing's advice on this matter?

yanniszark commented 3 years ago

/wait-for-drain path

@Subreptivus found out that there is an additional /wait-for-drain path used by Knative. Knative uses this path to wait until all requests in the queue are complete before terminating a Pod. Istio blocks access to this path, as I confirmed by deleting a Knative Pod while streaming its logs.

Thus, the Knative docs seem incomplete: https://knative.dev/v0.22-docs/serving/istio-authorization/#allowing-access-from-system-pods-by-paths

I opened an issue in Knative to track this and ask them about it: https://github.com/knative/docs/issues/3760

danishsamad commented 3 years ago

I have also confirmed something that does NOT work:

  • Users can NOT get KFServing predictions through the cluster-local ingress (cluster-local-gateway). There is no rule allowing traffic FROM the cluster-local-gateway TO Pods in user namespaces.

Is there any recommendation for enabling calls to inference services internally via the cluster-local gateway? I have tried a couple of authorization policies, but they don't seem to work.

I keep getting a rbac_access_denied_matched_policy[none] error in the istio-proxy container within the inference service pod.

Update: adding /v1/models/* to the authorization policy allow list solved the problem, but I am not sure if this is the right approach.
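For reference, a sketch of what that workaround could look like as an extra path in the allow rule (paths taken from the policies discussed above; note that /v1/models/* covers the KFServing v1 prediction endpoints and therefore broadens what the rule exposes):

- to:
  - operation:
      paths:
      - /healthz
      - /metrics
      - /wait-for-drain
      - /v1/models/*   # KFServing v1 prediction endpoints; wider exposure than probe paths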

davidspek commented 3 years ago

@yanniszark Currently the profile controller adds an allow rule for /healthz, /metrics and /wait-for-drain to allow access for the Knative activator and controller probes using the following YAML snippet:

- to:
  - operation:
      paths:
      - /healthz
      - /metrics
      - /wait-for-drain

However, this implementation allows for leaking information hosted on these commonly used endpoints (also from other services). As a proof of concept, I tried to curl a notebook server running in a different namespace, which resulted in the expected RBAC: access denied response. Then I tried to curl the notebook server at the /metrics endpoint, which did not result in an RBAC: access denied response. The rule allowing Knative probes access to the above-mentioned paths needs an additional condition to resolve this. The first thing that comes to mind is extending the statement to include a source principal:

- from:
  - source:
      principals: ["cluster.local/ns/knative-serving/sa/controller"]
  to:
  - operation:
      paths:
      - /healthz
      - /metrics
      - /wait-for-drain

davidspek commented 3 years ago

After some testing, the above solution will not work, since the knative-serving namespace does not have Istio injection enabled; a quick test with injection enabled also didn't fix the issue.

thesuperzapper commented 3 years ago

We should also make sure we don't introduce issues for users who are using Kubeflow but not KFServing for their model deployment (and therefore may not have Knative installed).

markwinter commented 3 years ago

To add to my comment https://github.com/kubeflow/kfserving/issues/1558#issuecomment-833271258

I separated the policy into two and limited them to the Predictor and Transformer with a selector, as @yuzisun mentioned above:

selector:
  matchLabels:
    component: predictor

selector:
  matchLabels:
    component: transformer

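Putting that together, one full per-component policy might look like the following sketch (the name and namespace are placeholders; the paths are the probe paths discussed earlier in this thread, and a second, analogous policy would cover the transformer):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-probes-predictor   # hypothetical name
  namespace: user-namespace
spec:
  selector:
    matchLabels:
      component: predictor       # scope the policy to Predictor Pods only
  action: ALLOW
  rules:
  - to:
    - operation:
        paths:
        - /healthz
        - /metrics
        - /wait-for-drain
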
stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

surajkota commented 2 years ago

Linking this PR, which should resolve the issue: https://github.com/kubeflow/kubeflow/pull/6013

juliusvonkohout commented 3 weeks ago

/transfer dashboard