kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.61k stars 698 forks source link

TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323

Closed jlewi closed 6 years ago

jlewi commented 6 years ago

I have the dashboard running and tried accessing it over the K8s API server proxy. For a brief second I see the dashboard but then I just get a grey screen.

When I connect via kubectl port-forward to the pod it works.

is this expected?

wbuchwalter commented 6 years ago

This is because using kubectl proxy adds a prefix to the path, but the front-end try making request to the root.

see: https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/

Some web apps may not work, particularly those with client side javascript that construct urls in a way that is unaware of the proxy path prefix.

Prometheus and Kashti for example only work with kubectl port-forward. Iirc airflow also has issues with proxy but not port-forward? Would you be ok with explicitly directing toward port-forward in the documentation? If not I can investigate how kubernetes/dashboard solves this.

jimexist commented 6 years ago

FYI i work with Docker for Mac locally, with support for Kubernetes. port-forwarding works for me, haven't tried the other way.

jlewi commented 6 years ago

I think we should try to design our UIs to support having a path prefix. The motivation is that to allow easy secure access from outside the cluster we want to have a single ingress point with a reverseproxy see google/kubeflow#11.

I think kubectl port-forward will be a source of friction so I'd like to avoid it. We will have many UIs (Argo, TFJobs, TensorBoard) etc...

I think an alternative solution would be to give services their own hostname e.g. http://tfjobs-ui.mydomain.com htttp://argo-ui.mydomain.com etc...

I think we can use External DNS but a lot of questions with that approach still remain for me e.g

jlewi commented 6 years ago

So as part of creating a generic reverse proxy solution (google/kubeflow#11) I tried out TB and the dashboard UI behind an Envoy based reverse proxy.

TensorBoard works but the jobs UI doesn't.

My reverse proxy is rooting the app at

https://kubeflow.kubeflow-rl.io/tfjobs/ui/

When I navigate to that page chrome developer console reports a 200 for the main page

Request URL:https://kubeflow.kubeflow-rl.io/tfjobs/ui/
Request Method:GET
Status Code:200 
Remote Address:130.211.28.133:443
Referrer Policy:no-referrer-when-downgrade

But then there are failed requests to some resources because the request path isn't rooted correctly

Request URL:https://kubeflow.kubeflow-rl.io/static/css/main.e1b6790e.css
Request Method:GET
Status Code:404 
Remote Address:130.211.28.133:443
Referrer Policy:no-referrer-when-downgrade
Request URL:https://kubeflow.kubeflow-rl.io/static/js/main.e7655508.js
Request Method:GET
Status Code:404 
Remote Address:130.211.28.133:443
Referrer Policy:no-referrer-when-downgrade

The correct paths should be

https://kubeflow.kubeflow-rl.io/tfjobs/ui/static/css/main.e1b6790e.css
https://kubeflow.kubeflow-rl.io/tfjobs/ui/static/js/main.e7655508.js

Both of which return 200.

I'm guessing this is an issue in the app/html using something like an absolute path as opposed to relative paths.

kflynn commented 6 years ago

What does the Ambassador annotation added for this service look like?

swiftdiaries commented 6 years ago

This is the generated TFJob service YAML:

apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tfjobs-ui-mapping
      prefix: /tfjobs/ui/
      rewrite: /
      service: tf-job-dashboard.kubeflow
  name: tf-job-dashboard
  namespace: kubeflow
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    name: tf-job-dashboard
  type: ClusterIP
kflynn commented 6 years ago

If I'm reading things correctly in the report above, then switching the annotation to use

rewrite: /tfjobs/ui

seems like it should probably fix the issue.

swiftdiaries commented 6 years ago

Let me try it out. Thank you for your comments @kflynn.

kflynn commented 6 years ago

Mea culpa -- left out a /:

rewrite: /tfjobs/ui/
swiftdiaries commented 6 years ago

Ah I made the same mistake, last time as well. This is the corrected service YAML:

apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tfjobs-ui-mapping
      prefix: /tfjobs/ui/
      rewrite: /tfjobs/ui/
      service: tf-job-dashboard.kubeflow
  creationTimestamp: 2018-02-22T21:05:18Z
  name: tf-job-dashboard
  namespace: kubeflow
  resourceVersion: "3670"
  selfLink: /api/v1/namespaces/kubeflow/services/tf-job-dashboard
  uid: 163eda24-1814-11e8-9a3c-025000000001
spec:
  clusterIP: 10.107.237.136
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    name: tf-job-dashboard
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

This is the curl output:

$ curl -v http://localhost:8080/tfjobs/ui/
$ curl -v http://localhost:8080/tfjobs/ui/
*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 8080 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /tfjobs/ui/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< date: Thu, 22 Feb 2018 21:36:29 GMT
< content-length: 19
< x-envoy-upstream-service-time: 5
< server: envoy
< 
404 page not found
* Connection #0 to host localhost left intact
kflynn commented 6 years ago

This is where I suggest dropping into the Ambassador Gitter channel so we can get this sorted out. :)

swiftdiaries commented 6 years ago

Sure thing 👍

kflynn commented 6 years ago

To summarize the conversation from the Ambassador Gitter channel: I misunderstood what @jlewi said a month ago about where the application was expecting to be rooted vs where the Ambassador Mapping was trying to root it. Oops.

Given that the application thinks it's rooted at /, and that it uses absolute paths without any capability for configuration, pretty much the only way to do this is with an Ambassador Mapping that specifies

prefix: /
rewrite: /

If this is the only such application that needs to live behind this Ambassador, that should be enough -- Ambassador knows how to sort Mappings such that very broad matches come later for exactly this reason.

If this is not the only such application, you'll need another constraint in the Mapping. The simplest thing to do is probably to add host, so that only requests to a particular domain name will match (we use this a lot at Datawire, actually).

swiftdiaries commented 6 years ago

Just a thought. We could change the dashboard service slightly. This is from https://github.com/kubeflow/tf-operator/blob/master/dashboard/backend/main.go Current:

p := ":8080"
log.Println("Listening on", p)
http.ListenAndServe(p, nil)

Proposed change:

p := ":8080"
log.Println("Listening on", p+"/tfjobs/ui/")
http.ListenAndServe(p+"/tfjobs/ui/", nil)

And then the ambassador service annotation to be,

prefix: /tfjobs/ui/
rewrite: /tfjobs/ui/

Would this be a viable solution? I don't know about the ramifications of this change. cc/ @Jimexist @wbuchwalter @jlewi

wbuchwalter commented 6 years ago

@swiftdiaries I don't see any issue with your approach, however this would need a bit of testing.

@kflynn

Given that the application thinks it's rooted at /, and that it uses absolute paths without any capability for configuration

I haven't been very active as I am finishing a round of non-stop travel for work, but if you think this issue can be more cleanly solved by modifying the front-end let me know and I will try to find some time to do that.

jlewi commented 6 years ago

@swiftdiaries per @kflynn's comment above

Given that the application thinks it's rooted at /

Can we fix the APP (its our code :)) so it will use the correct root?

I believe we want to configure Ambassador using

prefix: /tf-jobs/ui/
rewrite: /tf-jobs/ui/

If we do this then I'd expect the request that hits the UI has in it somewhere the URL

https://my-domain/tf-jobs/ui/

So we'd like the UI to use https://my-domain/tf-jobs/ui/ as the root for the application and then all paths like "main.e7655508.js" should be computed in terms of that root.

So can we update our REACT app to correctly compute the ROOT and then use that as a prefix for URLs?

swiftdiaries commented 6 years ago

I'll take a look at this.

kkasravi commented 6 years ago

try adding homepage to package.json in tf-operator/dashboard/frontend

{
  "name": "tensorflowk8s-dashboard",
  "version": "0.1.0",
  "description": "Dashboard for kubeflow/tf-operator.",
  "private": true,
  "homepage": ".",
  "dependencies": {

See https://github.com/facebook/create-react-app/issues/527 and https://github.com/facebook/create-react-app/blob/master/packages/react-scripts/template/README.md#building-for-relative-paths

I wanted to test this out myself - i'm a bit unclear how to build/release - it looks like you prep a docker image to do the build and then build within that docker image? The developer instructions seem to assume you're on a linux machine or vm.