ucbrise / clipper

A low-latency prediction-serving system
http://clipper.ai
Apache License 2.0

Prometheus monitoring in K8s has incorrect permissions #477

Open paul-crease opened 6 years ago

paul-crease commented 6 years ago

I am unable to run Clipper on K8s version 1.10.0 (using minikube). Prometheus seems to not have correct permissions.

Method to reproduce:

Install minikube v0.26.1 with K8s version 1.10.0.

Start minikube with: minikube start --insecure-registry localhost:5000

Run the following Python code to initialize the Clipper cluster on K8s:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Starting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))

clipper_conn.start_clipper(query_frontend_image="clipper/query_frontend",
                           mgmt_frontend_image="clipper/management_frontend")

Expected Result: All components of Clipper are installed, in a running state and queryable

Actual Result: CLI output simply repeats [clipper_admin.py:112] Clipper still initializing.

The K8s logs for the metrics pod contain the following:

level=error ts=2018-04-20T08:08:53.212624498Z caller=main.go:221 component=k8s_client_runtime
err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:default:default\" cannot list pods at the cluster scope"
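The error above names the service account the Prometheus pod runs as (`default` in the `default` namespace) and says it may not list pods cluster-wide, which is what a cluster with RBAC enabled reports when no role grants that access. A minimal sketch, assuming RBAC is the cause: the snippet below extracts the service account from the log line and builds a read-only ClusterRole and ClusterRoleBinding as plain dictionaries. The role name `clipper-prometheus-reader` and both helper functions are hypothetical, and this is not necessarily the fix the maintainers applied.

```python
import json
import re

# Matches the service account named in a Kubernetes "forbidden" message.
_SA_PATTERN = re.compile(r"system:serviceaccount:([a-z0-9-]+):([a-z0-9-]+)")


def parse_service_account(log_line):
    """Return (namespace, name) of the service account in log_line, or None."""
    match = _SA_PATTERN.search(log_line)
    return (match.group(1), match.group(2)) if match else None


def pod_reader_manifests(namespace, account, role_name="clipper-prometheus-reader"):
    """Build a ClusterRole and ClusterRoleBinding granting read access to pods.

    role_name is a hypothetical name chosen for this example.
    """
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [{
            "apiGroups": [""],  # "" is the core API group, where pods live
            "resources": ["pods"],
            "verbs": ["get", "list", "watch"],
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": role_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [{
            "kind": "ServiceAccount",
            "name": account,
            "namespace": namespace,
        }],
    }
    return [cluster_role, binding]


if __name__ == "__main__":
    line = ('pods is forbidden: User "system:serviceaccount:default:default" '
            'cannot list pods at the cluster scope')
    namespace, account = parse_service_account(line)
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": pod_reader_manifests(namespace, account),
    }
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl apply -f. It only covers the pods resource named in the error; Prometheus service discovery may need read access to further resources, so treat the rule list as a starting point.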

The K8s dashboard shows that the pods are running, but queries hang. For example, listing apps with

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.connect()
print(clipper_conn.get_all_apps())

throws the following error:


18-04-20:10:23:29 WARNING  [kubernetes_container_manager.py:145] No external node addresses found.Using Internal IP address
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:158] Found 1 nodes: 10.0.2.15
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:167] Setting Clipper mgmt port to 31184
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:175] Setting Clipper query port to 32688
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:185] Setting Clipper metric port to 30510
18-04-20:10:23:29 INFO     [clipper_admin.py:126] Successfully connected to Clipper cluster at 10.0.2.15:32688
Traceback (most recent call last):
  File "list_deployed_apps_kube.py", line 9, in <module>
    print(clipper_conn.get_all_apps())
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/clipper_admin/clipper_admin.py", line 745, in get_all_apps
    r = requests.post(url, headers=headers, data=req_json)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.0.2.15', port=31184): Max retries exceeded with url: /admin/get_all_applications (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1031bb250>: Failed to establish a new connection: [Errno 60] Operation timed out',))

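The traceback ends in a connection timeout to 10.0.2.15:31184, the node's internal address that the container manager selected. One way to separate a networking problem from a Clipper problem is to test plain TCP reachability of each port first. A minimal sketch using only the standard library; the helper names are hypothetical, and whether 10.0.2.15 is routable from the host at all is an assumption worth checking against the output of minikube ip.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, socket.error):
        # Covers timeouts, refused connections and unreachable hosts.
        return False
    sock.close()
    return True


def check_ports(host, ports):
    """Return a dict mapping each port name to its reachability on host."""
    return {name: can_connect(host, port) for name, port in ports.items()}
```

For the cluster in the log above, check_ports("10.0.2.15", {"management": 31184, "query": 32688, "metrics": 30510}) would show which of the three NodePorts accept connections from the machine running the script.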
simon-mo commented 6 years ago

@paul-crease

For this snippet:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.connect()
print(clipper_conn.get_all_apps())

If I add clipper_conn.start_clipper() before clipper_conn.connect() (that is, when running Clipper for the first time), I am able to start Prometheus in my minikube environment. The log shows:

level=info ts=2018-04-20T09:02:36.385822391Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-04-20T09:02:36.386613331Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"

Can you try clipper_conn.start_clipper() and see if that will work?

paul-crease commented 6 years ago

@simon-mo - Thanks for the suggestion. I updated the code as suggested but had no luck:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.start_clipper()
clipper_conn.connect()
print(clipper_conn.get_all_apps())

Again I just get the error msg below repeatedly:

18-04-20:11:32:30 INFO     [clipper_admin.py:112] Clipper still initializing.
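Since the "still initializing" message repeats without end, a script that drives the cluster can be easier to debug if its waiting is bounded. A minimal sketch of a polling loop with a deadline; `probe` is a hypothetical callable supplied by the caller (for example an HTTP or TCP check against the management port), and this is not how clipper_admin itself waits.

```python
import time


def wait_until_ready(probe, timeout=120.0, interval=2.0,
                     clock=time.time, sleep=time.sleep):
    """Call probe() until it returns True or timeout seconds have elapsed.

    Returns True if the probe succeeded and False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

When the function returns False, the caller can collect pod logs and exit with a clear error instead of looping indefinitely.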

I also tried removing Clipper completely and re-running the suggested code, but the metrics logs show the same error as before.

simon-mo commented 6 years ago

@paul-crease

It should also show Waiting: ContainerCreating on the Dashboard.

Thank you for your patience.

paul-crease commented 6 years ago

Again, thanks for the quick response. I am using OSX v10.11.5

"I'm wondering if it is possible that any configuration flags were set" - I just used defaults, minikube config view returns nothing.

minikube start --insecure-registry localhost:5000 --vm-driver hyperkit - this still results in the same behaviour

kubectl get pods all - I get the same results, everything is Running after a couple of minutes, but the problem persists.
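A pod can report Running while one of its containers is stuck, so it can help to look at container-level waiting reasons as well as the pod phase. A minimal sketch that summarizes the JSON printed by kubectl get pods -o json; the function names are hypothetical, and `load_pods` assumes kubectl is on the PATH and configured for the cluster.

```python
import json
import subprocess


def summarize_pods(pod_list):
    """Map each pod name to (phase, waiting reasons) from a parsed pod list."""
    summary = {}
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        reasons = []
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                # Typical reasons are ContainerCreating or ImagePullBackOff.
                reasons.append(waiting.get("reason", "Unknown"))
        summary[pod["metadata"]["name"]] = (status.get("phase", "Unknown"), reasons)
    return summary


def load_pods():
    """Fetch pods from all namespaces via kubectl and return the parsed JSON."""
    output = subprocess.check_output(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"])
    return json.loads(output.decode("utf-8"))
```

Calling summarize_pods(load_pods()) against the affected cluster would list every pod with its phase and any waiting reason.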

dcrankshaw commented 6 years ago

@chester-leung is going to try to reproduce this on OSX.

chester-leung commented 6 years ago

@paul-crease I'm able to reproduce this on OSX 10.12.6.

I'll look further into this and let you know what I find.

simon-mo commented 6 years ago

@paul-crease Thanks for your patience. @chester-leung and I were able to figure out the issue.

Issue

Solution

paul-crease commented 6 years ago

Hello. Thank you for the solution; it now works as expected. Another solution I found was to downgrade minikube to 0.25.1, which uses K8s v1.9.4 by default. This also solves the problem.

boyaryn commented 6 years ago

Thanks for this useful topic.

I have a similar issue. I'm installing Clipper (the latest version) on Google Kubernetes Engine. Initially I was also getting "Clipper still initializing." in the python3 CLI after running clipper_conn.start_clipper(), but running kubectl proxy --port 8080 and passing kubernetes_proxy_addr to the KubernetesContainerManager constructor lets it succeed. For example, I can now see the registered apps with clipper_conn.get_all_apps().
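The proxy workaround described above can be sketched as follows. The `kubernetes_proxy_addr` argument name comes from this comment; the address format passed to it ("127.0.0.1:8080", without a scheme), the two-second startup wait, and the helper names are assumptions rather than documented behaviour.

```python
import subprocess
import time


def proxy_command(port=8080):
    """Return the argv for `kubectl proxy` listening on the given local port."""
    return ["kubectl", "proxy", "--port", str(port)]


def proxy_address(port=8080, host="127.0.0.1"):
    """Return the host:port string where the proxy is expected to listen."""
    return "{}:{}".format(host, port)


def connect_through_proxy(port=8080):
    """Start kubectl proxy and connect to a running Clipper cluster through it."""
    # Imported here so the helpers above work without clipper_admin installed.
    from clipper_admin import ClipperConnection, KubernetesContainerManager

    proxy = subprocess.Popen(proxy_command(port))
    try:
        time.sleep(2)  # assumption: gives the proxy time to start listening
        manager = KubernetesContainerManager(
            kubernetes_proxy_addr=proxy_address(port))
        conn = ClipperConnection(manager)
        conn.connect()
        return conn, proxy
    except Exception:
        proxy.terminate()
        raise
```

The caller keeps the returned `proxy` process and calls proxy.terminate() once the connection is no longer needed.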

However, the problem hasn't gone away.

I see a lot of log messages like

2018-07-11 13:24:33.000 EEST
level=error ts=2018-07-11T10:24:33.258312991Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:clipper:default\" cannot list pods at the cluster scope: Unknown user \"system:serviceaccount:clipper:default\""

in the metrics container's log output.

In addition, after issuing the command python_deployer.deploy_python_closure(clipper_conn, name="sum-model", version=1, input_type="doubles", func=feature_sum) as described here, I see the "sum-model-1-deployment-at-0-at-tes" deployment with the status "0 of 1 updated replicas available - ImagePullBackOff", and after some time the command fails with the message:

INFO     [clipper_admin.py:474] [test] Pushing model Docker image to test-sum-model:1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/deployers/python.py", line 222, in deploy_python_closure
    registry, num_replicas, batch_size, pkgs_to_install)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/clipper_admin.py", line 352, in build_and_deploy_model
    num_replicas, batch_size)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/clipper_admin.py", line 560, in deploy_model
    num_replicas=num_replicas)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/kubernetes/kubernetes_container_manager.py", line 393, in deploy_model
    name=deployment_name, namespace=self.k8s_namespace).status.available_replicas \
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5758, in read_namespaced_deployment_status
    (data) = self.read_namespaced_deployment_status_with_http_info(name, namespace, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5843, in read_namespaced_deployment_status_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3e877d67-962f-40bc-82b9-0733c5a4bbe5', 'Date': 'Tue, 10 Jul 2018 23:29:06 GMT', 'Content-Length': '129', 'Content-Type': 'application/json', 'Www-Authenticate': 'Basic realm="kubernetes-master"'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
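This comment contains two different kinds of failure: the Prometheus log reports "forbidden", which is an authorization failure, while the traceback ends in HTTP 401, which is an authentication failure. The two usually call for different fixes. A minimal sketch that classifies a Kubernetes Status body by its code; the function name and the wording of the hints are hypothetical.

```python
import json


def explain_api_failure(body):
    """Return a short hint for a Kubernetes Status JSON body."""
    status = json.loads(body)
    code = status.get("code")
    if code == 401:
        # Authentication: the API server could not establish who is calling.
        return "401 Unauthorized: the request carried no valid credentials."
    if code == 403:
        # Authorization: the caller is known but not allowed to do this.
        return "403 Forbidden: the caller lacks permission (for example via RBAC)."
    return "{} {}".format(code, status.get("reason", "Unknown"))
```

Applied to the response body above, the function returns the 401 hint, which points at the credentials used by the Python client rather than at the RBAC rules behind the Prometheus error.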

Could you suggest how to solve this?