Prometheus monitoring in K8s has incorrect permissions

paul-crease commented 6 years ago

I am unable to run Clipper on K8s version 1.10.0 (using minikube). Prometheus seems to not have correct permissions.

Method to reproduce: install minikube v0.26.1 (K8s version 1.10.0), then start minikube with the command: minikube start --insecure-registry localhost:5000

Then run this Python code to initialize the Clipper cluster on K8s:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Starting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))

clipper_conn.start_clipper(query_frontend_image="clipper/query_frontend",
                           mgmt_frontend_image="clipper/management_frontend")

Expected Result: All components of Clipper are installed, in a running state and queryable

Actual Result: CLI output simply repeats [clipper_admin.py:112] Clipper still initializing.

The K8s logs for the metrics pod contain the following:

level=error ts=2018-04-20T08:08:53.212624498Z caller=main.go:221 component=k8s_client_runtime
err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:default:default\" cannot list pods at the cluster scope"
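To make the denied request explicit, the error line can be split into its parts. The following is a minimal sketch, not part of Clipper or Prometheus; the helper name `parse_forbidden` and the regular expression are illustrative assumptions based only on the message format shown in the log above.

```python
import re

# Matches the "forbidden" message shown in the log line above.
# The optional backslashes cover log lines where the quotes are escaped.
_FORBIDDEN = re.compile(
    r'User \\?"system:serviceaccount:([^:"\\]+):([^:"\\]+)\\?" '
    r'cannot (\w+) (\w+)'
)


def parse_forbidden(line):
    """Return the namespace, service account, verb and resource, or None.

    Hypothetical helper for illustration only.
    """
    match = _FORBIDDEN.search(line)
    if match is None:
        return None
    namespace, account, verb, resource = match.groups()
    return {
        "namespace": namespace,
        "service_account": account,
        "verb": verb,
        "resource": resource,
    }
```

For the line above this reports the `default` service account in the `default` namespace being denied the `list` verb on `pods`, which is the permission Prometheus pod discovery is asking for.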

The K8s dashboard shows the pods are running, but queries hang. For example, listing apps with

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.connect()
print(clipper_conn.get_all_apps())

throws the following error:


18-04-20:10:23:29 WARNING  [kubernetes_container_manager.py:145] No external node addresses found.Using Internal IP address
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:158] Found 1 nodes: 10.0.2.15
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:167] Setting Clipper mgmt port to 31184
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:175] Setting Clipper query port to 32688
18-04-20:10:23:29 INFO     [kubernetes_container_manager.py:185] Setting Clipper metric port to 30510
18-04-20:10:23:29 INFO     [clipper_admin.py:126] Successfully connected to Clipper cluster at 10.0.2.15:32688
Traceback (most recent call last):
  File "list_deployed_apps_kube.py", line 9, in <module>
    print(clipper_conn.get_all_apps())
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/clipper_admin/clipper_admin.py", line 745, in get_all_apps
    r = requests.post(url, headers=headers, data=req_json)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Users/paulcrease/Documents/python_venv/deep_learning/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.0.2.15', port=31184): Max retries exceeded with url: /admin/get_all_applications (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1031bb250>: Failed to establish a new connection: [Errno 60] Operation timed out',))
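
A timeout like this can be separated from Clipper itself by checking whether the node port accepts TCP connections at all. The sketch below uses only the standard library; the function name `port_is_reachable` is an illustrative assumption, and the host and ports mentioned afterwards are the ones printed in the log above.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat all as "not reachable".
        return False
    sock.close()
    return True
```

Calling `port_is_reachable("10.0.2.15", 31184)` from the host machine would show whether the management port is reachable before any Clipper request is made. If it returns False, the problem is likely in networking between the host and the minikube VM rather than in the Clipper API.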

simon-mo commented 6 years ago

@paul-crease

For this snippet:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.connect()
print(clipper_conn.get_all_apps())

If I add clipper_conn.start_clipper() before clipper_conn.connect() (that is, when running Clipper for the first time), I am able to start Prometheus in my minikube environment. The logs show:

level=info ts=2018-04-20T09:02:36.385822391Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-04-20T09:02:36.386613331Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"

Can you try clipper_conn.start_clipper() and see if that will work?

paul-crease commented 6 years ago

@simon-mo - Thanks for the suggestion. I updated the code as suggested but had no luck:

from clipper_admin import ClipperConnection, KubernetesContainerManager
from subprocess import Popen, PIPE

print("Connecting...")
clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.start_clipper()
clipper_conn.connect()
print(clipper_conn.get_all_apps())

Again I just get the error msg below repeatedly:

18-04-20:11:32:30 INFO     [clipper_admin.py:112] Clipper still initializing.

I also tried removing Clipper completely and then re-running the suggested code, but I get the same result as before in the metrics pod logs.

simon-mo commented 6 years ago

@paul-crease

I tried reproducing the issue with a fresh install of minikube v0.26.1, K8s v1.10.0 on a Mac.

I was able to run

clipper_host_public_ip = Popen(['minikube', 'ip'], stdout=PIPE).communicate()[0].strip()
print("Listing apps...")
clipper_conn = ClipperConnection(KubernetesContainerManager(kubernetes_api_ip=clipper_host_public_ip,useInternalIP=True))
clipper_conn.start_clipper()
clipper_conn.connect()
print(clipper_conn.get_all_apps())

with success, although it took quite a long time because the local machine had to pull the images first. On the fresh install it took about 3 minutes at an average download speed of 7-9 Mb/s.

You can first run kubectl get pods and then kubectl describe po/{pod-NAME} | tail to see if the pulling is currently happening. For example, I got results like this:

kubectl describe po/metrics-7d577dbc99-b8qtr | tail
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type    Reason                 Age   From               Message
----    ------                 ----  ----               -------
Normal  Scheduled              2m    default-scheduler  Successfully assigned metrics-7d577dbc99-b8qtr to minikube
Normal  SuccessfulMountVolume  2m    kubelet, minikube  MountVolume.SetUp succeeded for volume "config-volume"
Normal  SuccessfulMountVolume  2m    kubelet, minikube  MountVolume.SetUp succeeded for volume "default-token-cgsxx"
Normal  Pulling                2m    kubelet, minikube  pulling image "prom/prometheus:v2.1.0"

It should also show Waiting: ContainerCreating on the Dashboard.
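
The same check can be scripted instead of reading the describe output by hand. This is a rough sketch under stated assumptions: it relies on kubectl being on the PATH, and the helper names `waiting_reasons` and `fetch_pod_list` are made up for illustration. It reads the `state.waiting.reason` field that Kubernetes reports for each container status.

```python
import json
import subprocess


def waiting_reasons(pod_list):
    """Map pod name -> waiting reasons, given parsed `kubectl get pods -o json`."""
    reasons = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting")
            if waiting:
                reason = waiting.get("reason", "Unknown")
                reasons.setdefault(name, []).append(reason)
    return reasons


def fetch_pod_list():
    """Run kubectl and parse its JSON output (requires kubectl on PATH)."""
    out = subprocess.check_output(["kubectl", "get", "pods", "-o", "json"])
    return json.loads(out.decode("utf-8"))
```

Running `waiting_reasons(fetch_pod_list())` while images are still being pulled should list the affected pods with a reason such as ContainerCreating. An empty result means no container is in a waiting state.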

What's the output when you run minikube config view in a terminal? I'm wondering whether any configuration flags were set, especially flags like:
- registry
- --bootstrapper=kubeadm
- apiserver.Authorization.Mode=RBAC

Lastly, this might (with low probability) be a VirtualBox issue (I noticed the node is 10.0.2.15). If you have Docker installed on your machine, can you try minikube start --insecure-registry localhost:5000 --vm-driver hyperkit? This will use Docker's hypervisor to run the cluster instead of creating the VM in VirtualBox.

Thank you for your patience.

paul-crease commented 6 years ago

Again, thanks for the quick response. I am using OSX v10.11.5

"I'm wondering if it is possible that any configuration flags were set" - I just used defaults, minikube config view returns nothing.

minikube start --insecure-registry localhost:5000 --vm-driver hyperkit - this still results in the same behaviour

kubectl get pods all - I get the same results, everything is Running after a couple of minutes, but the problem persists.

dcrankshaw commented 6 years ago

@chester-leung is going to try to reproduce this on OSX.

chester-leung commented 6 years ago

@paul-crease I'm able to reproduce this on OSX 10.12.6.

I'll look further into this and let you know what I find.

simon-mo commented 6 years ago

@paul-crease Thanks for your patience. @chester-leung and I were able to figure out the issue.

Issue

When minikube runs on VirtualBox (which is the default option), all the ports are closed. Users have to use kubectl proxy to access the Kubernetes API and all the services.
Clipper reports that it is still initializing because it can't access the query frontend.
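
For illustration, this is roughly how a service is addressed through kubectl proxy. The sketch only builds the URL in the standard Kubernetes API proxy format. The proxy address 127.0.0.1:8001 is the kubectl proxy default, while the service name `mgmt-frontend` and port 1338 are assumptions about the Clipper deployment and should be checked with kubectl get services.

```python
def proxied_service_url(proxy_addr, namespace, service, port, path=""):
    """Build the URL of a service endpoint as exposed by `kubectl proxy`.

    Hypothetical helper; follows the Kubernetes API proxy URL format
    /api/v1/namespaces/<namespace>/services/<name>:<port>/proxy/<path>.
    """
    return "http://{}/api/v1/namespaces/{}/services/{}:{}/proxy/{}".format(
        proxy_addr, namespace, service, port, path.lstrip("/")
    )


# Service name and port are assumptions, not confirmed in this thread.
example_url = proxied_service_url(
    "127.0.0.1:8001", "default", "mgmt-frontend", 1338,
    "/admin/get_all_applications",
)
```

With kubectl proxy running, a request to `example_url` goes through the API server rather than a node port, which is why it still works when the node ports are unreachable.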

Solution

The Clipper team is working on making Clipper compatible with the Kubernetes proxy in PR #455.
An alternative is to use docker hyperkit as the vm-driver instead of VirtualBox. You need to do the following:
1. Run minikube delete to delete the current kubernetes cluster
2. Follow https://github.com/kubernetes/minikube/blob/master/docs/drivers.md#hyperkit-driver to install hyperkit driver
3. Run minikube start --vm-driver hyperkit to start a brand new minikube cluster with hyperkit.
4. Now this should work with Clipper because the ports are not closed by default.

paul-crease commented 6 years ago

Hello. Thank you for the solution, it now works as expected. Another solution I found was to downgrade minikube to 0.25.1, which then uses K8s v1.9.4 by default. This also then solves the problem.

boyaryn commented 6 years ago

Thanks for this useful topic.

I have a similar issue. I'm installing Clipper (the latest version) on Google Kubernetes Engine. Initially I was also getting "Clipper still initializing." in the python3 CLI after running clipper_conn.start_clipper(), but running kubectl proxy --port 8080 and providing kubernetes_proxy_addr in the KubernetesContainerManager constructor lets it succeed. For example, I can now see the registered apps with clipper_conn.get_all_apps().

However, the problem hasn't gone away.

I see a lot of log messages like

2018-07-11 13:24:33.000 EEST
level=error ts=2018-07-11T10:24:33.258312991Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:296: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:clipper:default\" cannot list pods at the cluster scope: Unknown user \"system:serviceaccount:clipper:default\""

in the metrics container's log output.
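
On clusters with RBAC enabled, a "cannot list pods at the cluster scope" error for a service account is commonly addressed by binding that account to a ClusterRole that allows reading pods. The following is a sketch only: nobody in this thread confirmed that it resolves the GKE case above, and the function name and the role name `prometheus-discovery` are hypothetical. It builds the manifest as plain data so it can be inspected before it is applied.

```python
import json


def prometheus_rbac_manifest(namespace, service_account="default",
                             name="prometheus-discovery"):
    """Return a List manifest with a ClusterRole and a ClusterRoleBinding.

    The role name is a made-up example; only pod read access is granted.
    """
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [{
            "apiGroups": [""],          # "" is the core API group
            "resources": ["pods"],
            "verbs": ["get", "list", "watch"],
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": name,
        },
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
    }
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}
```

Printing `json.dumps(prometheus_rbac_manifest("clipper"), indent=2)` gives JSON that can be piped to kubectl apply -f -, since kubectl accepts JSON manifests. Creating cluster-wide roles usually requires an account that already has cluster-admin rights.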

In addition, after issuing the command python_deployer.deploy_python_closure(clipper_conn, name="sum-model", version=1, input_type="doubles", func=feature_sum) as described here, I see the "sum-model-1-deployment-at-0-at-tes" deployment with the status "0 of 1 updated replicas available - ImagePullBackOff", and after some time the command fails with the message:

INFO     [clipper_admin.py:474] [test] Pushing model Docker image to test-sum-model:1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/deployers/python.py", line 222, in deploy_python_closure
    registry, num_replicas, batch_size, pkgs_to_install)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/clipper_admin.py", line 352, in build_and_deploy_model
    num_replicas, batch_size)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/clipper_admin.py", line 560, in deploy_model
    num_replicas=num_replicas)
  File "/usr/local/lib/python3.5/dist-packages/clipper_admin/kubernetes/kubernetes_container_manager.py", line 393, in deploy_model
    name=deployment_name, namespace=self.k8s_namespace).status.available_replicas \
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5758, in read_namespaced_deployment_status
    (data) = self.read_namespaced_deployment_status_with_http_info(name, namespace, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 5843, in read_namespaced_deployment_status_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.5/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3e877d67-962f-40bc-82b9-0733c5a4bbe5', 'Date': 'Tue, 10 Jul 2018 23:29:06 GMT', 'Content-Length': '129', 'Content-Type': 'application/json', 'Www-Authenticate': 'Basic realm="kubernetes-master"'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

Could you suggest how to solve this?

ucbrise / clipper

Prometheus monitoring in K8s has incorrect permissions #477
