ucbrise / clipper

A low-latency prediction-serving system
http://clipper.ai
Apache License 2.0

[Metrics] Prometheus does not always detect Clipper metrics #716

Open jacekwachowiak opened 5 years ago

jacekwachowiak commented 5 years ago

Partially continuing from #687: after the RBAC update I can now access Prometheus. However, a few things are not working as they should:

  1. I am using the method clipper_conn.cm.get_metric_addr(). The port is correct, but the returned IP is always the same (and sometimes wrong; in my case it is .33, my Kubernetes node3 out of 3). Could it be that the last node's IP is always returned? The metrics pod is scheduled on a random node, so I have to run kubectl describe pod metrics-at-default-cluster-... to find the node actually hosting the pod and replace whatever IP get_metric_addr() returned (see the sketch after this list). On its own this is not a big problem.
  2. Another thing is that the metrics pod and the other Clipper pods are not started on the same node (and I don't know why query-frontend always seems to end up on .33; I will run it a few more times to see if anything changes). If that is expected, please ignore this point.
  3. The most important one: when I open Prometheus in the browser, I cannot see the Clipper-related metrics, only a few generic ones like process_* and scrape_*, whenever the metrics pod and the model pod are on different nodes. If I restart and am lucky enough to get the metrics pod on the same node as the model, everything works fine, which on my 3-node cluster happens about 50% of the time. I found that scaling up the replicas makes the metrics pod pick up the Clipper metrics, which probably means it needs a model pod on the same node. Adding just one replica that lands on the node with the metrics pod fixes the Prometheus browser view, but leads to:
  4. The Prometheus count is not correct: for clipper_mc_pred_total it only detects the local pod on the same node. Scaling down and eliminating that pod brings the count back to 0 according to Prometheus, while the logs of the model pod on the other node are perfectly fine. Update (screenshots): "Correct Prometheus" shows correctly detected Clipper metrics; "Incorrect Prometheus" shows no metrics detected when the model and metrics pods are on different nodes; "Prometheus replicas" shows the situation where a replica was added, then removed, then added again. Even though the original model pod was working correctly, it was never detected: the green line shows 10 results, but the model was sent 20.
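For point 1, this is roughly the workaround I use (a minimal sketch; it assumes clipper_conn is an already-connected ClipperConnection, and the node IP is a placeholder that I look up manually with kubectl describe pod):

```python
# Keep the port from get_metric_addr(), but swap in the IP of the node that
# actually hosts the metrics pod (found via `kubectl describe pod metrics-at-...`).
metric_addr = clipper_conn.cm.get_metric_addr()      # e.g. "10.233.x.33:30xxx"
metric_port = metric_addr.split(":")[-1]

metrics_node_ip = "10.233.x.31"                      # placeholder, looked up manually
prometheus_url = "http://{}:{}".format(metrics_node_ip, metric_port)
print(prometheus_url)
```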

Update: In the Prometheus targets page, one endpoint (the model pod) shows DOWN with "context deadline exceeded". Is Clipper using the default values for the scrape interval and timeout? I think maybe 5 seconds is not enough. Update 2: Changing the timeouts to 30s did not help.

Update 3: If Prometheus is on a different node than the query-frontend, it still connects to it without problems. Yet Prometheus cannot access the /metrics endpoint of a pod on another node. I tried running the commands from inside the Prometheus pod with kubectl exec -it metrics-at-default-cluster-7c59547d5f-mrvmn nc 10.233.92.156 1390 followed by GET /metrics HTTP/1.1, and directly from the console with curl -v 10.233.92.156:1390/metrics. If I curl from the command line of the node hosting the pod, I get the metrics without any problem. I am starting to think there must be some authorization/permission problem in the deployment of the models that prevents communication from other nodes.
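For reference, a rough Python equivalent of the curl test above (just a sketch; the IP and port are the ones from my cluster, and running it from different nodes shows the same reachability difference):

```python
# Probe the model pod's /metrics endpoint the way curl does: on the pod's own node
# this prints the metrics, from another node it times out in my cluster.
import requests

try:
    resp = requests.get("http://10.233.92.156:1390/metrics", timeout=5)
    print(resp.status_code)
    print(resp.text[:500])
except requests.exceptions.RequestException as exc:
    print("Cannot reach /metrics:", exc)
```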

Update 4: Is there a way to know how the model pod handles requests to the /metrics endpoint? Is it different from the query-frontend's approach? The Kubernetes YAMLs look similar, the port (1390) is the same, and both are open to Prometheus, but maybe there is some difference that leaves one reachable and the other one, the model, not.

rkooo567 commented 5 years ago

Hi, I was a little busy yesterday. I will look into it tonight.

rkooo567 commented 5 years ago

It seems like the nodes are isolated from each other for some reason. Are you using useInternalIP=True? I found one suspicious piece of code that might be the cause, but I am not 100% sure yet. I will discuss it with @simon-mo and reply again in a few days.

jacekwachowiak commented 5 years ago

Yes, useInternalIP=True is necessary, otherwise it crashes. It seems that the pods cannot talk across nodes, and yet query-frontend and metrics can, even though they are on different nodes.

rkooo567 commented 5 years ago

One more question. Are you providing kubernetes_proxy_addr?

jacekwachowiak commented 5 years ago

No. If I define kubernetes_proxy_addr, Clipper never finishes initializing and kubectl proxy keeps returning http: proxy error: context canceled. EDIT: Even if I abort the initialization and continue with the deployment, I am unable to access Prometheus at the address it creates. / OK, I ignored the address from get_metric_addr(), used the node IP with the Clipper metrics port, and I can access it again; so it seems the kubectl proxy changed nothing.
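For completeness, this is roughly how I create the connection (a minimal sketch; the commented-out proxy address is just the default kubectl proxy endpoint, shown only to indicate where I passed kubernetes_proxy_addr when experimenting):

```python
from clipper_admin import ClipperConnection, KubernetesContainerManager

# useInternalIP=True is required in my cluster; passing kubernetes_proxy_addr
# (e.g. the default `kubectl proxy` endpoint) makes initialization hang as
# described above, so I normally leave it out.
manager = KubernetesContainerManager(
    useInternalIP=True,
    # kubernetes_proxy_addr="127.0.0.1:8001",
)
clipper_conn = ClipperConnection(manager)
clipper_conn.start_clipper()
```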

jacekwachowiak commented 5 years ago

Currently I am running another, independent Prometheus deployment and service that in theory detects everything, and the situation is exactly the same: the model pod on another node is detected but not scraped, its metrics are not 'selectable', and if I run the model on the same node, everything is fine.

rkooo567 commented 5 years ago

Okay. I will write answers to some of the questions, but I am not 100% sure yet. I will verify all of them within a few days.

1, 2 -> I think this code is the cause. https://github.com/ucbrise/clipper/blob/ddea39d688625e6baed413bb57c32e7fe35fa757/clipper_admin/clipper_admin/kubernetes/kubernetes_container_manager.py#L424

If you use internal IPs without kube proxy, only the last node is added to the list of external IP addresses. When you call get_metric_addr, it returns the first entry of that list (which I suppose is meant to be the master node), and since the list only contains one entry, you always get the last node's IP from get_metric_addr. I am sure it is a bug, but I will verify whether it is the intended behavior.
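Purely as an illustration of the suspected behavior (this is not the actual Clipper code, and the IPs are made up): if the per-node loop keeps only one address instead of appending, the first element of the list is always the last node:

```python
# Hypothetical sketch of the suspected bug: reassigning instead of appending
# leaves only the last node's IP, so external_node_hosts[0] (used to build the
# metrics address) always points at the last node.
node_internal_ips = ["10.233.0.1", "10.233.0.2", "10.233.0.3"]   # made-up IPs

external_node_hosts = []
for ip in node_internal_ips:
    external_node_hosts = [ip]        # suspected bug: overwrite, not append

print(external_node_hosts[0])         # always the last node, here "10.233.0.3"
```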

Regarding 3, 4, and the rest: Prometheus scrapes metrics from the query frontend, not from the models. The query frontend is deployed with frontend_exporter, a server that uses the Prometheus client and exposes the /metrics endpoint. Prometheus should always be able to scrape those metrics because the frontend exporter and the query frontend are in the same pod definition.
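To illustrate the pattern (this is not Clipper's actual frontend_exporter, just a minimal prometheus_client sketch; the metric name and port 1390 are borrowed from this issue):

```python
# A tiny /metrics server of the kind Prometheus scrapes. prometheus_client's
# start_http_server exposes the registry on the given port; the Counter below
# is exported as clipper_mc_pred_total.
import time
from prometheus_client import Counter, start_http_server

pred_total = Counter("clipper_mc_pred", "Example prediction counter")

if __name__ == "__main__":
    start_http_server(1390)      # serve /metrics on port 1390
    while True:
        pred_total.inc()         # stand-in for real metric updates
        time.sleep(1)
```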

I am not quite sure why Prometheus cannot scrape metrics from different nodes (especially since Kubernetes takes care of service discovery). I will investigate this and get back to you in a few days. Meanwhile, can you verify that the namespace and cluster name are the same for all pods?
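If it helps, here is one way to check that from Python with the official kubernetes client (just a sketch, roughly equivalent to kubectl get po -o wide; it assumes a local kubeconfig):

```python
# List the pods in the `default` namespace with their IP and node, to confirm
# they all share the same namespace.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.metadata.namespace,
          pod.status.pod_ip, pod.spec.node_name)
```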

rkooo567 commented 5 years ago

I found a couple of posts that might help you debug this problem.

https://stackoverflow.com/questions/45982488/kubernetes-pods-cannot-find-each-other-on-different-nodes
https://github.com/projectcalico/cni-plugin/issues/373

Also, you could probably ask about it in the Kubernetes issue tracker as well. The Kubernetes setup can be one of the causes of your problem (because I cannot find anything in the Clipper code that explains why Prometheus cannot discover the query frontend from other nodes). I will talk about it with @simon-mo this Thursday and figure out whether there is any issue on our end.

jacekwachowiak commented 5 years ago

Thank you for the links, I will take a look and answer the namespace question as soon as possible. The Prometheus pod can detect the frontend pod on a different node, but then no model metrics can be viewed. Does this mean that Prometheus failing to scrape the model is still fine as long as it gets everything from the frontend pod?

jacekwachowiak commented 5 years ago

All pods use the same namespace = default. What do you mean by cluster name? I checked each of these pods with kubectl describe pod ...:

[cloud-user@k1 prom_test]$ kubectl get po -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE    IP             NODE    NOMINATED NODE   READINESS GATES
metrics-at-default-cluster-7c59547d5f-qbh42                       1/1     Running   0          16h    10.233.96.15   node2   <none>           <none>
mgmt-frontend-at-default-cluster-7bcc767dc8-grw2g                 1/1     Running   0          16h    10.233.96.16   node2   <none>           <none>
query-frontend-0-at-default-cluster-d5cc8ddc-zfdqw                2/2     Running   0          16h    10.233.92.22   node3   <none>           <none>
redis-at-default-cluster-b564d5c9-mxkj7                           1/1     Running   0          16h    10.233.92.21   node3   <none>           <none>
sum-model-1-deployment-at-0-at-default-cluster-7c77c65cd7-xnrwm   1/1     Running   0          6m6s   10.233.92.24   node3   <none>           <none>

jacekwachowiak commented 5 years ago

I have created a new cluster with a different network plugin (weave instead of calico) and the problem is gone, so it was a Kubernetes networking problem; Clipper is working correctly now. I tried flannel as well, but ran into a different problem. I don't know exactly where the problem was/is, but the fix/change is OK for now. Interestingly, Prometheus is now accessible even with the "incorrect" IP: if the Prometheus pod is on x.x.x.1 and the other node is x.x.x.2, going to x.x.x.2:30xxx works fine. I will leave this issue open for now, tell me if you want it closed though!

rkooo567 commented 5 years ago

@jacekwachowiak Glad it is resolved! I guess the network was partitioned for some reason and there was a problem with inter-node communication. Maybe a configuration issue? I will keep the issue open until I resolve the things I mentioned above.

Thanks again for raising an issue in detail!

P.S. The -at-{something-something}- part is the cluster name.