Open jacekwachowiak opened 5 years ago
Hi, I was a little busy yesterday. I will look into it tonight.
Seems like the nodes are isolated from each other for some reason. Are you using useInternalIP=True? I found one suspicious piece of code that might be the cause, but I am not 100% sure yet. I will discuss it with @simon-mo and reply again in a few days.
Yes, useInternalIP=True is necessary, otherwise it crashes. It seems that the pods cannot talk across nodes, and yet frontend-query and metrics can, despite being on different nodes.
One more question: are you providing kubernetes_proxy_addr?
No. If I define kubernetes_proxy_addr, Clipper never finishes initializing and kubectl proxy keeps returning http: proxy error: context canceled.
EDIT: And even if I stop the initialization and continue with the deployment, I am unable to access Prometheus at the address created. Ok, I ignored the address from get_metric_address, used the node IP and clipper_metric_port so I can access it again; the kubectl proxy changed nothing, it seems.
Currently I am running another, independent Prometheus deployment and service that in theory detects everything, and the situation is exactly the same: the model pod on another node is detected but not scraped, its metrics are not selectable, and if I run the model on the same node, everything is fine.
Okay. I will write answers to some of the questions, but I am not 100% sure yet; I will verify all of them within a few days.
1, 2 -> I think this code is the cause: https://github.com/ucbrise/clipper/blob/ddea39d688625e6baed413bb57c32e7fe35fa757/clipper_admin/clipper_admin/kubernetes/kubernetes_container_manager.py#L424
If you use internal IPs without kube proxy, only the last node's IP ends up in external_Ip_addresses. When you call get_metrics_addr, it reads the first entry of external_Ip_addresses (which I suppose should be the master node), but since only one entry survives, you always get the last node's IP. I am fairly sure it is a bug, but I will verify whether it is the intended behavior.
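If it helps, here is a minimal, hypothetical sketch of the pattern I suspect (the node IPs, port number, and function bodies are illustrative, not copied from kubernetes_container_manager.py): a loop that reassigns the list instead of appending to it leaves only the last node's IP, so reading the "first" entry later actually returns the last node.

```python
# Hedged sketch of the suspected bug (illustrative names and values only,
# not the actual Clipper code).

def collect_external_ips(node_internal_ips):
    external_ip_addresses = []
    for ip in node_internal_ips:
        external_ip_addresses = [ip]  # suspected bug: reassigns instead of appending
    return external_ip_addresses

def get_metrics_addr(external_ip_addresses, metrics_port=39000):
    # Reads the first entry, expecting it to be the first (master) node.
    return "{}:{}".format(external_ip_addresses[0], metrics_port)

ips = collect_external_ips(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(get_metrics_addr(ips))  # → 10.0.0.3:39000 - always the last node's IP
```

If the loop appended (external_ip_addresses.append(ip)), the first entry would again be the first node, which would match the symptom you describe.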
About 3, 4, and the others: Prometheus scrapes metrics from the query frontend, not from the models. The query frontend is deployed with frontend_exporter, a server that uses the Prometheus client and serves the /metrics endpoint. Prometheus should always be able to scrape it, because the frontend exporter and the query frontend are in the same pod definition.
I am not quite sure why Prometheus cannot scrape metrics from different nodes (especially since Kubernetes takes care of service discovery). I will investigate this and get back to you in a few days. Meanwhile, can you verify that the namespace and cluster name are the same for all of the pods?
I found a couple of posts that might help you debug this problem:
https://stackoverflow.com/questions/45982488/kubernetes-pods-cannot-find-each-other-on-different-nodes
https://github.com/projectcalico/cni-plugin/issues/373
Also, you could ask about it in the Kubernetes issue tracker as well. Your Kubernetes setup may be one of the causes of the problem (I cannot find anything in the Clipper code that would stop Prometheus from discovering the query frontend on other nodes). I will talk it over with @simon-mo this Thursday and figure out whether our end has any issue.
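One quick way to tell "endpoint down" apart from "no route between nodes" is a raw TCP check run from inside the Prometheus pod. This is a hypothetical stdlib-only helper, not part of Clipper:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Local demonstration: a listener we control is reachable.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
listener.listen(1)
host, port = listener.getsockname()
print(can_connect(host, port))    # → True
listener.close()
```

If you run this inside the Prometheus pod pointed at the model pod's IP and port 1390, and it fails while the same check from the model's own node succeeds, the problem is inter-node routing rather than the metrics endpoint itself.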
Thank you for the links, I will take a look and get back to you about the namespace question asap. The Prometheus pod can detect the frontend pod on a different node, but then no model metrics can be viewed. Does this mean that Prometheus failing to scrape the model is still ok, as long as it gets everything from the frontend pod?
All pods use the same namespace, default. What do you mean by cluster name?
I checked each of these pods with kubectl describe pod ...:
[cloud-user@k1 prom_test]$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-at-default-cluster-7c59547d5f-qbh42 1/1 Running 0 16h 10.233.96.15 node2 <none> <none>
mgmt-frontend-at-default-cluster-7bcc767dc8-grw2g 1/1 Running 0 16h 10.233.96.16 node2 <none> <none>
query-frontend-0-at-default-cluster-d5cc8ddc-zfdqw 2/2 Running 0 16h 10.233.92.22 node3 <none> <none>
redis-at-default-cluster-b564d5c9-mxkj7 1/1 Running 0 16h 10.233.92.21 node3 <none> <none>
sum-model-1-deployment-at-0-at-default-cluster-7c77c65cd7-xnrwm 1/1 Running 0 6m6s 10.233.92.24 node3 <none> <none>
I have created a new cluster with a different network plugin - weave instead of calico - and the problem is gone, so it was a Kubernetes network problem; Clipper is working correctly now. I tried flannel as well, but it had another problem. I don't know where exactly the problem was/is, but the fix/change is ok for now.
Interestingly enough, Prometheus is now accessible even with the "incorrect" IP - if the Prometheus pod is on x.x.x.1 and the other node is x.x.x.2, going to x.x.x.2:30xxx works fine (presumably because a NodePort service is exposed on every node, with kube-proxy forwarding the traffic to the pod). I will leave this issue open for now - tell me if you want it closed though!
@jacekwachowiak Glad it is resolved! I guess the network was partitioned for some reason and there was a problem with inter-node communication - maybe a configuration issue? I will keep the issue open until I resolve the things I mentioned above.
Thanks again for raising the issue in such detail!
p.s. (Also, the -at-{something-something}- part is the cluster name.)
Partially continuing from #687. After the RBAC update I can access Prometheus. There are a few things that are not working as they should:

1. clipper_conn.cm.get_metric_addr(): the port is ok, but the IP returned is always the same (and sometimes incorrect, IP=.33 in my case) - my Kubernetes node3 (out of 3). Could it be that the last IP is somehow always shown? The metrics pod is launched on a random node, so I have to do kubectl describe pod metrics-at-default-cluster-... to check which node the pod is on and replace whatever IP get_metric_addr() returned. On its own this is not a big problem.
2. query-frontend seems to be always on .33 (I will run it more times to see if something changes). If that's ok, then please ignore this point.
3. No process_* and scrape_* metrics appear if the metrics pod and the model pod are on different nodes. If I restart and am lucky enough to get the metrics pod on the same node as the model, everything works fine - which, for my 3-node cluster, is 50% of the time. I found out that scaling up the replicas makes the metrics pod pick up the Clipper metrics, which probably means it needs a model pod on the same node - adding just one replica that lands on the node with the metrics pod fixes the Prometheus browser, but leads to:
4. clipper_mc_pred_total detects only the local pod on the same node. Scaling down and eliminating that pod gives 0 records again according to Prometheus, even though the logs of the model pod on the other node are perfectly fine.

Update: images:
- Correctly detected Clipper metrics
- No metrics detected - the model and metrics pods are on different nodes
- A replica was added, then removed and added again. Even though the original model pod was working correctly, it was never detected. The green line shows 10 results, but the model was sent 20.

Update: Prometheus shows Endpoint: DOWN, context deadline exceeded in its targets for one entry - the model pod. Is Clipper using the default values for the scrape timing? I think that maybe 5 seconds is not enough.
Update 2: Changing the timeouts to 30s did not help.
Update 3: If Prometheus is on a different node than the frontend-query, it still connects to it well. Yet Prometheus cannot access the /metrics of a pod outside its own node. I tried running the commands from the Prometheus pod with kubectl exec -it metrics-at-default-cluster-7c59547d5f-mrvmn nc 10.233.92.156 1390 followed by GET /metrics HTTP/1.1, and directly from the console with curl -v 10.233.92.156:1390/metrics. If I curl from the local node's command line, I get the metrics without any problem. I am starting to think there must be some authorization/permission problem in the deployment of the models which prohibits communication with external sources.
Update 4: Is there a way to know how the model's pod handles the request to the /metrics endpoint? Is it different from the frontend-query's approach? The Kubernetes yamls look similar, the port (1390) is the same, and both are open to Prometheus, but maybe there is some difference that makes one open to communication and the other one (the model) not.
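For reference, a /metrics endpoint is normally nothing more than a plain-text HTTP response, so in a plain setup like this there is no per-endpoint authorization involved at that layer. A stdlib-only stand-in (hypothetical sample name and value, not Clipper's real exporter or model code) shows what the nc/curl probes above expect to get back:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS_TEXT = b"clipper_mc_pred_total 20\n"  # hypothetical sample line

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(METRICS_TEXT)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urllib.request.urlopen("http://127.0.0.1:%d/metrics" % port).read()
print(body.decode().strip())  # → clipper_mc_pred_total 20
server.shutdown()
```

Since the same curl succeeds from the model's own node, the endpoint itself is answering correctly; the request is most likely being dropped or timing out in transit between nodes, which would also match the context deadline exceeded target error.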