zlf0625 opened this issue 4 years ago
Can you access http://work_node_ip:9103/log-manager/ and http://work_node_ip:9103/healthz/ in your browser?
Yes. If I check my master node, http://work_node_ip:9103/log-manager/ shows an empty directory. I had never seen a link with port 9103 before. However, the link from the Browse log folder button looks like http://MASTER-NODE-IP/log-manager/WORKER-NODE-IP:9103/MY-USER/... (without :9103, so a 502 error happened).
After your reply, I just found that the workers' log-manager is not empty and that the link http://WORKER-NODE-IP:9103/log-manager/MY-USER/... does work (substituting MASTER-NODE-IP with WORKER-NODE-IP:9103). Why did this happen? How can I fix it?
Actually, pylon will redirect http://MASTER-NODE-IP/log-manager/WORKER-NODE-IP:9103/MY-USER/... to http://WORKER-NODE-IP:9103/log-manager/MY-USER/...
https://github.com/microsoft/pai/blob/e116f7722fc549ac228bde7620e411a62a9204ce/src/pylon/deploy/pylon-config/location.conf.template#L91-L95
Please make sure pylon is working.
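In effect, that rule lifts the worker address out of the URL path and proxies the rest to that worker. As a rough illustration only (this is not pylon's actual nginx code, which lives in location.conf.template; the IP below is made up), the path mapping can be mimicked in shell:

```shell
# Hypothetical sketch: mimic pylon's /log-manager/<host:port>/<rest> rewrite.
# The real rewrite is an nginx location rule; this only shows the URL mapping.
rewrite() {
  echo "$1" | sed -E 's#^/log-manager/([^/]+:[0-9]+)/(.*)$#http://\1/log-manager/\2#'
}

# A path received by the master is mapped to the worker's own log-manager:
rewrite "/log-manager/10.0.0.2:9103/alice/job1/"
# -> http://10.0.0.2:9103/log-manager/alice/job1/
```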
Thanks for the clarification. The pod pylon-ds-... seems to run normally. http://WORKER-NODE-IP:9103/log-manager/MY-USER/... (the redirect target) works. Also, http://MASTER-NODE:9103/healthz/ returns "Log manager ready." and http://MASTER-NODE:9103/ shows "Welcome to OpenResty".
Is there anything else I can check (to find out why the redirect is not working)?
@Binyang2014, is it a k8s CNI issue?
@zlf0625 Can you docker exec into the pylon container and run curl http://worker_ip:9103? You may need to install curl in the pylon container first. If this command fails, you can try re-configuring the pylon daemonSet to use hostNetwork; you can change this in the k8s dashboard.
Also, please make sure you can access http://worker_ip:9103 from your master node.
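Concretely, the check might look like this (a sketch: the container name/ID and worker_ip are placeholders, and the package manager inside the pylon image may differ):

```shell
# On the master node: find the pylon container (name varies per deployment)
docker ps | grep pylon

# Enter the container; curl may need installing inside it first
docker exec -it <pylon-container-id> /bin/bash
apt-get update && apt-get install -y curl

# From inside the container, probe a worker's log-manager port
curl -v -m 10 http://worker_ip:9103/healthz

# Then exit and compare with the same probe run directly on the master host
exit
curl -v -m 10 http://worker_ip:9103/healthz
```

If the probe succeeds from the host but times out from inside the container, the problem is in the pod network path, not in log-manager itself.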
I tried docker exec into the pylon container (on the master node) and got curl: (7) Failed to connect to XXX port 9103: Connection timed out for the worker IPs. But I can connect to the same worker address (with port 9103) directly, outside the container.
Can you try to change the pylon-ds to use hostNetwork? Using hostNetwork may solve this issue. Also, which CNI do you use, Weave or Calico?
Please run kubectl get ds -n kube-system and make sure you can find the Weave or Calico CNI; all the Weave/Calico-related pods should be in Ready status. Then run kubectl get deployment -n kube-system and make sure you can find the coredns service.
I paste the results here for reference. Running kubectl get ds -n kube-system gives:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-node 5 5 5 5 5 <none> 5d2h
k8s-host-device-plugin-daemonset 5 5 5 5 5 <none> 4d
kube-proxy 5 5 5 5 5 beta.kubernetes.io/os=linux 5d2h
nvidia-device-plugin-daemonset 5 5 5 5 5 <none> 4d
Running kubectl get deployment -n kube-system gives:
NAME READY UP-TO-DATE AVAILABLE AGE
calico-kube-controllers 1/1 1 1 5d2h
coredns 1/1 1 1 4d2h
dns-autoscaler 1/1 1 1 5d2h
kubernetes-dashboard-openpai 1/1 1 1 4d
I think I am using the default value (Calico), and coredns is running correctly.
Can you run kubectl edit ds pylon-ds and then add hostNetwork: true under spec.template.spec? It would look like:
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: pylon
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: pylon
      name: pylon
    spec:
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pai-master
                operator: In
                values:
                - "true"
      containers:
      - command:
        - /bin/bash
        - /pylon-config/run.sh
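If the interactive edit is awkward, the same change can be applied non-interactively with kubectl patch. A sketch (the dnsPolicy line is my own addition, commonly needed so a hostNetwork pod can still resolve cluster DNS; it is not from this thread, and the namespace may need adjusting for your deployment):

```shell
# Sketch: switch the pylon daemonset to the host network without opening an editor.
kubectl patch ds pylon-ds --type merge -p '
{
  "spec": {
    "template": {
      "spec": {
        "hostNetwork": true,
        "dnsPolicy": "ClusterFirstWithHostNet"
      }
    }
  }
}'
```

Patching the daemonset spec triggers a rolling restart of the pylon pods, so the change takes effect without deleting anything by hand.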
I am not sure about the root cause of this problem; I hope using hostNetwork can mitigate it.
Wow! Thanks, this fixed the problem. Do you have any idea why this helps?
Just before your comment, I had configured the k8s dashboard and made the edit there, but that edit (in a very odd JSON format) didn't work. Your latest comment is very helpful.
I think it may be caused by the network plugin (k8s CNI), but I'm not sure. When a network plugin is enabled during installation, the iptables of the host are modified, and the containers' inbound/outbound traffic is controlled by the CNI. I'm not sure about Calico, but we encountered this issue when using Weave: some peers didn't discover the others, causing a network partition.
Changing to hostNetwork forces the container to bypass the CNI and go through the host network instead.
If you want to dig into the problem, you may need to check the iptables rules on the node and the status of the Calico daemon.
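A few possible starting points for that digging (a sketch: calicoctl must be installed separately, and the k8s-app=calico-node label below is the standard one but may differ in your install):

```shell
# On the affected node: look for iptables rules touching port 9103
sudo iptables-save | grep 9103

# Check packet/drop counters on the FORWARD chain
sudo iptables -L FORWARD -n -v

# Calico BGP peering status (requires the calicoctl binary)
sudo calicoctl node status

# Recent logs from the calico-node pods
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50
```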
For diagnosing a k8s CNI issue, please refer to the community support of the corresponding CNI (Calico in this case). I suspect this has something to do with your host environment, e.g., firewall configuration.
I checked that sudo ufw status returns inactive on the master and all workers.
I am not familiar with k8s and just want to make sure every PAI component is running normally. Also, if I use the host network, will it cause other potential problems?
For PAI, using hostNetwork will not cause any problem. But it breaks network isolation and may cause some security issues; you can refer to https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces for details.
For the CNI issue, you can go to the Calico project and ask for help: https://github.com/projectcalico/calico. I think it's not limited to PAI; every pod running in k8s would suffer from this issue.
Thanks! How do I find where the problem is? All the Calico pods seem to be running normally:
kube-system calico-kube-controllers-7847896956-8w2zt 1/1 Running 0 5d19h
kube-system calico-node-49nx6 1/1 Running 0 5d19h
kube-system calico-node-9wv9w 1/1 Running 0 5d19h
kube-system calico-node-hq8q7 1/1 Running 0 5d19h
kube-system calico-node-lp98d 1/1 Running 1 5d19h
kube-system calico-node-qwzqn 1/1 Running 0 5d19h
...
So I don't even know what I should look for.
Here is a troubleshooting guide for Calico: https://docs.projectcalico.org/maintenance/troubleshoot/troubleshooting. Another way to find a solution is to search https://github.com/projectcalico/calico/issues for similar issues.
You can also open an issue in the Calico project; maybe someone there will help you solve the problem.
Hi, I tried reinstalling the entire cluster (K8s + OpenPAI), but it still doesn't work (still "Loading"). The main reason I reinstalled is that all of our machines lose their Internet connection after several hours (sometimes disconnecting at the same time).
All these machines connect to one switch. However, after such a disconnection, I can still reach the disconnected machines (unreachable from the Internet or from my dev-box) by jumping through the machines that are still reachable. I wonder if this is caused by the network plugin?
I also tried to use Weave, but it failed during installation ("Wait for Weave to become available"). I tried many other plugins as well: flannel (installed successfully, but the login button gave no feedback), kube-router (installed successfully, but I could not open the portal), and the basic "cni" (node status = NotReady; describe node says the CNI plugin is not ready). I also usually received an error in etcd during installation. Do I have any other options (like other supported plugins)?
I checked that the firewall is disabled on all servers. I am not sure if I should talk to the network department about the disconnections or a potential firewall on their side. I have struggled with this for a long time and appreciate any suggestions :)
Thanks. I think that page is for Weave, but I'm using Calico (since I cannot even install Weave). How can I uninstall Calico from Kubernetes? I can only find tutorials for uninstalling Calico with kubectl, not with ansible-playbook.
Organization Name:
Short summary about the issue/question: Cannot access log-manager (through the "Browse log folder" button; error 502 Bad Gateway after about 1-5 minutes of page loading) and stdout/stderr (only showing "Loading..." forever). Found that the "log-manager-xxx" pods are running correctly. The logs look like: x.x.x.x - [17/Aug/2020:03:20:18 +0000] "GET /healthz HTTP/1.1" 200 18 "-" "kube-probe/1.15". All other functionality seems to work correctly (job submission / running / ...).
Brief what process you are following:
How to reproduce it: not sure (but I reinstalled OpenPAI after kubeadm reset on all machines, and the problem was still there)
OpenPAI Environment:
- OS (uname -a): 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Anything else we need to know: The firewall is off on all nodes