microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Error of job's log-manager and stdout/stderr #4816

Open zlf0625 opened 4 years ago

zlf0625 commented 4 years ago

Organization Name:

Short summary about the issue/question: Cannot access log-manager or stdout/stderr. The Browse log folder button returns error 502 Bad Gateway after about 1-5 minutes of page loading, and stdout/stderr only show "Loading..." forever. The "log-manager-xxx" pods are running correctly; their logs look like x.x.x.x - [17/Aug/2020:03:20:18 +0000] "GET /healthz HTTP/1.1" 200 18 "-" "kube-probe/1.15". All other functionality seems to work correctly (job submission / running / ...).

Brief what process you are following:

How to reproduce it: not sure (but I reinstalled (after kubeadm reset on all machines) the OpenPAI, and this problem was still there)

OpenPAI Environment:

Anything else we need to know: The firewall is off on all nodes

Binyang2014 commented 4 years ago

Can you access http://work_node_ip:9103/log-manager/ and http://work_node_ip:9103/healthz/ in your browser?

zlf0625 commented 4 years ago

Yes, if I check my master node, http://work_node_ip:9103/log-manager/ shows an empty directory. I have never seen the link with port 9103 before. However, the link from the Browse log folder button is like http://MASTER-NODE-IP/log-manager/WORKER-NODE-IP:9103/MY-USER/... (without :9103, so error 502 happened).

After your reply, I just found that workers' log-manager is not empty and that the link http://WORKER-NODE-IP:9103/log-manager/MY-USER/... does work (substituting MASTER-NODE-IP with WORKER-NODE-IP:9103)...

Why did this happen? How can I fix this?

Binyang2014 commented 4 years ago

Actually, pylon will redirect http://MASTER-NODE-IP/log-manager/WORKER-NODE-IP:9103/MY-USER/... to http://WORKER-NODE-IP:9103/log-manager/MY-USER/... See https://github.com/microsoft/pai/blob/e116f7722fc549ac228bde7620e411a62a9204ce/src/pylon/deploy/pylon-config/location.conf.template#L91-L95

Please make sure the pylon is working
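The URL mapping described above can be sketched as a small shell helper. This is only an illustration based on the two URL shapes quoted in this thread, not the actual pylon configuration; the function name `pylon_to_direct` and the example addresses are hypothetical.

```shell
# Hypothetical helper (not part of OpenPAI): rewrite a pylon-style log URL
#   http://MASTER/log-manager/WORKER:PORT/path...
# into the direct worker URL
#   http://WORKER:PORT/log-manager/path...
pylon_to_direct() {
  printf '%s\n' "$1" |
    sed -E 's#^http://[^/]+/log-manager/([^/]+:[0-9]+)/(.*)$#http://\1/log-manager/\2#'
}

# Example with made-up addresses (10.0.0.1 = master, 10.0.0.2 = worker).
pylon_to_direct "http://10.0.0.1/log-manager/10.0.0.2:9103/alice/job1/"
```

If the rewritten URL loads in a browser while the original one returns 502, the log-manager itself is fine and the problem is on the path between pylon and the worker.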

zlf0625 commented 4 years ago

Thanks for the clarification. The pod pylon-ds-.... seems to be running normally.

http://WORKER-NODE-IP:9103/log-manager/MY-USER/... (redirected) is working. Also, http://MASTER-NODE:9103/healthz/ returns "Log manager ready." and http://MASTER-NODE:9103/ shows Welcome to OpenResty.

Is there anything more I can check (why redirect is not working)?

fanyangCS commented 4 years ago

@Binyang2014, is it k8s CNI issue?

Binyang2014 commented 4 years ago

@zlf0625 Can you docker exec into the pylon container and run curl http://worker_ip:9103? You may need to install curl in the pylon container first.

If this command fails, you can try to reconfigure the pylon DaemonSet to use hostNetwork. You can change this in the k8s dashboard.

And please make sure you can access http://worker_ip:9103 from your master node.

zlf0625 commented 4 years ago

I tried docker exec into the pylon container (on the master node) and got curl: (7) Failed to connect to XXX port 9103: Connection timed out (worker IPs). But I can directly connect to the same worker address (with port 9103) (outside the container).
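The comparison made here (the same URL tried from inside the pylon container and from the master host) can be wrapped in a small probe. This is a rough sketch: the function names are hypothetical, WORKER_IP is a placeholder, and it assumes curl is available wherever it runs. The exit codes 7 and 28 are curl's documented codes for a failed connection and a timeout.

```shell
# Hypothetical helper: turn a curl exit status into a short diagnosis.
explain_curl_status() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connect failed: host or port not reachable from here" ;;
    28) echo "timed out: packets are probably being dropped on the way" ;;
    *)  echo "other curl error (exit status $1)" ;;
  esac
}

# Probe one URL; --connect-timeout bounds the TCP connect phase in seconds.
probe() {
  if curl --silent --output /dev/null --connect-timeout 5 "$1"; then
    explain_curl_status 0
  else
    explain_curl_status "$?"
  fi
}

# The thread's error "curl: (7) Failed to connect" corresponds to status 7.
explain_curl_status 7

# Usage (replace WORKER_IP with a real worker address):
#   on the master host:          probe "http://WORKER_IP:9103/healthz"
#   inside the pylon container:  run the same call after docker exec
```

A "reachable" result on the host combined with a failure inside the container points at the pod network rather than at log-manager.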

Binyang2014 commented 4 years ago

Can you try changing the pylon-ds to use hostNetwork? Using hostNetwork may solve this issue. And which CNI do you use, Weave or Calico? Please run kubectl get ds -n kube-system and make sure you can find the weave or calico CNI. All the weave/calico related pods should be in ready status. Also, please run kubectl get deployment -n kube-system and make sure you can find the coredns service.

zlf0625 commented 4 years ago

I paste the results here first for reference: Running kubectl get ds -n kube-system gets:

NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
calico-node                        5         5         5       5            5           <none>                        5d2h
k8s-host-device-plugin-daemonset   5         5         5       5            5           <none>                        4d
kube-proxy                         5         5         5       5            5           beta.kubernetes.io/os=linux   5d2h
nvidia-device-plugin-daemonset     5         5         5       5            5           <none>                        4d

Running kubectl get deployment -n kube-system gets:

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
calico-kube-controllers        1/1     1            1           5d2h
coredns                        1/1     1            1           4d2h
dns-autoscaler                 1/1     1            1           5d2h
kubernetes-dashboard-openpai   1/1     1            1           4d

I think I am using the default value (Calico), and coredns is running correctly.

Binyang2014 commented 4 years ago

Can you run kubectl edit ds pylon-ds and then add hostNetwork: true under spec.template.spec? It would look like this:

spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: pylon
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: pylon
      name: pylon
    spec:
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pai-master
                operator: In
                values:
                - "true"
      containers:
      - command:
        - /bin/bash
        - /pylon-config/run.sh

I am not sure about the reason for this problem, but I hope using hostNetwork can mitigate it.
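For anyone who prefers a non-interactive change over kubectl edit, the same field can be set with a merge patch. This is a sketch rather than an official OpenPAI procedure: the variable and function names are hypothetical, and it assumes pylon-ds lives in the namespace your kubectl context currently points at, as in the command above.

```shell
# JSON merge patch that sets spec.template.spec.hostNetwork to true.
HOSTNET_PATCH='{"spec":{"template":{"spec":{"hostNetwork":true}}}}'

# Hypothetical wrapper: apply the patch to the DaemonSet named in $1.
enable_host_network() {
  kubectl patch daemonset "$1" --type merge -p "$HOSTNET_PATCH"
}

# Usage:
#   enable_host_network pylon-ds
# Then confirm the field was set:
#   kubectl get ds pylon-ds -o jsonpath='{.spec.template.spec.hostNetwork}'
```

Changing the pod template causes the DaemonSet to recreate its pods, so the portal may be briefly unavailable while pylon restarts.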

zlf0625 commented 4 years ago

Wow! Thanks, this fixed the problem. Do you have any idea why this helps?

Just before your comment, I configured the k8s dashboard and modified there, but that editing (in very weird JSON format) didn't work. Your latest comment is very helpful.

Binyang2014 commented 4 years ago

I think it may be caused by the network plugin (k8s CNI), but I'm not sure about this. If a network plugin is enabled, the iptables rules of the host are modified during installation, and the inbound/outbound traffic of the container is controlled by the CNI. I'm not sure about Calico, but we encountered this issue when using Weave: some peers did not discover the others, which caused a network partition.

Changing to hostNetwork forces the container to bypass the CNI and go through the host network instead.

If you want to dig into the problem, you may need to check the iptables rules of the node and the status of the Calico daemon.

fanyangCS commented 4 years ago

For diagnosing the k8s CNI issue, please refer to the community support of the corresponding k8s CNI (Calico in this case). I suspect this has something to do with your host environment, e.g., firewall configuration.

zlf0625 commented 4 years ago

I checked that sudo ufw status returns inactive on the master and all workers.

I am not familiar with k8s, and just want to make sure every PAI component is running normally. Also, if I use host network, will it cause other potential problems?

Binyang2014 commented 4 years ago

For PAI, using hostNetwork will not cause functional problems. But it breaks network isolation and may cause some security issues. You can refer to https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces for details.

For the CNI issue, you can go to the Calico project and ask for help: https://github.com/projectcalico/calico. I think it's not limited to PAI. Every pod running in k8s will suffer from this issue.

zlf0625 commented 4 years ago

Thanks! How do I find out where the problem is? I think all calico pods seem to be running normally:

kube-system   calico-kube-controllers-7847896956-8w2zt        1/1     Running   0          5d19h
kube-system   calico-node-49nx6                               1/1     Running   0          5d19h
kube-system   calico-node-9wv9w                               1/1     Running   0          5d19h
kube-system   calico-node-hq8q7                               1/1     Running   0          5d19h
kube-system   calico-node-lp98d                               1/1     Running   1          5d19h
kube-system   calico-node-qwzqn                               1/1     Running   0          5d19h
...

So I don't even know what I should look for.
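One possible starting point is to filter a pod listing like the one above for anything that is not fully ready, not Running, or has restarted. The sketch below assumes the six-column layout shown in this thread (namespace, name, ready, status, restarts, age); the function name `flag_suspect_pods` is hypothetical, and a restart by itself does not prove a fault.

```shell
# Hypothetical filter: print names of pods worth a closer look.
# Expects lines shaped like: NAMESPACE NAME READY STATUS RESTARTS AGE
flag_suspect_pods() {
  awk '{
    split($3, ready, "/")            # "1/1" -> ready[1]="1", ready[2]="1"
    if (ready[1] != ready[2] || $4 != "Running" || $5 + 0 > 0) print $2
  }'
}

# Two lines taken from the listing above.
printf '%s\n' \
  'kube-system   calico-node-hq8q7   1/1   Running   0   5d19h' \
  'kube-system   calico-node-lp98d   1/1   Running   1   5d19h' |
  flag_suspect_pods

# Against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | flag_suspect_pods
```

For the listing in this thread, only calico-node-lp98d is flagged, because it has one restart; kubectl logs and kubectl describe pod on that pod would be the next things to read.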

Binyang2014 commented 4 years ago

Here is a troubleshooting guide for Calico: https://docs.projectcalico.org/maintenance/troubleshoot/troubleshooting Another way to find a solution is to go to https://github.com/projectcalico/calico/issues and check whether similar issues exist.

You can also open an issue in the Calico project; maybe someone there will help you solve the problem.

zlf0625 commented 4 years ago

Hi, I tried to reinstall the entire cluster (K8s + OpenPAI), but it still doesn't work (still "Loading"). The main reason I reinstalled is that all of our machines lose their Internet connection after several hours (sometimes they disconnect at the same time).

All these machines are connected to one switch. After a disconnection, I can still reach the affected machines (those cut off from the Internet or from my dev-box) by hopping from my dev-box through machines that are still reachable. I wonder if this is caused by the network plugin?

I also tried to use Weave, but the installation failed ("Wait for Weave to become available"). I also tried many other plugins, like flannel (successfully installed, but the login button gives no feedback), kube-router (successfully installed, but the portal cannot be opened), and basic "cni" (status = NotReady; describe node says the CNI plugin is not ready). I also found that I usually received an error in etcd during installation. Do I have any other options (like other supported plugins)?

I checked that the firewall is disabled on all servers. I am not sure if I should talk to the network department about the disconnection or potential firewall on their side. I have struggled with this for a long time, and I appreciate any suggestions on this :)

fanyangCS commented 4 years ago

Could you remove the CNI plugin? https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-remove-k8s-network-plugin

zlf0625 commented 4 years ago

Thanks. I think the page is for Weave, but I'm using Calico (since I cannot even install Weave). How can I uninstall Calico from Kubernetes? I can only find a tutorial for uninstalling Calico with kubectl, not with ansible-playbook.