smart-edge-open / converged-edge-experience-kits

Source code for experience kits with Ansible-based deployment.
Apache License 2.0

Issues with ovs on dell servers #29

Closed archie95 closed 4 years ago

archie95 commented 4 years ago

I was trying to install OpenNESS 20.06, which was released recently, on two different kinds of servers. There are two setups, one with server 1 as the node and the other with server 2; in both cases the controller is a VM. On server 1 the installation is successful, all the bridges get created, and everything works properly. The second server (Dell PowerEdge 640) behaves a little strangely: the installation completes, but the bridges are not created or get destroyed afterwards. The OVS bridge is not stable on this machine, and most of the time it is unable to find db.sock, which blocks pod creation. We have faced similar problems with previous versions as well. The server details are below.

1st node (pizza-box server):
Manufacturer: Boston Supermicro server
Product name: SYS-6019P-MIT
BIOS revision: 5.14
Vendor: American Megatrends Inc.
OS: CentOS 7.6.1810
HDD: 256 GB
RAM: 64 GB

2nd node (Dell PowerEdge 640 series):
Dell server
Intel Xeon Gold 6226 processor, 2.7 GHz, 12 cores / 24 threads, 19.25 MB cache, 3.7 GHz max turbo frequency
64 GB RAM, 1 TB HDD
Quad-port 10G NICs
OS: CentOS 7.6.1810

Could someone please help? If more details or logs are required, let me know. Thanks

amr-mokhtar commented 4 years ago

Hi @archie95, can you list all the pods running in the cluster and share the logs of the failing pods?

$ kubectl get pods -A -o wide
$ kubectl describe pod <failing-pod-name> -n <namespace>
$ kubectl logs <failing-pod-name> -n <namespace>
mehashu commented 4 years ago

Following are the screenshots from our setup which you asked for:

Image 1 shows the output of the command "ovs-vsctl show" on the worker node (Supermicro server) where the OpenNESS setup is working fine.

[Image 1: Server 1 results]

Image 2 shows the output of the above command on the Dell server.

[Image 2: server 2 bridges]

Image 3 shows the output of "kubectl get pods -A -o wide"

[Image 3: image001, image002, image003]

Image 4 shows the logs for the ovs-ovn pod.

[Image 4: image004]

Image 5 shows the logs for the kube-ovn-cni pod.

[Image 5: image005]

archie95 commented 4 years ago

@amr-mokhtar please find above the logs and details from our setup; @mehashu has uploaded them on my behalf. As you can see, OVS is not running on the node. Even after starting it manually, it stays alive only for a very short period.

aniket-wipro commented 4 years ago

@amr-mokhtar We were able to bypass the issue on the Dell servers by disabling kubeovn with DPDK in group_vars/all/10-default.yml. While debugging the issue in OVS, we found some discrepancies in the logs. We suspect the issue is related to DPDK socket memory allocation. Please find attached the ovs-vswitchd.log and the error line.
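For context, the workaround is roughly a one-line change in the kit's defaults. A minimal sketch is below; the flag name kubeovn_dpdk is an assumption inferred from the other kubeovn_dpdk_* settings in the kit, not a verified excerpt:

# group_vars/all/10-default.yml (sketch, assumed variable name)
kubeovn_dpdk: false   # fall back to kernel-space OVS instead of OVS-DPDK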

[Screenshot 2020-07-08 at 9:51:51 PM]
i-kwilk commented 4 years ago

Hello,

This issue might be caused by incorrect memory settings.

I can see from the logs that your server has 2 NUMA nodes: 2020-07-05T12:08:20.649Z|00015|dpdk|INFO|EAL : Detected 2 NUMA nodes

I am wondering if a configuration in https://github.com/open-ness/openness-experience-kits/blob/master/group_vars/all/10-default.yml#L104 takes this into account.

It should be something like: kubeovn_dpdk_socket_mem: "1024,1024"
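As a sketch of how that might look in group_vars/all/10-default.yml, assuming the usual DPDK socket-mem semantics of one megabyte value per detected NUMA node:

# sketch: DPDK hugepage memory (MB) pre-allocated per NUMA node
kubeovn_dpdk_socket_mem: "1024,1024"   # node 0, node 1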

Can you please try with the memory set for the second NUMA node?

Thanks, Krzysztof

archie95 commented 4 years ago

Hi Krzysztof,

We tried setting the configuration you suggested, but we are still facing the same issue. Please find the ovs-vswitchd.log attached.

[image: ovs-vswitchd.log]

Regards, Archit

i-kwilk commented 4 years ago

Hi Archit,

Can we ask for all of your configuration files? There is a risk that there is not enough memory for hugepages; I would recommend setting 2 GB for each of the two NUMA nodes, i.e. 2048,2048 instead of 1024,1024.
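In other words, something along these lines (a sketch, assuming the node's hugepage pool is large enough to back 2 GB per socket):

# sketch: 2 GB of DPDK socket memory on each of the two NUMA nodes
kubeovn_dpdk_socket_mem: "2048,2048"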

Thanks, Krzysztof

archie95 commented 4 years ago

Hi Krzysztof,

We tried setting the parameters as you recommended, but we are getting timeout errors because the ovs-ovn pods stay pending. I am attaching the configuration files and logs for your perusal.

Regards, Archit

10-default_controller.txt 10-default_edgenode.txt 10-default_all.txt ansible.logs.txt

i-kwilk commented 4 years ago

Hi Archit,

Could you make some changes to your configuration and check if that helps? Overall, the configuration should look like this:

kubeovn_dpdk_socket_mem: "1024,1024"
kubeovn_dpdk_pmd_cpu_mask: "0x4"
kubeovn_dpdk_lcore_mask: "0x2"
kubeovn_dpdk_hugepage_size: "1Gi"
kubeovn_dpdk_hugepages: "4Gi"
kubeovn_dpdk_resources_requests: "2Gi"
kubeovn_dpdk_resources_limits: "2Gi"
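For anyone landing on this issue later, an annotated reading of those values as they would sit in group_vars/all/10-default.yml; the per-line comments are our interpretation of the knobs, not official documentation:

kubeovn_dpdk_socket_mem: "1024,1024"      # DPDK hugepage memory (MB) reserved per NUMA node
kubeovn_dpdk_pmd_cpu_mask: "0x4"          # CPU mask for the OVS-DPDK poll-mode-driver threads
kubeovn_dpdk_lcore_mask: "0x2"            # CPU mask for the OVS-DPDK lcore (control) threads
kubeovn_dpdk_hugepage_size: "1Gi"         # hugepage size used by the ovs-ovn pod
kubeovn_dpdk_hugepages: "4Gi"             # total hugepage memory requested for the ovs-ovn pod
kubeovn_dpdk_resources_requests: "2Gi"    # memory request for the ovs-ovn container
kubeovn_dpdk_resources_limits: "2Gi"      # memory limit for the ovs-ovn container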

Thanks, Krzysztof

archi-wipro commented 4 years ago

Hi Krzysztof,

Thank you so much for your help. The bridges are getting created now; the issue was indeed with the hugepage configuration. Using the above configuration we were able to install OpenNESS 20.06 on the Dell servers. Please go ahead and close this issue.

Thanks and Regards, Archit