omec-project / sdcore-helm-charts

Helm charts used for SD-Core packaging

UPF - Not able to deploy Kubernetes cluster #49

Closed Puneet1726 closed 1 week ago

Puneet1726 commented 3 weeks ago

Hi Team,

I am deploying the UPF to my dev cluster, but the bessd container is not starting properly and the pod stays in PodInitializing status forever.

image

I can see the log from crictl logs on the node.

image

But the pod does not report the correct status. Do I need to configure anything else?

Thanks in advance

gab-arrobo commented 3 weeks ago

How are you deploying the SD-Core? Are you directly using the Helm Charts (e.g., helm install ...)? Are you overriding the values.yaml file(s)? Can you please take a look at issue #34 and see if this is related?

Puneet1726 commented 3 weeks ago

I had a look at issue #34 before raising this issue. This seems to be a different issue. I am deploying using helm with custom values, and the pod is getting stuck in the "PodInitializing" stage. Only the UPF has an issue; the other components are running fine.

I am running on an Ubuntu VM and have enabled host network on the pod.

I can send the values file if required.

gab-arrobo commented 3 weeks ago

> I can send the values file if required.

Yes, please share the values file you are using to override.

Puneet1726 commented 3 weeks ago

values.txt

gab-arrobo commented 3 weeks ago

Can you please share the output of kubectl -n <namespace> describe pod upf-0?

Puneet1726 commented 3 weeks ago

Attached are the pod details: poddetails.txt

gab-arrobo commented 3 weeks ago

I am going to try to reproduce the issue later today/tonight using the values file you provided.

gab-arrobo commented 3 weeks ago

values.txt

BTW, have you tried to deploy the UPF using the default number of resources (as shown below)? That is, does the machine where you are trying to deploy the UPF have at least 10 cores? Also, why do you need to assign 10 cores if only 1 worker is configured in the upf.jsonc file/setting?

-      cpu:     10
-      memory:  10Gi
+      cpu:     2
+      memory:  2Gi

I am going to try to replicate the issue later today/tonight.

gab-arrobo commented 3 weeks ago

Hi @Puneet1726,

I just tried deploying the SD-Core using your values.txt as reference and everything seems to be working fine. As you can see below, all pods get deployed as expected.

image

Here are a few comments for your reference:

diff --git a/sdcore-override.yaml b/sdcore-override.yaml
index 235be86..e779fee 100644
--- a/sdcore-override.yaml
+++ b/sdcore-override.yaml
@@ -1,8 +1,8 @@
 images:
-  repository:  #default docker hub
+  repository: "" #default docker hub
   tags:
-    bess: omec/upf-epc-bess:rel-1.4.1
-    pfcpiface: omec/upf-epc-pfcpiface:rel-1.4.1
+    bess: omecproject/upf-epc-bess:rel-1.4.1
+    pfcpiface: omecproject/upf-epc-pfcpiface:rel-1.4.1
     tools: busybox:stable
   pullPolicy: IfNotPresent
   # Secrets must be manually created in the namespace.
@@ -10,7 +10,7 @@ images:
   #  - name: aether.registry

 nodeSelectors:
-  enabled: true
+  enabled: false
   upf:
     label: kubernetes.io/hostname
     value: k8s-upf
@@ -67,30 +67,30 @@ config:
     # Dynamic IP allocation is not supported yet
     # Custom routes inside UPF
     routes:
-      - to: 10.100.2.75/32
-        via: 10.100.0.1
+      - to: 10.154.48.197
+        via: 169.254.1.1
     enb:
-      subnet: 10.100.0.0/16
+      subnet: 192.168.251.0/24
     access:
       ipam: static
       cniPlugin: macvlan
       # Provide sriov resource name when sriov is enabled
       #resourceName: "intel.com/intel_sriov_vfio"
-      gateway: 10.100.0.1
-      ip: 10.100.2.55/24
+      gateway: 192.168.252.1
+      ip: 192.168.252.3/24
       #mac:
       #vlan:
-      iface: enp1s0
+      iface: ens3
     core:
       ipam: static
       cniPlugin: macvlan
       # Provide sriov resource name when sriov is enabled
       #resourceName: "intel.com/intel_sriov_vfio"
-      gateway: 10.100.0.1
-      ip: 10.100.3.55/24
+      gateway: 192.168.250.1
+      ip: 192.168.250.3/24
       #mac:
       #vlan:
-      iface: enp1s0
+      iface: ens3
     cfgFiles:
       upf.jsonc:
         mode: af_packet

Puneet1726 commented 3 weeks ago

I was just playing around with cpu cores. It was 2G only.

I tried with helm dep up, but it is not working. I am facing the issue only with the UPF; the remaining components are up and running.

Does any kernel module need to be loaded for it to work? Also, are there specific IP ranges we need to use? I use Weave as the network plugin for my Kubernetes cluster.

Can you help me with the basic network configuration required on the node?

Puneet1726 commented 3 weeks ago

I am trying to deploy all the components on a one-node cluster using minikube.

The 5G control plane components deployed successfully, but bess failed with the error below.

helm install -n omec -f values.yaml upf .
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "access-net" namespace: "" from "": no matches for kind "NetworkAttachmentDefinition" in version "k8s.cni.cncf.io/v1" ensure CRDs are installed first, resource mapping not found for name: "core-net" namespace: "" from "": no matches for kind "NetworkAttachmentDefinition" in version "k8s.cni.cncf.io/v1" ensure CRDs are installed first]

I then deployed the Multus CNI plugin, and helm install succeeded after that.

https://github.com/k8snetworkplumbingwg/multus-cni

But the pod is failing while initializing:

QoS Class:       Guaranteed
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  Normal   Scheduled               22s   default-scheduler  Successfully assigned omec/upf-0 to minikube
  Normal   AddedInterface          20s   multus             Add eth0 [10.244.0.133/16] from bridge
  Warning  FailedCreatePodSandBox  19s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0218e8cc68598c5ed4db6f92bb14a6ede19fd7c1e6887792c031bcfcd8718982" network for pod "upf-0": networkPlugin cni failed to set up pod "upf-0_omec" network: plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"0218e8cc68598c5ed4db6f92bb14a6ede19fd7c1e6887792c031bcfcd8718982" Netns:"/proc/97390/ns/net" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=omec;K8S_POD_NAME=upf-0;K8S_POD_INFRA_CONTAINER_ID=0218e8cc68598c5ed4db6f92bb14a6ede19fd7c1e6887792c031bcfcd8718982" Path:"" ERRORED: error configuring pod [omec/upf-0] networking: [omec/upf-0/:access-net]: error adding container to network "access-net": Link not found ': StdinData: {"capabilities":{"portMappings":true},"clusterNetwork":"/host/etc/cni/net.d/1-k8s.conflist","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","runtimeConfig":{"portMappings":[]},"type":"multus-shim"}
  Normal   AddedInterface          18s   multus             Add eth0 [10.244.0.134/16] from bridge

My pod IP address range is 10.244.0.0 and my values are below:

enb:
  subnet: 10.244.0.0/16
access:
  ipam: static
  cniPlugin: macvlan
  # Provide sriov resource name when sriov is enabled
  #resourceName: "intel.com/intel_sriov_vfio"
  gateway: 10.244.0.1
  ip: 10.244.0.100
  #mac:
  #vlan:
  iface: enp1s0
core:
  ipam: static
  cniPlugin: macvlan
  # Provide sriov resource name when sriov is enabled
  #resourceName: "intel.com/intel_sriov_vfio"
  gateway: 10.244.0.1
  ip: 10.244.0.134

Am I missing any configuration? Sorry if I am asking basic questions; I am new to this setup.
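The values above can be sanity-checked offline. The following is a minimal Python sketch (standard library only), assuming the pod CIDR is 10.244.0.0/16 as stated above; the function and variable names are hypothetical and not part of the chart. It flags settings that fall inside the pod network and `ip` values written without a prefix length. Whether such an overlap is the cause of the failure is not established in this thread; for comparison, the override that deployed successfully earlier in the thread uses dedicated ranges with a prefix (192.168.252.3/24 for access, 192.168.250.3/24 for core).

```python
import ipaddress

# Assumption: pod network as reported in the comment above.
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")

# Values copied from the snippet above; the key names are illustrative only.
SETTINGS = {
    "enb.subnet": "10.244.0.0/16",
    "access.gateway": "10.244.0.1",
    "access.ip": "10.244.0.100",
    "core.gateway": "10.244.0.1",
    "core.ip": "10.244.0.134",
}


def overlaps_pod_network(value, pod_cidr=POD_CIDR):
    """Return True if an address or subnet falls inside the pod CIDR."""
    # strict=False accepts host addresses with a prefix, e.g. 192.168.252.3/24.
    network = ipaddress.ip_network(value, strict=False)
    return network.overlaps(pod_cidr)


def missing_prefix(value):
    """Return True if the value has no /prefix-length part."""
    return "/" not in value


def review(settings):
    """Collect human-readable notes for each setting that looks questionable."""
    notes = {}
    for key, value in settings.items():
        found = []
        if overlaps_pod_network(value):
            found.append("inside pod CIDR")
        if key.endswith(".ip") and missing_prefix(value):
            found.append("no prefix length")
        if found:
            notes[key] = found
    return notes


if __name__ == "__main__":
    for key, found in review(SETTINGS).items():
        print(f"{key} = {SETTINGS[key]}: {', '.join(found)}")
```

Note that 10.244.0.134, configured here as the core IP, is also the address Multus reports assigning to eth0 in the events above.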

Puneet1726 commented 3 weeks ago

image

gab-arrobo commented 3 weeks ago

Can you please share the output for kubectl -n <your-namespace> logs upf-0 -c bess-init? Also, I understand that interface enp1s0 exists in your system, correct?
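One way to confirm that the interface exists is a short Python sketch like the one below (standard library only; `missing_interfaces` is a hypothetical helper name). It compares the `iface` names from the values file against the interfaces the host actually has. It has to run where the pod is scheduled, which with minikube presumably means inside the minikube node rather than on the outer host.

```python
import socket


def missing_interfaces(wanted, present=None):
    """Return the names in `wanted` that are not network interfaces on this host.

    `present` can be passed explicitly (useful for testing); by default the
    local interface names are read with socket.if_nameindex().
    """
    if present is None:
        # if_nameindex() returns (index, name) pairs for the local interfaces.
        present = [name for _index, name in socket.if_nameindex()]
    return [name for name in wanted if name not in present]
```

Calling `missing_interfaces(["enp1s0"])` on the node returns an empty list if enp1s0 exists there, and `["enp1s0"]` otherwise.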

gab-arrobo commented 3 weeks ago

> I was just playing around with cpu cores. It was 2G only.
>
> I tried with helm dep up, but it is not working. I am facing the issue only with the UPF; the remaining components are up and running.
>
> Does any kernel module need to be loaded for it to work? Also, are there specific IP ranges we need to use? I use Weave as the network plugin for my Kubernetes cluster.
>
> Can you help me with the basic network configuration required on the node?

Can you use Aether-OnRamp to deploy the Kubernetes cluster (make aether-k8s-install)? That is, follow the instructions from this link all the way to the deploy kubernetes section. After that, try to deploy the SD-Core using helm (as you have been trying to do: helm install ...)

Puneet1726 commented 2 weeks ago

It looks like we do not need to install Multus CNI. However, if I run helm install, it fails on the CRD. I am able to run the init container now, but the pod still shows PodInitializing.

I checked the bessd container log from the node.

Log:
2024-08-26T10:39:46.539919461Z stdout F + bessd -m 0 -f --allow= --grpc_url=0.0.0.0:10514
2024-08-26T10:39:46.676834969Z stdout F I0826 10:39:46.676641 17 main.cc:64] Launching BESS daemon in process mode...
2024-08-26T10:39:46.676850261Z stdout F I0826 10:39:46.676705 17 main.cc:77] bessd v1.0.0-dirty
2024-08-26T10:39:46.679592579Z stdout F I0826 10:39:46.679409 17 bessd.cc:458] Loading plugin (attempt 1): /usr/bin/modules/sequential_update.so
2024-08-26T10:39:46.683026537Z stdout F I0826 10:39:46.682772 17 dpdk.cc:187] Initializing DPDK EAL with options: ["bessd", "--main-lcore", "127", "--lcore", "127@0-15", "--no-shconf", "--legacy-mem", "--no-huge", "-m", "512"]
2024-08-26T10:39:46.693036999Z stdout F EAL: Detected CPU lcores: 16
2024-08-26T10:39:46.693054559Z stdout F EAL: Detected NUMA nodes: 1
2024-08-26T10:39:46.693057689Z stdout F EAL: Detected static linkage of DPDK
2024-08-26T10:39:46.696361322Z stdout F EAL: Selected IOVA mode 'VA'
2024-08-26T10:39:46.696380183Z stdout F EAL: VFIO support initialized
2024-08-26T10:39:46.802555434Z stdout F EAL: Probe PCI driver: net_virtio (1af4:1041) device: 0000:01:00.0 (socket -1)
2024-08-26T10:39:46.802572266Z stdout F eth_virtio_pci_init(): Failed to init PCI device
2024-08-26T10:39:46.802575375Z stdout F EAL: Requested device 0000:01:00.0 cannot be used
2024-08-26T10:39:46.802577346Z stdout F TELEMETRY: No legacy callbacks, legacy socket not created
2024-08-26T10:39:46.802579975Z stdout F Segment 0-0: IOVA:0x100633000, len:4096, virt:0x100633000, socket_id:0

. Assuming a single-node system...
2024-08-26T10:39:47.309223742Z stdout F W0826 10:39:47.309146 17 packet_pool.cc:49] Hugepage is disabled! Creating PlainPacketPool for 262144 packets on node 0
2024-08-26T10:39:47.309227832Z stdout F I0826 10:39:47.309160 17 packet_pool.cc:74] PacketPool0 requests for 262144 packets
2024-08-26T10:39:47.625036882Z stdout F I0826 10:39:47.624863 17 packet_pool.cc:161] PacketPool0 has been created with 262144 packets
2024-08-26T10:39:47.625310462Z stdout F I0826 10:39:47.625192 17 pmd.cc:74] 0 DPDK PMD ports have been recognized:
2024-08-26T10:39:47.625315654Z stdout F I0826 10:39:47.625232 17 vport.cc:320] vport: BESS kernel module is not loaded. Loading...
2024-08-26T10:39:47.626518205Z stdout F sh: 1: insmod: not found
2024-08-26T10:39:47.626596558Z stdout F W0826 10:39:47.626513 17 vport.cc:332] Cannot load kernel module /usr/bin/kmod/bess.ko
2024-08-26T10:39:47.627237405Z stdout F I0826 10:39:47.627128 17 bessctl.cc:1928] Server listening on 0.0.0.0:10514

image image

I have attached the latest values file values-1.txt
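To separate routine startup messages from lines worth a closer look in a log like the one above, a small filter can help. The sketch below is a hypothetical helper in Python (standard library only), assuming the glog-style severity prefix seen in the bessd output (`I`, `W`, `E`, `F` followed by the date) plus a hand-picked keyword list. It only highlights lines; it does not decide whether a warning is harmless, and this thread does not settle that for the messages shown.

```python
import re

# glog-style prefix as seen above, e.g. "W0826 10:39:47.626513 17 vport.cc:332] ..."
GLOG_PREFIX = re.compile(r"\b([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+ ")

# Assumption: keywords chosen by hand from the log above, not an official list.
KEYWORDS = ("not found", "Failed", "cannot be used", "Cannot")


def suspicious_lines(lines):
    """Return log lines with W/E/F severity or containing one of KEYWORDS."""
    hits = []
    for line in lines:
        match = GLOG_PREFIX.search(line)
        severity = match.group(1) if match else None
        if severity in ("W", "E", "F") or any(word in line for word in KEYWORDS):
            hits.append(line)
    return hits


# A few lines taken from the log above.
SAMPLE = [
    "2024-08-26T10:39:46.676834969Z stdout F I0826 10:39:46.676641 17 main.cc:64] Launching BESS daemon in process mode...",
    "2024-08-26T10:39:46.802572266Z stdout F eth_virtio_pci_init(): Failed to init PCI device",
    "2024-08-26T10:39:47.626518205Z stdout F sh: 1: insmod: not found",
    "2024-08-26T10:39:47.626596558Z stdout F W0826 10:39:47.626513 17 vport.cc:332] Cannot load kernel module /usr/bin/kmod/bess.ko",
    "2024-08-26T10:39:47.627237405Z stdout F I0826 10:39:47.627128 17 bessctl.cc:1928] Server listening on 0.0.0.0:10514",
]

if __name__ == "__main__":
    for line in suspicious_lines(SAMPLE):
        print(line)
```

The last line of the log shows the bessd gRPC server listening on 0.0.0.0:10514, so the flagged lines did not stop the daemon from reaching that point.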

gab-arrobo commented 2 weeks ago

Based on the upf's log and describe, it looks like you are making changes to bess and/or pfcp agent (upf) and you are building "local" images, correct? Or are you just using a local "mirror" of the images from DockerHub?

Puneet1726 commented 2 weeks ago

It is the same image from Docker Hub; I have uploaded it to our registry. I am only modifying the values file, and I am using the same tag as Docker Hub. I am not sure what the issue is.

Do you see any issue in the log or is it just a warning?

If I run helm install, it fails saying the CRD is not present, even though there is already a network plugin part in the bess UPF templates. I am using a workaround to bypass the CRD issue.

gab-arrobo commented 2 weeks ago

Can you please provide a git diff between the first values file and the second/last values file you provided? I want to see the difference. As I mentioned before, I deployed the UPF using the initial/first values file you provided by properly adjusting certain parameters to match with my system. Besides that, I do not see any problem. BTW, trying to help in this way will be challenging for me because I do not have the details/specifics of what exactly you are doing. I think it is better to have a live debugging session. Can you join the Slack channel (use this link: https://aetherproject.org/contact-us/)?

Puneet1726 commented 2 weeks ago

Submitted the form

gab-arrobo commented 2 weeks ago

Can you please provide your email address?

Puneet1726 commented 2 weeks ago

Email: naikpuneet@gmail.com

Puneet1726 commented 2 weeks ago

Hi @gab-arrobo, please send the Slack details.

gab-arrobo commented 2 weeks ago

> Hi @gab-arrobo, please send the Slack details.

I was told that an email invite to join Slack was sent to you yesterday. Did you not receive it?

Puneet1726 commented 2 weeks ago

@gab-arrobo: As discussed, I have set up a new one-node cluster and deployed the UPF using helm. The UPF is installed now. Thanks for all the support.

image

gab-arrobo commented 2 weeks ago

> @gab-arrobo: As discussed, I have set up a new one-node cluster and deployed the UPF using helm. The UPF is installed now. Thanks for all the support.
>
> image

BTW, I see you are not using the latest Helm Charts. I strongly recommend using the latest version, as it includes several improvements such as new Docker images and NRF caching enabled in some NFs.

Also, should I close this issue? Or feel free to close it yourself.

Puneet1726 commented 2 weeks ago

@gab-arrobo: I will pull the latest Helm chart and merge it into our repo. I will close this issue once we have performed a sanity test on the cluster.

gab-arrobo commented 1 week ago

@Puneet1726, any update on this?

Puneet1726 commented 1 week ago

We can close this issue.