openconfig / kne


Problems initializing SRLinux #494

Closed: yennym3 closed this issue 2 months ago

yennym3 commented 9 months ago

Hi, I'm deploying the topology 2node-srl-ixr6-with-oc-services.pbtxt, but the containers never stay 'ready' and 'running'; they keep restarting constantly.

Deploying topology:

kne create 2node-srl-ixr6-with-oc-services.pbtxt
I0209 10:00:54.369740 3846339 root.go:119] /home/mw/kne/examples/nokia/srlinux-services
I0209 10:00:54.371543 3846339 topo.go:117] Trying in-cluster configuration
I0209 10:00:54.371573 3846339 topo.go:120] Falling back to kubeconfig: "/home/mw/.kube/config"
I0209 10:00:54.374046 3846339 topo.go:253] Adding Link: srl1:e1-1 srl2:e1-1
I0209 10:00:54.374077 3846339 topo.go:291] Adding Node: srl1:NOKIA
I0209 10:00:54.424631 3846339 topo.go:291] Adding Node: srl2:NOKIA
I0209 10:00:54.459290 3846339 topo.go:358] Creating namespace for topology: "2-srl-ixr6"
I0209 10:00:54.484813 3846339 topo.go:368] Server Namespace: &Namespace{ObjectMeta:{2-srl-ixr6    4b34dc30-d2b2-4340-a901-8967fb08c69e 82945402 0 2024-02-09 10:00:54 +0000 UTC <nil> <nil> map[kubernetes.io/metadata.name:2-srl-ixr6] map[] [] [] [{kne Update v1 2024-02-09 10:00:54 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:kubernetes.io/metadata.name":{}}}} }]},Spec:NamespaceSpec{Finalizers:[kubernetes],},Status:NamespaceStatus{Phase:Active,Conditions:[]NamespaceCondition{},},}
I0209 10:00:54.485491 3846339 topo.go:395] Getting topology specs for namespace 2-srl-ixr6
I0209 10:00:54.485510 3846339 topo.go:324] Getting topology specs for node srl1
I0209 10:00:54.485574 3846339 topo.go:324] Getting topology specs for node srl2
I0209 10:00:54.485610 3846339 topo.go:402] Creating topology for meshnet node srl1
I0209 10:00:54.507333 3846339 topo.go:402] Creating topology for meshnet node srl2
I0209 10:00:54.522376 3846339 topo.go:375] Creating Node Pods
I0209 10:00:54.522726 3846339 nokia.go:201] Creating Srlinux node resource srl1
I0209 10:00:54.537059 3846339 nokia.go:206] Created SR Linux node srl1 configmap
I0209 10:00:54.631596 3846339 nokia.go:265] Created Srlinux resource: srl1
I0209 10:00:54.764968 3846339 topo.go:380] Node "srl1" resource created
I0209 10:00:54.765040 3846339 nokia.go:201] Creating Srlinux node resource srl2
I0209 10:00:54.780052 3846339 nokia.go:206] Created SR Linux node srl2 configmap
I0209 10:00:54.910542 3846339 nokia.go:265] Created Srlinux resource: srl2
I0209 10:00:55.028768 3846339 topo.go:380] Node "srl2" resource created
I0209 10:04:15.460792 3846339 topo.go:448] Node "srl1": Status RUNNING

Status of the pods:

k get pods -n 2-srl-ixr6 
NAME   READY   STATUS    RESTARTS     AGE
srl1   0/1     Running   1 (9s ago)   13s
srl2   0/1     Running   1 (9s ago)   13s

k get pods -n 2-srl-ixr6 
NAME   READY   STATUS                  RESTARTS     AGE
srl1   0/1     Init:CrashLoopBackOff   1 (8s ago)   16s
srl2   0/1     Init:CrashLoopBackOff   1 (8s ago)   16s

k get pods -n 2-srl-ixr6 
NAME   READY   STATUS   RESTARTS   AGE
srl1   0/1     Error    2          32s
srl2   0/1     Error    2          32s
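
The event list below comes from describing the failing pod; the standard command for that is:

kubectl describe pod srl1 -n 2-srl-ixr6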

Events for the container srl1:

 Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       6m9s                  default-scheduler  Successfully assigned 2-srl-ixr6/srl1 to k8worker4
  Normal   Killing         5m59s (x2 over 6m5s)  kubelet            Stopping container srl1
  Warning  BackOff         5m56s                 kubelet            Back-off restarting failed container init-srl1 in pod srl1_2-srl-ixr6(600952b1-695d-44c3-95a0-a68ba2f9be5a)
  Normal   SandboxChanged  5m55s (x3 over 6m5s)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          5m50s (x3 over 6m8s)  kubelet            Container image "ghcr.io/srl-labs/init-wait:latest" already present on machine
  Normal   Created         5m50s (x3 over 6m8s)  kubelet            Created container init-srl1
  Normal   Started         5m50s (x3 over 6m7s)  kubelet            Started container init-srl1
  Warning  BackOff         5m50s                 kubelet            Back-off restarting failed container srl1 in pod srl1_2-srl-ixr6(600952b1-695d-44c3-95a0-a68ba2f9be5a)
  Normal   Pulled          5m49s (x3 over 6m6s)  kubelet            Container image "ghcr.io/nokia/srlinux" already present on machine
  Normal   Created         5m48s (x3 over 6m6s)  kubelet            Created container srl1
  Normal   Started         5m48s (x3 over 6m6s)  kubelet            Started container srl1
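
Because the containers keep restarting, the logs of the previous container instances are where the actual failure reason should show up. A sketch of the commands to pull them, using the container names from the events above:

kubectl logs -n 2-srl-ixr6 srl1 -c init-srl1 --previous
kubectl logs -n 2-srl-ixr6 srl1 -c srl1 --previous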

LimeHat commented 9 months ago

You need to have a license for srlinux

yennym3 commented 8 months ago

You need to have a license for srlinux

This documentation, https://learn.srlinux.dev/tutorials/infrastructure/kne/installation/#license, mentions that it is possible to use SR Linux without a license by removing certain fields. I have tried that, but the error I described above still occurs.
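
For reference, a hedged sketch of what that removal looks like in the node definition of the pbtxt (field names follow the KNE topology proto; the exact contents of the example file may differ):

nodes: {
    name: "srl1"
    vendor: NOKIA
    # model: "ixr6e"  # removed: the chassis variant requires a license
    config: {
        image: "ghcr.io/nokia/srlinux:latest"
    }
}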

hellt commented 8 months ago

Without seeing the exact topology you are trying to start, it is not possible to answer any questions.

yennym3 commented 8 months ago

Without seeing the exact topology you are trying to start, it is not possible to answer any questions.

The topology I am testing is exactly the same as the one provided in the example repository: https://github.com/openconfig/kne/blob/main/examples/nokia/srlinux-services/2node-srl-ixr6-with-oc-services.pbtxt

hellt commented 8 months ago

It can't be the same, since you should have removed the ixr6e model from the topology and the OpenConfig models from the config.

yennym3 commented 8 months ago

It can't be the same, since you should have removed the ixr6e model from the topology and the OpenConfig models from the config.

I'm sorry for any confusion. I meant to say that the topology I'm using is based on the example from the repository. I've tested it both with and without the 'ixr6e' model and the OpenConfig models, and I got the same result in both cases.

LimeHat commented 8 months ago

You need to investigate the pod logs to understand the reason; most likely, you need more changes than simply removing the model and the openconfig container. There are a few other things in the config that are not supported on other platforms.

Starting with the default config is your best bet; reusing the configs from the ixr6/10 examples on other platforms is unlikely to give you good results.
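
For illustration, starting with the default config amounts to dropping the startup-config reference from the node's config block, so the container boots with the factory-default SR Linux configuration. A minimal sketch, assuming the config block follows the shape used in the KNE examples:

config: {
    image: "ghcr.io/nokia/srlinux:latest"
    # file: "config.json"  # removed: the node boots with its default config
}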

yennym3 commented 7 months ago

Hi,

The pod-restart error I described above still persists. Looking at the documentation https://learn.srlinux.dev/tutorials/infrastructure/kne/installation/#__tabbed_2_1, the tutorial uses a kind cluster for testing; the restart problem occurs when I deploy the pods on an external cluster created with kubeadm rather than kind.

I have observed the following errors in the srlinux controller logs when the pods are created:

1.7104141231090307e+09  INFO    updating srlinux status {"controller": "srlinux", "controllerGroup": "kne.srlinux.dev", "controllerKind": "Srlinux", "Srlinux": {"name":"srl1","namespace":"2srl-prueba-2"}, "namespace": "2srl-prueba-2", "name": "srl1", "reconcileID": "f0b1efe1-1c56-44a9-a205-6dd38b58f561", "srlinux-status": {"status":"Pending","image":"ghcr.io/nokia/srlinux:latest","startup-config":{}}}
1.7104141231321757e+09  ERROR   failed to update Srlinux status {"controller": "srlinux", "controllerGroup": "kne.srlinux.dev", "controllerKind": "Srlinux", "Srlinux": {"name":"srl1","namespace":"2srl-prueba-2"}, "namespace": "2srl-prueba-2", "name": "srl1", "reconcileID": "f0b1efe1-1c56-44a9-a205-6dd38b58f561", "error": "Operation cannot be fulfilled on srlinuxes.kne.srlinux.dev \"srl1\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/srl-labs/srl-controller/controllers.(*SrlinuxReconciler).updateSrlinuxStatus
        /workspace/controllers/srlinux_controller.go:265
github.com/srl-labs/srl-controller/controllers.(*SrlinuxReconciler).Reconcile
        /workspace/controllers/srlinux_controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234
1.7104141231323476e+09  ERROR   Reconciler error        {"controller": "srlinux", "controllerGroup": "kne.srlinux.dev", "controllerKind": "Srlinux", "Srlinux": {"name":"srl1","namespace":"2srl-prueba-2"}, "namespace": "2srl-prueba-2", "name": "srl1", "reconcileID": "f0b1efe1-1c56-44a9-a205-6dd38b58f561", "error": "Operation cannot be fulfilled on srlinuxes.kne.srlinux.dev \"srl1\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234

I have tested deploying SR Linux on a kind cluster, and this problem does not happen there.

Has anyone had this same problem when not using kind as the cluster, and does anyone know how to solve it?

hellt commented 7 months ago

This error on its own doesn't lead to any issues; the reconciliation should still happen. If you see your pods not coming up, then something else is preventing it, not the reconciliation error. I have seen this error in my clusters as well, but it is transient and goes away.
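
For background, the "object has been modified" message above is Kubernetes' optimistic-concurrency conflict: the controller tried to write a stale copy of the Srlinux resource, and the API server rejected the update. controller-runtime simply requeues the reconcile, which is why the error is transient. A minimal Go sketch of the standard client-go pattern for absorbing such conflicts (illustrative only, not the actual srl-controller code; updateStatus and the closure body are hypothetical placeholders):

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/retry"
)

// updateStatus is a hypothetical stand-in for a controller's status update.
// retry.RetryOnConflict re-runs the closure with backoff whenever the API
// server returns a 409 Conflict ("the object has been modified").
func updateStatus(ctx context.Context) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// 1. Re-fetch the latest copy of the object (fresh resourceVersion).
		// 2. Re-apply the status change to that copy.
		// 3. Attempt the update; a conflict error triggers another retry.
		return nil // placeholder for client.Status().Update(ctx, obj)
	})
}

func main() {
	if err := updateStatus(context.Background()); err != nil {
		fmt.Println("status update failed after retries:", err)
	}
}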