Open — sumitAtDigital opened this issue 1 year ago
Additional logs and comment: in the logs, old foreign cluster names and tenant namespaces are still being printed, even though these resources no longer exist.
```
E0904 16:25:28.300569 1 controller.go:324] "Reconciler error" err="namespaces \"liqo-tenant-adm1-npp1-6eb08f\" not found" controller="resourceoffer" controllerGroup="sharing.liqo.io" controllerKind="ResourceOffer" ResourceOffer="liqo-tenant-adm1-npp1-6eb08f/adm1-npp1" namespace="liqo-tenant-adm1-npp1-6eb08f" name="adm1-npp1" reconcileID="b9180d88-279b-4136-a0e4-4d5f969a6299"
I0904 16:25:28.303742 1 deletion-routine.go:60] Deletion routine started for virtual node liqo-adm1-npp1
E0904 16:25:28.311935 1 deletion-routine.go:105] error removing finalizer: namespaces "liqo-tenant-throbbing-darkness-ec30d7" not found
E0904 16:25:28.325475 1 resourceoffercontroller.go:110] namespaces "liqo-tenant-adm1-npp1-6eb08f" not found
E0904 16:25:28.325527 1 controller.go:324] "Reconciler error" err="namespaces \"liqo-tenant-adm1-npp1-6eb08f\" not found" controller="resourceoffer" controllerGroup="sharing.liqo.io" controllerKind="ResourceOffer" ResourceOffer="liqo-tenant-adm1-npp1-6eb08f/adm1-npp1" namespace="liqo-tenant-adm1-npp1-6eb08f" name="adm1-npp1" reconcileID="dd3d8afe-eac2-4324-b1d6-2cd40c29cc6c"
I0904 16:25:28.349832 1 controller.go:219] "Starting workers" controller="shadowpod" controllerGroup="virtualkubelet.liqo.io" controllerKind="ShadowPod" worker count=10
I0904 16:25:28.349886 1 controller.go:219] "Starting workers" controller="namespacemap" controllerGroup="virtualkubelet.liqo.io" controllerKind="NamespaceMap" worker count=1
I0904 16:25:28.349936 1 controller.go:219] "Starting workers" controller="namespaceoffloading" controllerGroup="offloading.liqo.io" controllerKind="NamespaceOffloading" worker count=1
I0904 16:25:28.350865 1 controller.go:219] "Starting workers" controller="shadowendpointslice" controllerGroup="virtualkubelet.liqo.io" controllerKind="ShadowEndpointSlice" worker count=10
I0904 16:25:28.352241 1 deletion-routine.go:60] Deletion routine started for virtual node liqo-adm1-npp1
E0904 16:25:28.357533 1 deletion-routine.go:105] error removing finalizer: namespaces "liqo-tenant-throbbing-darkness-ec30d7" not found
I0904 16:25:28.358565 1 namespaceoffloading_controller.go:96] NamespaceOffloading "adm2/offloading" status correctly updated
I0904 16:25:28.358645 1 controller.go:114] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="namespaceoffloading" controllerGroup="offloading.liqo.io" controllerKind="NamespaceOffloading" NamespaceOffloading="adm2/offloading" namespace="adm2" name="offloading" reconcileID="98c8602a-1a05-464f-9d78-8faa428781e5"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
```
Hi @sumitAtDigital thanks for your issue. I've just tried your configuration (k8s version + liqo version) on KinD and it works. Can you provide information about how to replicate the problem? For example the list of commands and other details.
Unfortunately, I don't have experience with PKS/TKGI and at the moment we don't have an infrastructure which supports it. @aleoli do you have any experience?
Hi @cheina97, thanks for the reply. This was working up to version 0.8.3.
Cluster in-band peering is working fine with v0.9.3. I deleted the previously peered foreign clusters to test the updated Liqo version, and changed the cluster name in the values file to make it shorter.
Commands:
```
liqoctl offload namespace adm2 --namespace-mapping-strategy EnforceSameName --pod-offloading-strategy LocalAndRemote
liqoctl offload namespace adm2 --namespace-mapping-strategy EnforceSameName --pod-offloading-strategy LocalAndRemote --selector "kubernetes.io/hostname=liqo-adm1-npp1"
```

Same results: pods are offloaded, but the Liqo controller pod is throwing the error.
But whenever I unoffload the namespace, the controller pod comes back to normal with the logs below:
```
E0904 17:33:30.021742 1 deletion-routine.go:105] error removing finalizer: namespaces "liqo-tenant-throbbing-darkness-ec30d7" not found
I0904 17:33:30.035973 1 deletion-routine.go:60] Deletion routine started for virtual node liqo-adm1-npp1
E0904 17:33:30.042652 1 deletion-routine.go:105] error removing finalizer: namespaces "liqo-tenant-throbbing-darkness-ec30d7" not found
I0904 17:33:32.582737 1 deletion-routine.go:60] Deletion routine started for virtual node liqo-adm1-npp1
E0904 17:33:32.590952 1 deletion-routine.go:105] error removing finalizer: namespaces "liqo-tenant-throbbing-darkness-ec30d7" not found
```
In addition to that, we are intermittently getting the warning below, and in this case pods are not offloaded to the required cluster:

```
> liqoctl offload namespace adm2 --namespace-mapping-strategy EnforceSameName --pod-offloading-strategy LocalAndRemote
 INFO  Offloading of namespace "adm2" correctly enabled
 WARN  Offloading completed, but no cluster was selected

> liqoctl offload namespace adm2 --namespace-mapping-strategy EnforceSameName --pod-offloading-strategy LocalAndRemote --selector "kubernetes.io/hostname=liqo-adm1-npp1"
 INFO  Offloading of namespace "adm2" correctly enabled
 WARN  Offloading completed, but no cluster was selected
```
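For reference, that warning usually indicates that the offloading cluster selector did not match any virtual node. A minimal way to cross-check this is sketched below; the `liqo.io/type=virtual-node` label and the `namespaceoffloadings` resource name are assumptions based on Liqo defaults, and the node name comes from the commands above:

```bash
# Assumption: Liqo virtual nodes carry the label liqo.io/type=virtual-node.
kubectl get nodes -l liqo.io/type=virtual-node --show-labels

# Check the labels on the virtual node referenced by the --selector flag above.
kubectl get node liqo-adm1-npp1 --show-labels

# Inspect the NamespaceOffloading status to see whether any remote cluster was selected.
kubectl get namespaceoffloadings.offloading.liqo.io -n adm2 -o yaml
```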
Hi @sumitAtDigital, we still cannot find a way to replicate your problem. We linked a PR, but we don't think it resolves your issue. Have you experienced these errors with other cloud providers or platforms (e.g. KinD)?
Hi @cheina97, thanks for performing the tests. Let me try to elaborate further.
Below are some observations:
- Tested with `--bidirectional` peering between cluster1 <--> cluster2.
- cluster1 --> cluster2 namespace offloading works as required, while cluster2 --> cluster1 throws a runtime error in the controller pod as soon as we run the namespace offload command.
- Only after adding the toleration below manually to the deployments can I offload pods from cluster2 --> cluster1, but the Liqo controller pod still keeps restarting with the error.

```yaml
spec:
  tolerations:
    - key: virtual-node.liqo.io/not-allowed
      operator: Equal
      value: "true"
      effect: NoExecute
```
- I still believe there is some issue in the controller code: only one foreign cluster exists, so why is the code searching for old/deleted foreign clusters?

```
> kubectl get foreigncluster
NAME        TYPE     OUTGOING PEERING   INCOMING PEERING   NETWORKING    AUTHENTICATION   AGE
adm1-npp1   InBand   Established        Established        Established   Established      42h
```
Please check where these conditions are written in the Go code that throws the runtime error:
```
W0906 08:54:18.253257 1 cache.go:246] foreignclusters.discovery.liqo.io "foreign cluster with ID f4cb997d-0bfc-4c0a-a8c2-dbb7b4c40cdf" not found
W0906 08:54:18.253365 1 cache.go:246] foreignclusters.discovery.liqo.io "foreign cluster with ID 20b965f1-9a3e-4483-984e-83c1624bfc44" not found
W0906 08:54:18.253406 1 cache.go:246] foreignclusters.discovery.liqo.io "foreign cluster with ID 8e604d94-81c0-4a46-8800-86eff01ac149" not found
W0906 08:54:18.253444 1 cache.go:246] foreignclusters.discovery.liqo.io "foreign cluster with ID 24daa06b-48c2-4ade-b54e-d51ef8f9e570" not found
I0906 08:54:18.253478 1 cache.go:86] Cache initialization completed. Found 5 peered clusters
I0906 08:54:18.262198 1 namespaceoffloading_controller.go:96] NamespaceOffloading "adm2/offloading" status correctly updated
I0906 08:54:18.262223 1 controller.go:114] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="namespaceoffloading" controllerGroup="offloading.liqo.io" controllerKind="NamespaceOffloading" NamespaceOffloading="adm2/offloading" namespace="adm2" name="offloading" reconcileID="c73479f5-5e71-4ba1-9018-6a77c9da3384"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x108 pc=0x19a5e42]

goroutine 538 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:115 +0x1e5
```
May I request that you revisit https://github.com/liqotech/liqo/blob/master/pkg/liqo-controller-manager/namespaceoffloading-controller/namespaceoffloading_controller.go, lines 115/116 and a few others, where the runtime error is thrown if the strategy is not matching or not available.
@cheina97: It would be great if you could share more info on the Liqo controller errors/logs, specifically why those lines of code throw runtime errors and why the controller searches for already deleted foreign clusters when only one actually exists.
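As a cross-check for the stale cluster IDs above, the leftover Liqo resources can be enumerated with the same resource types that appear elsewhere in this thread (a hedged sketch; adjust names to your clusters):

```bash
# ForeignClusters that actually exist right now.
kubectl get foreignclusters.discovery.liqo.io

# Tenant namespaces possibly left over from previous peerings.
kubectl get namespaces | grep liqo-tenant

# ResourceOffers / ResourceRequests that may still reference old cluster IDs.
kubectl get resourceoffers.sharing.liqo.io -A
kubectl get resourcerequests -A
```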
Hi @sumitAtDigital, I'm a bit confused about the problem you are having. I don't understand whether you are encountering these problems only after upgrading from v0.8.3 to v0.9.3, or also when you start from a clean cluster and install v0.9.3.
I also don't understand if you are still observing all the errors presented in the issue.
Have you tried restarting the controllers by killing all the Liqo pods?
I suggest operating in this order (a command sketch follows the list):
- Unpeer your clusters with liqoctl. Even if you encounter some errors, continue to the next step.
- Delete all foreignclusters resources on both clusters.
- If the foreignclusters cannot be deleted, check whether there are any resourcerequests or resourceoffers and delete them, then delete the foreignclusters again. If you cannot delete a resource, check whether it has a finalizer and remove it.
- Delete all liqo-tenant namespaces.
- Peer again
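A minimal sketch of those steps as commands, to run on both clusters; placeholders in angle brackets and the exact `liqoctl unpeer` form depend on your setup and Liqo version:

```bash
# 1. Unpeer with liqoctl (exact subcommand depends on peering type and Liqo version);
#    errors at this step can be ignored, continue with the next steps.

# 2. Delete all ForeignCluster resources on both clusters.
kubectl delete foreignclusters.discovery.liqo.io --all

# 3. If they hang, delete leftover ResourceRequests/ResourceOffers,
#    removing finalizers from anything that refuses to go away.
kubectl get resourcerequests -A
kubectl get resourceoffers.sharing.liqo.io -A
kubectl patch resourceoffers.sharing.liqo.io <name> -n <namespace> \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# 4. Delete all liqo-tenant namespaces.
kubectl get namespaces | grep liqo-tenant
kubectl delete namespace <liqo-tenant-namespace>

# 5. Peer again with liqoctl.
```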
@cheina97: Thanks for pointing this out. We have already performed these steps.
- I think there are some resourceoffers pending in your cluster. Please check with `kubectl get resourceoffers.sharing.liqo.io -A`. If there are unwanted resourceoffers, remove their finalizers and delete them.
- If you check #1977 (VirtualNode: namespacemap virtualnode selector), you can see we made some changes in that part of the code.
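If several offers are stuck, the finalizer removal can be scripted; a hedged sketch using the same patch syntax that appears later in this thread (the namespace name is an example taken from the logs above):

```bash
# Strip finalizers from every ResourceOffer in a tenant namespace stuck in Deleting,
# then delete them. Adjust NS to the namespace reported by the get command.
NS=liqo-tenant-adm1-npp1-6eb08f
for ro in $(kubectl get resourceoffers.sharing.liqo.io -n "$NS" -o name); do
  kubectl patch -n "$NS" "$ro" --type=merge -p '{"metadata":{"finalizers":null}}'
done
kubectl delete resourceoffers.sharing.liqo.io -n "$NS" --all --wait=false
```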
@cheina97: This is really helpful, as we were not aware of the residue left after deletion. I ran the command provided and below are the results:
```
kubectl get resourceoffers.sharing.liqo.io -A
NAMESPACE                         NAME                       STATUS     VIRTUALKUBELETSTATUS   LOCAL   AGE
liqo-tenant-adm1-npp1-6eb08f      adm1-npp1                  Accepted   Deleting               false   42d
liqo-tenant-adm1-npp1-6eb08f      adm2-npp1                  Accepted   Deleting               true    42d
liqo-tenant-adm1-npp1-9cf9c5      adm1-npp1                  Accepted   Created                false   3d9h
liqo-tenant-adm1-npp1-9cf9c5      adm2-npp1                  Accepted   Created                true    3d9h
liqo-tenant-adm1-npp1-aa1366      adm2-npp1                  Accepted   Deleting               true    38d
liqo-tenant-adm1-npp1-d514ce      adm2-npp1                  Accepted   Deleting               true    38d
liqo-tenant-autumn-river-b6b769   dsx-admiralty-2-npp1-pks   Accepted   None                   true    45d
```
Tried deleting, but it does not work even with --force:
```
kubectl delete resourceoffers.sharing.liqo.io adm1-npp1 -n liqo-tenant-adm1-npp1-6eb08f --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
resourceoffer.sharing.liqo.io "adm1-npp1" force deleted
```
Is there any other command, or a patch, that can remove the resourceoffers.sharing.liqo.io stuck in Deleting? Please suggest.
Have you removed the finalizer on the resource?
Tried, but no success:
```
kubectl --kubeconfig ./config-file patch resourceoffers.sharing.liqo.io adm1-npp1 -n liqo-tenant-adm1-npp1-6eb08f -p '{"metadata":{"finalizers":null}}' --type=merge
Error from server (NotFound): namespaces "liqo-tenant-adm1-npp1-6eb08f" not found
```
What if you use `kubectl edit`?
Tried deleting it by removing the finalizers and saving, but no success:

```
> kubectl edit resourceoffers.sharing.liqo.io adm1-npp1 -n liqo-tenant-adm1-npp1-6eb08f
error: resourceoffers.sharing.liqo.io "adm1-npp1" could not be found on the server
The edits you made on deleted resources have been saved to "C:\Users\xyz\AppData\Local\Temp\kubectl.exe-edit-1916598350.yaml"
```
```yaml
apiVersion: sharing.liqo.io/v1alpha1
kind: ResourceOffer
metadata:
  creationTimestamp: "2023-08-04T04:24:54Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-08-08T10:38:50Z"
  finalizers:
```
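For completeness, the finalizers still set on a stuck offer can be listed directly (a small sketch using only standard kubectl; names taken from the commands above):

```bash
# Print the finalizers remaining on the stuck ResourceOffer.
kubectl get resourceoffers.sharing.liqo.io adm1-npp1 \
  -n liqo-tenant-adm1-npp1-6eb08f \
  -o jsonpath='{.metadata.finalizers}{"\n"}'
```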
That's really strange. Can you check whether the resourceoffers are being recreated? Use `kubectl get resourceoffers -A -w` to check whether the resources are updating.
It has simply been stuck/hanging for the last 2-3 minutes:
```
kubectl get resourceoffers.sharing.liqo.io -A -w
NAMESPACE                         NAME                       STATUS     VIRTUALKUBELETSTATUS   LOCAL   AGE
liqo-tenant-adm1-npp1-6eb08f      adm1-npp1                  Accepted   Deleting               false   42d
liqo-tenant-adm1-npp1-6eb08f      adm2-npp1                  Accepted   Deleting               true    42d
liqo-tenant-adm1-npp1-9cf9c5      adm1-npp1                  Accepted   Created                false   3d10h
liqo-tenant-adm1-npp1-9cf9c5      adm2-npp1                  Accepted   Created                true    3d10h
liqo-tenant-adm1-npp1-aa1366      adm2-npp1                  Accepted   Deleting               true    38d
liqo-tenant-adm1-npp1-d514ce      adm2-npp1                  Accepted   Deleting               true    38d
liqo-tenant-autumn-river-b6b769   dsx-admiralty-2-npp1-pks   Accepted   None                   true    45d
```
Ok, so it is not strange
Can you get the single resource?
Sure; it hangs without final results:

```
kubectl get resourceoffers.sharing.liqo.io -n liqo-tenant-adm1-npp1-6eb08f -w
NAME        STATUS     VIRTUALKUBELETSTATUS   LOCAL   AGE
adm1-npp1   Accepted   Deleting               false   42d
adm2-npp1   Accepted   Deleting               true    42d
```
@cheina97 : Removed version 0.9.3 and installed 0.9.4, but still the same error:
```
I0920 15:26:26.279404 1 finalizer.go:37] Removing finalizer virtualnode-controller.liqo.io/finalizer from virtual-node liqo-adm1-npp1
I0920 15:26:26.289221 1 namespaceoffloading_controller.go:98] NamespaceOffloading "adm2/offloading" status correctly updated
I0920 15:26:26.289267 1 controller.go:114] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="namespaceoffloading" controllerGroup="offloading.liqo.io" controllerKind="NamespaceOffloading" NamespaceOffloading="adm2/offloading" namespace="adm2" name="offloading" reconcileID="201d8339-9923-45ba-b8db-40b1f5f8da7b"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x108 pc=0x1c17962]
```
Did you restart from a clean cluster?
@cheina97: Yes, it was done from scratch (complete deletion and then a fresh install), but on the same cluster and with the same name. However, we have put our Liqo cluster peering research on hold for now. Thanks for your prompt support; it certainly helped us understand the product.
We will let you know if we resume with Liqo in the future.
What happened: Testing cluster peering and namespace/pod offloading. Namespace offloading results in the liqo-controller-manager pod error below:
```
I0904 16:15:12.246218 1 namespaceoffloading_controller.go:96] NamespaceOffloading "adm2/offloading" status correctly updated
I0904 16:15:12.246253 1 controller.go:114] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="namespaceoffloading" controllerGroup="offloading.liqo.io" controllerKind="NamespaceOffloading" NamespaceOffloading="adm2/offloading" namespace="adm2" name="offloading" reconcileID="1276e086-6b1c-4df6-93a1-524933ad0de5"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x108 pc=0x19a5e42]

goroutine 558 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:115 +0x1e5
```
Environment: Development
Kubernetes version (use `kubectl version`): v1.23.3