liqotech / liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies
https://liqo.io
Apache License 2.0

liqoctl unpeer doesn't remove broken peering connection #1710

Open leonardopoggiani opened 1 year ago

leonardopoggiani commented 1 year ago

What happened:

If a peering between two clusters fails for some reason (e.g. a firewall rule prevents communication with the liqo-auth service, or liqo-auth and liqo-gateway are exposed as LoadBalancer services through a layer-2 MetalLB setup without being directly reachable), two custom resources are left behind: a tunnelendpoints.net.liqo.io and a networkconfigs.net.liqo.io. These two resources are not deleted by "liqoctl unpeer", and they cause "liqoctl uninstall" to fail with a network wait timeout error.

What you expected to happen:

I expect liqoctl unpeer to also clean up broken liqo installations, or at least liqoctl uninstall to refuse to start while they exist.

How to reproduce it (as minimally and precisely as possible):

I know the steps I list lead to a broken liqo installation, and they are wrong!

  1. Create the kind clusters using the quick-start example.
  2. Install MetalLB on each cluster with two different IP pools (e.g. 10.96.100.0/24 and 10.104.100.0/24).
  3. Install liqo on each cluster, requiring that the liqo-auth and liqo-gateway services are exposed as LoadBalancer services ( liqoctl install kind --cluster-name rome --set gateway.service.type=LoadBalancer --set auth.service.type=LoadBalancer ).
  4. Try in-band peering ( liqoctl peer in-band --remote-kubeconfig liqo_kubeconf_milan ).
  5. Wait for the peering to fail. Two broken custom resources have now been created (tunnelendpoints.net.liqo.io and networkconfigs.net.liqo.io).
  6. liqoctl unpeer milan reports success, but the resources are still there; a subsequent liqoctl uninstall passes the uninstallation checks and then fails with a timeout.

Anything else we need to know?:

A quick workaround is to delete the resources manually ( kubectl delete tunnelendpoints.net.liqo.io -n <liqo-tenant-namespace> and kubectl delete networkconfigs.net.liqo.io -n <liqo-tenant-namespace> ).
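The workaround can be sketched as a small shell helper. This is only an illustration, not something liqoctl provides: the function name `cleanup_liqo_network` and the `KUBECTL` override variable are hypothetical, and `--all` is an assumption added here because the commands quoted above do not name a specific object.

```shell
# Hypothetical helper: delete the leftover network resources in one tenant namespace.
# Set KUBECTL=echo to print the commands instead of running them (dry run).
cleanup_liqo_network() {
  ns="$1"
  if [ -z "$ns" ]; then
    echo "usage: cleanup_liqo_network <liqo-tenant-namespace>" >&2
    return 1
  fi
  for kind in tunnelendpoints.net.liqo.io networkconfigs.net.liqo.io; do
    # Assumption: --all removes every object of this kind in the namespace.
    "${KUBECTL:-kubectl}" delete "$kind" --all -n "$ns" || return 1
  done
}
```

Running `KUBECTL=echo cleanup_liqo_network <liqo-tenant-namespace>` first shows what would be deleted before touching the cluster.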

I repeat that I know these steps are not the correct way to install and configure liqo, but while doing some trial and error I came across this situation, which seemed anomalous.

liqo-controller logs:

E0313 10:06:38.671955       1 foreign-cluster-controller.go:218] Failed to ensure identity for remote cluster "milan": failed to send identity request: Post "https://10.203.0.3:443/identity/certificate": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0313 10:06:38.674710       1 trace.go:219] Trace[2000615844]: "Reconcile" ForeignCluster:milan (13-Mar-2023 10:06:33.669) (total time: 5004ms):
Trace[2000615844]: ---"ForeignCluster status update" 5004ms (10:06:38.674)
Trace[2000615844]: [5.004700396s] [5.004700396s] END

liqoctl uninstall error:

 ERRO  Error uninstalling Liqo: timed out waiting for the condition      

liqoctl unpeer milan result:

 INFO  Outgoing peering marked as disabled                                                             
 INFO  Successfully disabled outgoing peering to the remote cluster "milan"    

liqoctl peer command error:

 ERRO   (local) Failed establishing networking to the remote cluster "milan": timed out waiting for the condition                                                                                                 

Environment:

Wangxinxinhappy commented 1 year ago

Hi, I think I have almost the same problem as you. First, I peered two clusters successfully. The problem appeared when I tried to unpeer them.

liqoctl unpeer error:

ERRO  Failed disabling outgoing peering to the remote cluster "milan": timed out waiting for the condition

Then I tried these steps (kubectl delete tunnelendpoints.net.liqo.io -n <liqo-tenant-namespace> and kubectl delete networkconfigs.net.liqo.io -n <liqo-tenant-namespace>) and peered the two clusters again.

liqoctl peer error:

 ERRO  Failed activating outgoing peering to the remote cluster "milan": timed out waiting for the condition

But the problem still exists. Do you know how to fix it? Please help me. 😭

cheina97 commented 1 year ago

Hi @Wangxinxinhappy, I suggest removing the foreignclusters resources and the **liqo-tenant-** namespaces (on both sides). If you run into problems deleting the namespaces, check that the resourceoffers* in the tenant namespaces have been deleted (if not, remove their finalizers by hand).
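Removing the finalizers by hand could look roughly like the sketch below. It assumes the resource can be addressed as `resourceoffers`, as written above. The function name `strip_resourceoffer_finalizers` and the `KUBECTL` override are hypothetical and only there for illustration. The JSON merge patch sets `metadata.finalizers` to null, which lets a pending deletion complete.

```shell
# Hypothetical helper: clear the finalizers on every resourceoffer in a tenant namespace.
strip_resourceoffer_finalizers() {
  ns="$1"
  if [ -z "$ns" ]; then
    echo "usage: strip_resourceoffer_finalizers <liqo-tenant-namespace>" >&2
    return 1
  fi
  # "-o name" prints one TYPE/NAME entry per line, which kubectl patch accepts directly.
  "${KUBECTL:-kubectl}" get resourceoffers -n "$ns" -o name |
  while read -r obj; do
    # A JSON merge patch with null removes the finalizers field.
    "${KUBECTL:-kubectl}" patch "$obj" -n "$ns" --type=merge \
      -p '{"metadata":{"finalizers":null}}' </dev/null
  done
}
```

Clearing finalizers bypasses the controller's own cleanup, so it is best kept for objects that are already stuck in deletion.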

Wangxinxinhappy commented 1 year ago

Hi @cheina97, thanks for your reply! I removed the foreignclusters resources and the liqo-tenant- namespaces, and then reinstalled liqo on both sides. But `liqoctl peer out-of-band nameless-brook` still failed:

 INFO  Peering enabled                                                                                                                                           
 INFO  Authenticated to cluster "nameless-brook"                                                                                                                 
 ERRO  Failed activating outgoing peering to the remote cluster "nameless-brook": timed out waiting for the condition

yoctozepto commented 9 months ago

Hi @cheina97 I see this issue is still present in the latest (v0.10.1) Liqo and happens when peering fails for any reason.

In my case though, the uninstall fails trying to remove tunnelendpoints.net.liqo.io. I saw that it runs a liqo-pre-delete job, which tries to clean up everything but fails for this resource kind. I cannot delete this tunnelendpoint myself either: the delete just times out. I also cannot edit it to remove the finalizers:

  finalizers:
  - liqo-gateway.net.liqo.io
  - liqo-route.10.5.0.5.net.liqo.io

kubectl throws back at me:

error: tunnelendpoints.net.liqo.io "misty-thunder-690265" could not be found on the server
The edits you made on deleted resources have been saved to "/tmp/kubectl-edit-1165863843.yaml"

Yet it is very much found and blocking the uninstall:

NAME                   PEERING CLUSTER   BACKEND TYPE   CONNECTION STATUS   AGE
misty-thunder-690265   misty-thunder     wireguard      Connected           2d3h

yoctozepto commented 9 months ago

FWIW, its namespace was already deleted...

yoctozepto commented 9 months ago

Workaround

All right, I recreated the namespace and then I was able to remove the finalizers (and had to also delete the invalid ownerReferences) and delete it. Oh boy, it seems the unpeer + uninstall create some nice confusion together! 😅
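For anyone hitting the same state, the steps described above might be scripted roughly as follows. This is a sketch under assumptions, not the exact commands that were run: the function name `force_delete_tunnelendpoint` and the `KUBECTL` override are hypothetical, and a JSON merge patch is just one way to drop the finalizers and ownerReferences.

```shell
# Hypothetical helper: recreate the tenant namespace if needed, strip the blocking
# metadata from a stuck tunnelendpoint, then delete it.
force_delete_tunnelendpoint() {
  ns="$1"
  name="$2"
  if [ -z "$ns" ] || [ -z "$name" ]; then
    echo "usage: force_delete_tunnelendpoint <namespace> <tunnelendpoint-name>" >&2
    return 1
  fi
  k="${KUBECTL:-kubectl}"
  # Recreate the namespace only when it is missing.
  "$k" get namespace "$ns" >/dev/null 2>&1 || "$k" create namespace "$ns" || return 1
  # Setting both fields to null in a merge patch removes them from the object.
  "$k" patch tunnelendpoints.net.liqo.io "$name" -n "$ns" --type=merge \
    -p '{"metadata":{"finalizers":null,"ownerReferences":null}}' || return 1
  # The object may already vanish once its finalizers are gone, hence --ignore-not-found.
  "$k" delete tunnelendpoints.net.liqo.io "$name" -n "$ns" --ignore-not-found
}
```

With the object from the output above, the call would be `force_delete_tunnelendpoint <liqo-tenant-namespace> misty-thunder-690265`.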

cheina97 commented 9 months ago

Hi @yoctozepto, we know that peering is one of the most problematic parts of Liqo. This year we are working on making Liqo modular, and one of the objectives is to replace the current peering mechanism with a clean, declarative approach.

I'm sorry for your issue and happy you found a solution.

In the coming months we are going to release the new Liqo network, which replaces the current one. It will be independent from the rest of Liqo and will solve problems like this one. Stay tuned.

yoctozepto commented 9 months ago

Thanks for the summary @cheina97 and no need to be sorry! It has worked very well so far, except for these quirks. I have seen these improvements mentioned in various issues while filtering for relevant ones. Do you have a central place where you track these design decisions and the related work? Keeping my fingers crossed and looking forward to seeing this future liqo!

cheina97 commented 9 months ago

Network modularity does not have a public design document on GitHub at the moment. In the future we will certainly share more insight into the design of the next modularity steps.