karmab / kcli

Management tool for virtualization and kubernetes platforms
https://kcli.readthedocs.io/en/latest/
Apache License 2.0

Disconnected clusters force the use of sslip.io for the mirror registry, even though it may not be reachable #670

Closed palonsoro closed 6 months ago

palonsoro commented 6 months ago

In a purely disconnected environment, internet DNS names may not be resolvable. However, disconnected installations force the use of an sslip.io domain for the disconnected registry, so if the environment is truly disconnected and sslip.io is not resolvable, the installation fails.

I tried setting disconnected_url as a workaround but it is ignored, so I cannot specify an alternate hostname that is resolvable in the disconnected environment.

The disconnected environment where I test is just a deployment on top of an isolated libvirt network.

palonsoro commented 6 months ago

To be more concrete: disconnected_url seems to affect the OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE that is used, but not the configured mirroring.

If disconnected_url could simply act as a custom domain name for all the disconnected stuff, even when disconnected_vm is true, then we could configure a custom domain for the registry and avoid sslip.io, which is not available.

karmab commented 6 months ago

While I understand that the sslip DNS servers might not be reachable in a purely isolated context, you would still need a way to resolve the registry FQDN (let's say we use the disconnected_url) to the IP of the registry VM. How are you handling that now?

palonsoro commented 6 months ago

In my environment, I just rely on the libvirt default DNS service, so the VMs are resolvable by their names and domains on the network. However, only the VMs can be resolved by default if the network is isolated:

[root@revali-worker-0 ~]# nslookup revali-disconnected.revali.killua.sce
Server:     192.168.196.142
Address:    192.168.196.142#53

Name:   revali-disconnected.revali.killua.sce
Address: 192.168.196.2

[root@revali-worker-0 ~]# nslookup www.google.es
Server:     192.168.196.142
Address:    192.168.196.142#53

** server can't find www.google.es: REFUSED

[root@revali-worker-0 ~]# nslookup revali-disconnected.revali.killua.sce 192.168.196.1
Server:     192.168.196.1
Address:    192.168.196.1#53

Name:   revali-disconnected.revali.killua.sce
Address: 192.168.196.2

[root@revali-worker-0 ~]# nslookup www.google.es 192.168.196.1
Server:     192.168.196.1
Address:    192.168.196.1#53

** server can't find www.google.es: REFUSED

(Above I show the default resolution, which goes via the local CoreDNS, and then the same queries pointed at the .1 IP, which is the IP used by libvirt to provide its DNS service.)

If we were talking about "a more realistic environment", it would be good, IMHO, to offer the alternative of just setting disconnected_url and asking users to pre-create whatever domain name they have specified on their own DNS servers.

An alternative would be to run a local sslip.io service on all the VMs in a static pod and have the local CoreDNS pods forward sslip.io domains to it, but this may be overkill.

BTW, in case you are wondering how I managed to install the cluster: I forced libvirt to forward sslip.io requests to the authoritative DNS servers of sslip.io with something like this:

<dns>
  <forwarder domain='sslip.io' addr='52.0.56.137'/>
  (...)
</dns>

However, this means the installation is not 100% disconnected, as I need to reach the internet (even if only through the local libvirt DNS). This is why I'd prefer to have some option to just rely on the DNS names that libvirt is already providing me.
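
To illustrate the idea of relying on names that the local DNS already serves, here is a minimal Python sketch of a pre-flight check that a chosen registry FQDN resolves from inside the isolated network. The function name `resolve_registry_name` and the injectable `resolver` argument are hypothetical and only there for illustration; this is not something kcli does.

```python
import socket
from typing import Callable, Optional


def resolve_registry_name(
    name: str,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> Optional[str]:
    """Return the first address the name resolves to, or None if it does not resolve.

    `resolver` defaults to the system resolver (socket.getaddrinfo); it is
    injectable only so the check can be exercised without real DNS.
    """
    try:
        infos = resolver(name, None)
    except socket.gaierror:
        # Covers failed lookups such as NXDOMAIN or REFUSED answers.
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return infos[0][4][0] if infos else None
```

Run on a node, `resolve_registry_name("revali-disconnected.revali.killua.sce")` would be expected to return the registry VM address in an environment like the one above, while an internet name would yield `None`.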

palonsoro commented 6 months ago

Another possible alternative (not sure if it is feasible, I need to check): it might be possible to emulate the sslip.io behavior by using the CoreDNS template plugin: https://coredns.io/plugins/template/

This example seems close to what sslip.io would do, if I am interpreting it correctly: https://coredns.io/plugins/template/#resolve-aptr-for-example
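
For reference, the mapping that would need emulating is simple: an sslip.io name embeds an IPv4 address (with dashes or dots) and resolves to that address. A rough Python sketch of that mapping follows; the function name `sslip_to_ip` and the regular expression are my own illustration, not code from sslip.io or CoreDNS, and IPv6 forms are deliberately ignored.

```python
import ipaddress
import re
from typing import Optional

# An IPv4 address written with dashes or dots, optionally preceded by other
# labels, directly followed by the sslip.io suffix.
_SSLIP_RE = re.compile(
    r"(?:^|[.-])(\d{1,3})[.-](\d{1,3})[.-](\d{1,3})[.-](\d{1,3})\.sslip\.io\.?$"
)


def sslip_to_ip(name: str) -> Optional[str]:
    """Return the IPv4 address embedded in an sslip.io-style name, or None."""
    match = _SSLIP_RE.search(name.lower())
    if not match:
        return None
    candidate = ".".join(match.groups())
    try:
        # Rejects out-of-range octets such as 999.
        return str(ipaddress.IPv4Address(candidate))
    except ipaddress.AddressValueError:
        return None
```

With this sketch, `sslip_to_ip("192-168-196-13.sslip.io")` gives `"192.168.196.13"`, and a name without an embedded address gives `None`.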

karmab commented 6 months ago

https://github.com/karmab/kcli/commit/5831487033bb0aa40a3ec25530b1969b6a621b11 introduces a dedicated variable named disconnected_vm_name so that you can set the registry fqdn in the disconnected vm to something of your liking.

Making DNS tweaks so that this is resolvable on the node isn't an option, since spawning a (static) coredns container on the node requires the corresponding image to be fetchable (from the disconnected registry).

palonsoro commented 6 months ago

Ok. Understood. I fell into a chicken-and-egg trap.

Now testing...

palonsoro commented 6 months ago

Using the disconnected_vm_name throws errors such as this:

time="2024-04-26T12:01:02-04:00" level=fatal msg="copying image 1/3 from manifest list: trying to reuse blob sha256:3425ef7be5c37050da972ee55ec09abea22f205347baad4898949663c72d8686 at destination: pinging container registry 192-168-196-13.sslip.io:5000: Get \"https://192-168-196-13.sslip.io:5000/v2/\": tls: failed to verify certificate: x509: certificate is valid for revali-disconnected.revali.killua.sce, not 192-168-196-13.sslip.io"

So it looks as if it is not replaced in all the places where it should be.

palonsoro commented 6 months ago

Maybe it is missing here: https://github.com/karmab/kcli/blob/5831487033bb0aa40a3ec25530b1969b6a621b11/kvirt/cluster/openshift/disconnected/bin/sync_image.sh#L2

Which seems to be what synchronizes some kcli-specific images: https://github.com/karmab/kcli/blob/5831487033bb0aa40a3ec25530b1969b6a621b11/kvirt/cluster/openshift/disconnected/scripts/04_extras.sh

And that would explain why the installation fails due to the keepalived image used by kcli not being pullable :-)
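
A quick way to look for other places like this is to scan the rendered scripts for leftover sslip.io references. The Python sketch below does that over a piece of text; the helper name `find_sslip_refs` is hypothetical, and any input you feed it is up to you (nothing here reproduces the actual contents of the kcli scripts).

```python
import re
from typing import List, Tuple

# Any run of hostname characters ending in "sslip.io".
_SSLIP_REF = re.compile(r"[\w.-]*sslip\.io", re.IGNORECASE)


def find_sslip_refs(text: str) -> List[Tuple[int, str]]:
    """Return (line number, matched name) for every sslip.io reference in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _SSLIP_REF.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Any hit in a script rendered with disconnected_vm_name set would be a candidate for the kind of certificate mismatch shown in the error above.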

karmab commented 6 months ago

good catch! changed now