ionos-cloud / cluster-api-provider-proxmox

Cluster API Provider for Proxmox VE (CAPMOX)
Apache License 2.0
194 stars 24 forks source link

Possible wrong Cloud-Init network-config generated file #51

Open ricottatosta opened 10 months ago

ricottatosta commented 10 months ago

What steps did you take and what happened: Sorry I submit this as a bug, but maybe it isn't. When deploying a cluster with Talos as provider for bootstrap and controlplane, Talos' init process finds a cloud-init drive, but then complains about network-config file. Talos' error says: "network-config metadata version=0 is not supported", maybe because it starts with "network:". Is it supported? Reading the manual, it shouldn't.

cloud-init manual

What did you expect to happen: Maybe cluster deployment should generate a network-config file starting without a top level "network:".

Environment:

mcbenjemaa commented 10 months ago

@ricottatosta

As I can see, you're trying to use Talos as a control-plane and bootstrap provider. It's recommended that you submit this issue in talos-control-plane-provider

ricottatosta commented 10 months ago

Thanks for changing my submission with type 'support'. The issue involves cidata.iso created by capmox to pass configuration via cloud-init, and talos vm boot image that uses it at boot time. Or am I supposed to believe that Talos bootstrap and controlplane are involved in creating cidata.iso? Maybe the issue should be submitted in "talos os".

mcbenjemaa commented 10 months ago

Hmmm, CAPMOX Now only creates cidata supported by Debian and Ubuntu distributions. If it's related to supporting Talos Cloud-init, you're right.

I don't know how Talos load the cloud-init iso,

If you know something, I can help you add support for Talos.

ricottatosta commented 10 months ago

As far as I know, Talos expects cloud-init datasource to be in "nocloud" format. In this format, network-config file can't start with "network:". Because of the presence of "network:", Talos doesn't find the top level key "version: 2" and assumes it to be "version: 0". I can do nothing, that's the way you build cidata.iso and depends on how cloud-init expects data to be arranged in distributions like Debian and Ubuntu. You should find a way, either arranging data in a way that is good for all situations or let the user choose what kind of datasource format to produce in cidata.iso. I know this means more effort. But I think Talos and CAPMOX are great together.

ricottatosta commented 10 months ago

A couple of usefull links: Talos NoCloud Cloud-Init NoCloud datasource

mcbenjemaa commented 10 months ago

Will check this out

ricottatosta commented 10 months ago

This is the console output of Talos NoCloud booting image regarding the issue:

[talos] found config disk (cidata) at /dev/sr0
ISO 9660 Extensions: IEEE_P1282
[talos] fetching meta config from: cidata/meta-data
[talos] fetching network config from: cidata/network-config
[talos] fetching machine config from: cidata/user-data
[talos] restarting platform network config {"component": "controller-runtime", "controller": "network.PlatformConfigController", ..., "error": "network-config metadata version=0 is not supported"}
mcbenjemaa commented 10 months ago

I will check this if it works with other netplans distros, and then I will implement it

ricottatosta commented 10 months ago

I'm really looking forward to hearing good news from you.

mcbenjemaa commented 10 months ago

I will test it on ubuntu but can you tested on Talos,

ricottatosta commented 10 months ago

What am I supposed to test? Release 0.1.1? Does it contain any changes in the code involved in the issue?

mcbenjemaa commented 10 months ago

Currently, we support only Netplan-based distros, We will try to add more support for other network-configs.

If someone wants to take effort and add this, it will be great.

ricottatosta commented 10 months ago

OK. It was expected. Unfortunately I'm not good at programming in Go. And the solution I found out is not suitable for being integrated in your code. As my solution is quite simple (just delete a bunch of characters from a template string), I'll patch your code every time you release a new version. Thank you for your support.

mcbenjemaa commented 10 months ago

We will try to support this soon.

ricottatosta commented 10 months ago

I would like to share my experience about using CAPMOX with TALOS. This is the template I use in pkg/cloudinit/network.go:

version: 2
ethernets:
{{- range $index, $element := .NetworkConfigData }}
  eth{{ $index }}:
    match:
      macaddress: {{ $element.MacAddress }}
    dhcp4: false
    addresses:
    {{- if $element.IPAddress }}
      - {{ $element.IPAddress }}
    {{- end }}
    {{- if $element.IPV6Address }}
      - {{ $element.IPV6Address }}
    {{- end }}
  {{- if eq $index 0 }}
    {{- if $element.Gateway }}
    gateway4: {{ $element.Gateway }}
    {{- end }}
    {{- if $element.Gateway6 }}
    gateway6: {{ $element.Gateway6 }}
    {{- end }}
    {{- if $element.DNSServers }}
    nameservers:
      addresses:
      {{- range $element.DNSServers }}
        - {{ . }}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

TALOS dislikes defining static routes in place of gateways and perhaps even defining nameservers for each device. For the rest, it works like a charm.

mcbenjemaa commented 10 months ago

Thanks for sharing, I guess the best solution for this is to support another version of network-config thats actually different from the netplan config.

mcbenjemaa commented 9 months ago

We have released a new version.

ricottatosta commented 9 months ago

I'm trying to make nocloud template string work like the netplan one as much as possible. But there is an issue. If I omit to define a gateway, the injector refuses to build the cloudinit image complaining it wants it. Is there a way to define only address and netmask for a network device?

mcbenjemaa commented 9 months ago

Oh, no The gateway is required

mcbenjemaa commented 9 months ago

I will create new issue, so we can take this as a feature to support multiple cloud-init network-config.

However, DHCP will be included in the next release, i don't know if that helps you.

mcbenjemaa commented 9 months ago

Issue is added

https://github.com/ionos-cloud/cluster-api-provider-proxmox/issues/94

65278 commented 6 months ago

I'm trying to make nocloud template string work like the netplan one as much as possible. But there is an issue. If I omit to define a gateway, the injector refuses to build the cloudinit image complaining it wants it. Is there a way to define only address and netmask for a network device?

One workaround I've found is to add an illegal gateway (169.254.255.254 for example). Cluster-api-provider-ipam-in-cluster will accept this, and netplan will ignore it when applying with a warning. I have not tested this with cloud-init network config v1.

isZumpo commented 6 months ago

I have been trying to get the talos bootstrap provider and proxmox infrastructure provider together the whole morning. Running into the network-config metadata version=0 is not supported issue... So glad to finally find this github issue about it, meaning that I probably did nothing wrong in my configuration :)

Has there been any progress on the topic since January? Could possibly some sort of variable be added that makes it generate the cidata in the nocloud format such that it is compatible with talos?

wikkyk commented 5 months ago

Unfortunately not. We don't use Talos and as such we can't commit to adding support for it. Patches welcome, of course, and we are more than happy to accept additional maintainers :-)

ricottatosta commented 5 months ago

CAPMOX works well with TALOS. But it needs some patches to the CAPMOX's code. Furthermore, it is possible to make your cluster "elastic", even self-managed (without an external cluster that creates and manages it). All it needs to do is modifying network.go file located at pkg/cloudinit/ in the source code and rebuild docker image. If something ready to use is needed, there is a docker image at https://hub.docker.com/r/ricottatosta/cluster-api-provider-proxmox already patched for TALOS. After deploying CAPMOX, patch CAPMOX deployment manifest and let its container image point at ricottatosta/cluster-api-provider-proxmox:[tag]. Last version (0.3.0) has the following patch applied:

...
const (
    /* network-config template. */
    networkConfigTPl = `version: 2
renderer: networkd
ethernets:
{{- range $index, $element := .NetworkConfigData }}
  eth{{ $index }}:
    match:
      macaddress: {{ $element.MacAddress }}
    dhcp4: {{ if $element.DHCP4 }}true{{ else }}false{{ end }}
    dhcp6: {{ if $element.DHCP6 }}true{{ else }}false{{ end }}
  {{- if or (and (not $element.DHCP4) $element.IPAddress) (and (not $element.DHCP6) $element.IPV6Address) }}
    addresses:
    {{- if $element.IPAddress }}
      - {{ $element.IPAddress }}
    {{- end }}
    {{- if $element.IPV6Address }}
      - '{{ $element.IPV6Address }}'
    {{- end }}
  {{- if eq $index 0 }}
    {{- if and $element.Gateway (not $element.DHCP4) }}
    gateway4: {{ $element.Gateway }}
    {{- end }}
    {{- if and $element.Gateway6 (not $element.DHCP6) }}
    gateway6: '{{ $element.Gateway6 }}'
    {{- end }}
    {{- if $element.DNSServers }}
    nameservers:
      addresses:
      {{- range $element.DNSServers }}
        - {{ . }}
      {{- end -}}
    {{- end -}}
  {{- end -}}
  {{- end -}}
{{- end -}}
{{- $vrf := 0 -}}
{{- range $index, $element := .NetworkConfigData }}
{{- if eq $element.Type "vrf" }}
{{- if eq $vrf 0 }}
vrfs:
{{- $vrf := 1 }}
{{- end }}
  {{$element.Name}}:
    table: {{ $element.Table }}
    {{- if $element.Routes }}{{ template "routes" $element }}{{- end -}}
    {{- if $element.FIBRules }}{{ template "rules" $element }}{{- end -}}
    {{- if $element.Interfaces }}
    interfaces:
    {{- range $element.Interfaces }}
      - {{ . }}
    {{- end -}}
    {{- end -}}
{{- end -}}
{{- end -}}
{{- define "rules" }}
    routing-policy:
    {{- range $index, $rule := .FIBRules }}
      - {
      {{- if $rule.To }} "to": "{{$rule.To}}", {{ end -}}
      {{- if $rule.From }} "from": "{{$rule.From}}", {{ end -}}
      {{- if $rule.Priority }} "priority": {{$rule.Priority}}, {{ end -}}
      {{- if $rule.Table }} "table": {{$rule.Table}}, {{ end -}} }
    {{- end }}
{{- end -}}
{{- define "routes" }}
    routes:
    {{- range $index, $route := .Routes }}
      - {
      {{- if $route.To }} "to": "{{$route.To}}", {{ end -}}
      {{- if $route.Via }} "via": "{{$route.Via}}", {{ end -}}
      {{- if $route.Metric }} "metric": {{$route.Metric}}, {{ end -}}
      {{- if $route.Table }} "table": {{$route.Table}}, {{ end -}} }
    {{- end }}
{{- end -}}
`
)
...

It's not tested against vrf. My use case is two ethernets, public and private. As mentioned earlier, it works like a charm.

isZumpo commented 5 months ago

CAPMOX works well with TALOS. But it needs some patches to the CAPMOX's code. Furthermore, it is possible to make your cluster "elastic", even self-managed (without an external cluster that creates and manages it). All it needs to do is modifying network.go file located at pkg/cloudinit/ in the source code and rebuild docker image. If something ready to use is needed, there is a docker image at https://hub.docker.com/r/ricottatosta/cluster-api-provider-proxmox already patched for TALOS. After deploying CAPMOX, patch CAPMOX deployment manifest and let its container image point at ricottatosta/cluster-api-provider-proxmox:[tag]. Last version (0.3.0) has the following patch applied:

...
const (
  /* network-config template. */
  networkConfigTPl = `version: 2
renderer: networkd
ethernets:
{{- range $index, $element := .NetworkConfigData }}
  eth{{ $index }}:
    match:
      macaddress: {{ $element.MacAddress }}
    dhcp4: {{ if $element.DHCP4 }}true{{ else }}false{{ end }}
    dhcp6: {{ if $element.DHCP6 }}true{{ else }}false{{ end }}
  {{- if or (and (not $element.DHCP4) $element.IPAddress) (and (not $element.DHCP6) $element.IPV6Address) }}
    addresses:
    {{- if $element.IPAddress }}
      - {{ $element.IPAddress }}
    {{- end }}
    {{- if $element.IPV6Address }}
      - '{{ $element.IPV6Address }}'
    {{- end }}
  {{- if eq $index 0 }}
    {{- if and $element.Gateway (not $element.DHCP4) }}
    gateway4: {{ $element.Gateway }}
    {{- end }}
    {{- if and $element.Gateway6 (not $element.DHCP6) }}
    gateway6: '{{ $element.Gateway6 }}'
    {{- end }}
    {{- if $element.DNSServers }}
    nameservers:
      addresses:
      {{- range $element.DNSServers }}
        - {{ . }}
      {{- end -}}
    {{- end -}}
  {{- end -}}
  {{- end -}}
{{- end -}}
{{- $vrf := 0 -}}
{{- range $index, $element := .NetworkConfigData }}
{{- if eq $element.Type "vrf" }}
{{- if eq $vrf 0 }}
vrfs:
{{- $vrf := 1 }}
{{- end }}
  {{$element.Name}}:
    table: {{ $element.Table }}
    {{- if $element.Routes }}{{ template "routes" $element }}{{- end -}}
    {{- if $element.FIBRules }}{{ template "rules" $element }}{{- end -}}
    {{- if $element.Interfaces }}
    interfaces:
    {{- range $element.Interfaces }}
      - {{ . }}
    {{- end -}}
    {{- end -}}
{{- end -}}
{{- end -}}
{{- define "rules" }}
    routing-policy:
    {{- range $index, $rule := .FIBRules }}
      - {
      {{- if $rule.To }} "to": "{{$rule.To}}", {{ end -}}
      {{- if $rule.From }} "from": "{{$rule.From}}", {{ end -}}
      {{- if $rule.Priority }} "priority": {{$rule.Priority}}, {{ end -}}
      {{- if $rule.Table }} "table": {{$rule.Table}}, {{ end -}} }
    {{- end }}
{{- end -}}
{{- define "routes" }}
    routes:
    {{- range $index, $route := .Routes }}
      - {
      {{- if $route.To }} "to": "{{$route.To}}", {{ end -}}
      {{- if $route.Via }} "via": "{{$route.Via}}", {{ end -}}
      {{- if $route.Metric }} "metric": {{$route.Metric}}, {{ end -}}
      {{- if $route.Table }} "table": {{$route.Table}}, {{ end -}} }
    {{- end }}
{{- end -}}
`
)
...

It's not tested against vrf. My use case is two ethernets, public and private. As mentioned earlier, it works like a charm.

Thanks! This patch appears to be working. I am now getting past the point where it was complaining that: "network-config metadata version=0 is not supported" :)

On a related point, how are you setting up the initial network for the control plane using Talos and Proxmox? I am attempting to use the VIP solution built into Talos but it seems to not be working... If you don't mind it would be very nice to see an example of your TalosControlPlane and ProxmoxCluster objects

ricottatosta commented 5 months ago

For networking I use Cilium without kube-proxy. For VIP I use kube-vip in BGP mode as daemonset. Following is what you asked for:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxCluster
metadata:
  name: k8s-test
  namespace: k8s-test
spec:
  allowedNodes:
    - pve1
    - pve2
    - pve3
  controlPlaneEndpoint:
    host: 10.100.150.150 (my VIP)
    port: 6443
  dnsServers:
    - 10.100.150.1
  ipv4Config:
    addresses:
      - 10.100.150.151-10.100.150.159
    gateway: 10.100.150.254
    prefix: 24

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: k8s-test-control-plane
  namespace: k8s-test
spec:
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.6.1
      configPatches:
        - op: add
          path: /machine/network/extraHostEntries
          value:
            - ip: 127.0.0.1
              aliases:
                - kubernetes
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxMachineTemplate
    name: k8s-test-control-plane
  replicas: 1
  version: 1.28.3
wikkyk commented 5 months ago

@ricottatosta Thank you for this - can you submit it as a PR, please?