lae / ansible-role-proxmox

IaC for Proxmox VE clusters.
MIT License

feature: add debian 12 bookworm and proxmox 8 compatibility #230

Closed lexxxel closed 4 months ago

lexxxel commented 1 year ago

resolves: #226

gilou commented 10 months ago

Have you actually tested it so that nothing using the API or the cluster management breaks? I'll do it in a bit myself, and try to review it with those changes.

lexxxel commented 10 months ago

Have you actually tested it so that nothing using the API or the cluster management breaks? I'll do it in a bit myself, and try to review it with those changes.

I use the modified role on my 3 node cluster. Until now I have not found any issue managing lxc or VMs

gilou commented 10 months ago

I'm testing in Vagrant (the Vagrantfile could be upgraded a bit, but maybe we want to test on older versions ;)), and will deploy a few scenarios... First thing, you might want to have the Ceph version depend on the Proxmox version. Because it's getting annoying, I'd suggest having the Ceph release name as a dictionary, with the keys being Debian ansible_distribution_release values and the values being the Ceph release names. Something along the lines of:

pve_ceph_version:
  buster: nautilus
  bullseye: pacific
  bookworm: quincy
pve_ceph_repository_line: "deb http://download.proxmox.com/debian/ceph-{{ pve_ceph_version[ansible_distribution_release] }} {{ ansible_distribution_release }} no-subscription"

in defaults/main.yaml
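
(For a bookworm node, for example, that line renders to deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription.)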

It has another benefit: one can decide they want a different Ceph release and override just that entry, without overriding the whole repository line, though I doubt it's actually that useful.

lexxxel commented 10 months ago

Ahh, let me clarify: I use btrfs and ZFS for storage. I do not have experience with all the other storage providers.

gilou commented 10 months ago

Ah, there's more (I fixed the ceph- prefix in my comment): it seems Proxmox switched from using "main" as the channel to no-subscription… I fixed it, but I don't know if it would work as is for 6.x/7.x… Also, the docs mention pveceph install to set up the repos.

I'm not deploying ceph using ansible myself, but I'm just playing the default test playbook that enables it. It might be that nobody cares…

lae commented 10 months ago

Have you actually tested it so that nothing using the API or the cluster management breaks?

Yeah, I was personally waiting to test it out but have been busy as of late. The Vagrant testing suite also needed to be updated, but I figured I'd do that myself.

I'm not deploying ceph using ansible myself, but I'm just playing the default test playbook that enables it. It might be that nobody cares…

People definitely do care. Ceph still needs to be functional on update. #215 is supposed to be an update to the Ceph repos but I'd left a comment there regarding behaviour on existing setups. Also, the role does what pveceph install does under the hood except in a more controlled fashion afaik.

gilou commented 10 months ago

Hey, glad to have you around!

Yeah, I'm getting lazy about the channel name switching from main to no-subscription, and I'm not sure when that happened either (7? 8?).

And as for Vagrant, I'll probably start another conversation; I like how you set it up, and it's probably not too hard to update it and, at some point, automate that ;)

gilou commented 10 months ago

I've tested the proposed PR on Vagrant (Debian bookworm from bento as the base) and on a single-node cluster (deployed as Proxmox), and it seems to be working fine. I haven't tested ZFS or Ceph, just the basics, but it looks OK.

gilou commented 10 months ago

I'm not experiencing this on a "real setup", mostly because I don't provision Ceph using the role, but here's what a test in VMs using the role in Vagrant does:

TASK [lae.proxmox : Validate and compress new crushmap] ************************
fatal: [pve-1]: FAILED! => {"changed": true, "cmd": ["crushtool", "-c", "crush_map_decompressed", "-o", "new_crush_map_compressed"], "delta": "0:00:00.021260", "end": "2023-08-29 21:25:06.169639", "msg": "non-zero return code", "rc": 1, "start": "2023-08-29 21:25:06.148379", "stderr": "crush_map_decompressed:44 error: parse error at ''", "stderr_lines": ["crush_map_decompressed:44 error: parse error at ''"], "stdout": "", "stdout_lines": []}

I'm not sure I can debug that easily, given the complexity of the regex, but I may try…

Also, and that's probably a change in the API, the tests fail using Vagrant, as it seems to be forbidden to have a "no-content" dir, that is, a storage that does not define a valid content attribute. This is the error I get:

create storage failed: unknown vtype 'none'

This is a non-issue; I'm pretty sure it makes no sense to try and create that anymore.

lae commented 10 months ago

FYI for anyone coming to this issue and wanting to use this immediately, I'll suggest what I've suggested a few times over the years. You can install this PR as a drop-in replacement for lae.proxmox via ansible-galaxy:

ansible-galaxy install https://github.com/lexxxel/ansible-role-proxmox/archive/feature/add_bookworm_and_debian_12_compatibility.tar.gz,pr-230,lae.proxmox --force

(which you should upgrade from once this is merged and a new release has been cut)
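
If you prefer a requirements file, the same pin can be written there instead; a minimal sketch (same tarball URL as above, with name keeping it installed as lae.proxmox):

# requirements.yml: installs the PR branch under the usual role name
- name: lae.proxmox
  src: https://github.com/lexxxel/ansible-role-proxmox/archive/feature/add_bookworm_and_debian_12_compatibility.tar.gz

then run ansible-galaxy install -r requirements.yml --force.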

gilou commented 10 months ago

What I usually do is symlink ~/.ansible/roles/lae.proxmox to my working dir, so I can decide what branch / git magic I'm operating on.

lae commented 10 months ago

What I usually do is symlink ~/.ansible/roles/lae.proxmox to my working dir, so I can decide what branch / git magic I'm operating on.

Yeah, that's what I do in my environment too but that method's really only useful for contributors of the role lol (well, not to say others can't do it either, but for one-off situations installing the git-archive tarball like above is a bit less maintenance)

gilou commented 10 months ago

I'm not experiencing this on a "real setup", mostly because I don't provision Ceph using the role, but here's what a test in VMs using the role in Vagrant does:

TASK [lae.proxmox : Validate and compress new crushmap] ************************
fatal: [pve-1]: FAILED! => {"changed": true, "cmd": ["crushtool", "-c", "crush_map_decompressed", "-o", "new_crush_map_compressed"], "delta": "0:00:00.021260", "end": "2023-08-29 21:25:06.169639", "msg": "non-zero return code", "rc": 1, "start": "2023-08-29 21:25:06.148379", "stderr": "crush_map_decompressed:44 error: parse error at ''", "stderr_lines": ["crush_map_decompressed:44 error: parse error at ''"], "stdout": "", "stdout_lines": []}

OK, found the culprit: min_size and max_size are no longer valid in Quincy crush maps... So if we want to keep that available in tasks/ceph.yml for older versions, we'll need to add a condition there for the big regex replace that edits the crush rules…

EDIT: fixing that is easy, as we can simply not write the min/max attributes anymore, but maybe we want an upgrade path there…
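
A minimal sketch of that upgrade path, assuming the pve_ceph_version mapping discussed above (the task name, file path and regex here are placeholders, not the role's actual ones):

- name: Strip min_size/max_size from decompressed crush map on Quincy   # hypothetical task
  ansible.builtin.replace:
    path: crush_map_decompressed          # placeholder path
    regexp: '^\s*(min_size|max_size) \d+\n'
    replace: ''
  when: pve_ceph_version[ansible_distribution_release] in ['quincy']    # extend for later releases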

gilou commented 10 months ago

Something along the lines of:

pve_ceph_version:
  buster: nautilus
  bullseye: pacific
  bookworm: quincy
pve_ceph_repository_line: "deb http://download.proxmox.com/debian/ceph-{{ pve_ceph_version[ansible_distribution_release] }} {{ ansible_distribution_release }} no-subscription"

Actually, I didn't notice we already have that logic in load_variables.yml, so we can simply set the appropriate ceph version in vars/version.yml, and let the user overwrite it if needed…
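
For illustration only (the file and variable names here are hypothetical, not necessarily the role's), the idea would be a per-release vars file that load_variables.yml picks up:

# hypothetical vars/debian-bookworm.yml
pve_ceph_release: quincy
pve_ceph_repository_line: "deb http://download.proxmox.com/debian/ceph-{{ pve_ceph_release }} {{ ansible_distribution_release }} no-subscription"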

mrtwnklr commented 7 months ago

A note for users with nfs storage:

I am using a fresh PVE 8 installation (Vagrant based) with NFS storage. My playbook fails when executing the task 'Configure Proxmox Storage'. Debugging the issue showed that the creation of the first NFS mount via pvesh on the shell led to the following result:

$ pvesh create storage --storage nas-downloads --output=json --content=iso,vztmpl --export=/volume1/homeserver-shared --server=10.10.10.10 --type=nfs
Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.
{"storage":"nas-data","type":"nfs"}

echo $? gives 0, but the output is interpreted as an error by lae.proxmox/module_utils/pvesh.py#run_command(). I "fixed" this by starting the rpc-statd service prior to executing lae.proxmox:

- name: Start rpc-statd.service
  ansible.builtin.service:
    name: rpc-statd
    enabled: true
    state: started

btravouillon commented 4 months ago

Something along the lines of:

pve_ceph_version:
  buster: nautilus
  bullseye: pacific
  bookworm: quincy
pve_ceph_repository_line: "deb http://download.proxmox.com/debian/ceph-{{ pve_ceph_version[ansible_distribution_release] }} {{ ansible_distribution_release }} no-subscription"

Actually, I didn't notice we already have that logic in load_variables.yml, so we can simply set the appropriate ceph version in vars/version.yml, and let the user overwrite it if needed…

That's a nice proposal to update the defaults with pve_ceph_version. AFAIK, it's not possible for a user to override a variable defined in vars/, so I would stick with a sane default and allow changing the Ceph release. Our team is currently testing the upgrade from 7.x to 8.x and we had to upgrade Ceph to Quincy first. We defined a new value for pve_ceph_repository_line, but it would have been more elegant to define pve_ceph_version['bullseye'] = 'quincy'.
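
As a sketch of that kind of override, assuming the dictionary lands in defaults/ (note that with Ansible's default hash behaviour the whole dictionary is replaced, so every key is restated):

# group_vars/proxmox.yml (hypothetical file): restate the full mapping when overriding one entry
pve_ceph_version:
  buster: nautilus
  bullseye: quincy     # force Quincy on bullseye ahead of the PVE 7 -> 8 upgrade
  bookworm: quincy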

@lexxxel do you think you could add this to the PR?

btravouillon commented 4 months ago

With the Proxmox VE 7 EOL (2024-07) coming soon, is there any plan to merge this PR and release a new version of the role?

FWIW, our team has been able to use this PR to upgrade our lab cluster (5 nodes) from PVE 7.4 to 8.1 and Ceph to 17.2 with the role.

lae commented 4 months ago

sorry for taking so long on this.

tbjers commented 4 months ago

sorry for taking so long on this.

You're here now, that's all that matters! ❤️