lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Add support for CephFS volumes / sub-volumes #1023

Open benaryorg opened 4 months ago

benaryorg commented 4 months ago

Required information

Issue description

CephFS changed its mount string syntax in Quincy, the release that has recently reached its estimated EoL date (the current release being Reef, with Squid upcoming AFAIK). This means that every still-active upstream release (not talking about distro support) uses a mount string that is different from the one Incus is using right now.

This leads to users having a really hard time mounting CephFS filesystems created via the newer CephFS volumes/subvolumes mechanism (at least I haven't gotten it working yet).

As described in the discussion boards the old syntax was:

[mon1-addr]:3300,[mon2-addr]:3300,[mon3-addr]:3300:/path/to/thing

and a lot of options via the -o parameter (or the appropriate field in the mount syscall). Notably, Incus does not leave the config handling to mount.ceph but manually scrapes the mon addresses out of the config file. That has its own issues: the string matching used is insufficient to handle a config where the initial mon list refers to the mons by name and each mon is listed in its own section with its address as mon_addr. In that case mount.ceph can mount the volume just fine, but Incus fails while parsing the config file.
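
As an illustration (names and addresses are made up), this is the kind of config layout that mount.ceph handles fine but the current string matching trips over, since the addresses only appear inside the per-mon sections:

```text
# hypothetical /etc/ceph/ceph.conf
[global]
fsid = 01234567-89ab-cdef-0123-456789abcdef
mon_initial_members = mon1, mon2, mon3

[mon.mon1]
mon_addr = [2001:db8::1:0]:3300

[mon.mon2]
mon_addr = [2001:db8::1:1]:3300

[mon.mon3]
mon_addr = [2001:db8::1:2]:3300
```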

The new syntax is:

user@fsid.cephfsname=/path/to/thing

So with the user, the (optional) fsid, and the CephFS name encoded into the string, there are fewer options left to pass, although they do still exist.
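
For reference, this is roughly what the same mount looks like with both syntaxes (addresses, fsid, and paths are placeholders, and the option lists are not exhaustive; taken from the mount.ceph man page, so treat it as a sketch):

```sh
# old (pre-Quincy) syntax: mon list in the device string, user passed via -o name=
mount -t ceph \
    "[2001:db8::1:0]:3300,[2001:db8::1:1]:3300,[2001:db8::1:2]:3300:/path/to/thing" /mnt \
    -o name=client-name,secretfile=/etc/ceph/client-name.secret

# new (Quincy and later) syntax: user, fsid, and filesystem name in the device string
mount -t ceph \
    "client-name@01234567-89ab-cdef-0123-456789abcdef.cephfsname=/path/to/thing" /mnt \
    -o secretfile=/etc/ceph/client-name.secret
```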

Steps to reproduce

  1. run CephFS on ≥Quincy
  2. create CephFS volume and subvolume
  3. try to mount it

With vaguely correct-seeming parameters provided to Incus this still leads to interesting issues like No route to host errors despite everything being reachable. Honestly, if you find options that manage to mount this, please tell me, because I can't seem to find any.

Information to attach

Any relevant kernel output (dmesg)

```text
[ +13.628392] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[ +0.271853] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[ +0.519922] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[ +0.520979] ceph: No mds server is up or the cluster is laggy
```

Main daemon log (at /var/log/incus/incusd.log)

```text
Jul 19 20:32:09 lxd2 incusd[10412]: time="2024-07-19T20:32:09Z" level=error msg="Failed mounting storage pool" err="Failed to mount \"[2001:41d0:700:2038::1:0]:3300,[2001:41d0:1004:1a22::1:1]:3300,[2001:41d0:602:2029::1:2]:3300:/\" on \"/var/lib/incus/storage-pools/cephfs\" using \"ceph\": invalid argument" pool=cephfs
```
benaryorg commented 4 months ago

For completeness' sake, here are some commands to get a new CephFS volume and subvolume up and running, and what the final mount command might look like (I'm pulling this out of my shell history, so it's not guaranteed to be 100% accurate):

```sh
ceph fs volume create volume-name
ceph fs subvolumegroup create volume-name subvolume-group-name
ceph fs subvolume create volume-name subvolume-name --group_name subvolume-group-name

# this will now spit out a path including the UUID of the subvolume:
ceph fs subvolume getpath volume-name subvolume-name --group_name subvolume-group-name
# then authorize a new client (syntax changes slightly in the upcoming version)
ceph fs authorize volume-name client.client-name /volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 rw
# which can be mounted like this (the fsid can be omitted if it is in ceph.conf, the key will be read from the keyring in /etc/ceph too):
mount -t ceph client-name@.volume-name=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 /mnt
```
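
And in case neither the fsid nor the mons can be picked up from ceph.conf, the whole thing can be spelled out explicitly; going by the mount.ceph documentation (fsid and addresses below are placeholders, and the mon list uses / as separator since , already separates the options):

```sh
mount -t ceph \
    "client-name@01234567-89ab-cdef-0123-456789abcdef.volume-name=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07" /mnt \
    -o mon_addr=[2001:db8::1:0]:3300/[2001:db8::1:1]:3300/[2001:db8::1:2]:3300,secretfile=/etc/ceph/client-name.secret
```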
tregubovav-dev commented 4 months ago

Just a question: what use case is blocked? I actively use a CephFS storage pool with my Incus + Microceph deployment (as well as with LXD + Microceph in the past) and I do not see any issues. All such volumes are mounted to the instances.

benaryorg commented 4 months ago

> Just a question: what use case is blocked? I actively use a CephFS storage pool with my Incus + Microceph deployment (as well as with LXD + Microceph in the past) and I do not see any issues. All such volumes are mounted to the instances.

What does your storage configuration look like? I've tried several permutations that looked like they could work, but considering that at one point I also had to drop down to incus admin sql to delete a storage pool that got stuck in pending forever, I did not try everything.

tregubovav-dev commented 4 months ago

> What does your storage configuration look like? I've tried several permutations that looked like they could work, but considering that at one point I also had to drop down to incus admin sql to delete a storage pool that got stuck in pending forever, I did not try everything.

My cluster configuration is:

Steps to create a storage pool and deploy instances that share files using CephFS volumes:

  1. You need to have an existing CephFS volume. In my case it is:
    $ sudo ceph fs ls
    name: lxd_test_shared, metadata pool: lxd_test_shared_pool_meta, data pools: [lxd_test_shared_pool_data ]
  2. Create the storage pool:
    for i in {1..7}; do incus storage create test_shared_vols cephfs source=lxd_test_shared --target cl-0$i; done \
    && incus storage create test_shared_vols cephfs
  3. Once the pool is created you can create a storage volume (I use a separate project for the volume and the instances using it):
    incus storage volume create test_shared_vols test_vol1 size=256MiB --project test
  4. Create instances and attach the volume to them:
    for i in {1..7}; do inst=test-ct-0$i; \
    echo "Launching instance: $inst"; incus launch images:alpine/edge $inst --project test; \
    echo "Attaching 'test_vol1' to the instance"; incus storage volume attach test_shared_vols test_vol1 $inst data "/data" --project test; \
    echo "Listing content of '/data' directory:"; incus exec $inst --project test -- ls -l /data; \
    done
  5. Put a file on the shared volume:
    incus exec test-ct-04 --project test -- sh -c 'echo -e "This is a file\n placed to the shared volume.\n It is accessible from any instance where this volume is attached.\n" > /data/test.txt'
  6. Check that the file exists and has the expected content on each node:
    for i in {1..7}; do inst=test-ct-0$i; echo "Listing content of '/data' directory in the $inst instance"; incus exec $inst --project test -- ls -l /data; done
    for i in {1..7}; do inst=test-ct-0$i; echo "--- Printing content of '/data/test.txt' file in the $inst instance ---"; incus exec $inst --project test -- cat /data/test.txt; done

benaryorg commented 4 months ago

So far it does not look like you are using the ceph fs volume feature (at least not with subvolumes), otherwise your CephFS paths would include a UUID somewhere. Besides, using the admin credentials side-steps the mounting issues I'm seeing, because they allow you to mount the root of the CephFS even when trying to mount a CephFS subvolume. If you create a subvolume as per my first reply in this thread, you get credentials that do not have access to the root of the CephFS, which as far as I can tell makes the storage configuration you provided unusable: it does not contain any path, so the mount would fail for lack of permissions.
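
For illustration, this is roughly (from memory, so details may differ) what the caps look like after the ceph fs authorize call from my earlier comment; the mds and osd caps are pinned to the subvolume path and data pool rather than the filesystem root:

```sh
$ ceph auth get client.client-name
# output shape from memory; the point is the path-restricted caps
[client.client-name]
        key = AQ...==
        caps mds = "allow rw fsname=volume-name path=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07"
        caps mon = "allow r fsname=volume-name"
        caps osd = "allow rw tag cephfs data=volume-name"
```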

tregubovav-dev commented 4 months ago

> So far it does not look like you are using the ceph fs volume feature

Yes, you are correct. This is why I asked about your use case.

benaryorg commented 4 months ago

> Yes, you are correct. This is why I asked about your use case.

Ah, I see. The primary advantage to me personally is that I don't have to manually lay out a directory structure (i.e. I do not have to mount the CephFS with elevated privileges such as client.admin just to administrate it), quota support is baked in, and authorizing individual clients for shares becomes programmatic via that specific API (i.e. less worrying about adding or removing caps outside the CephFS system).

If I were to automate Incus cluster deployment (or even just deployment for individual consumers of CephFS, handling Incus the same way), I could use the Restful API module of the MGR for many operations in a way that is much less error-prone than managing CephFS through the generic APIs otherwise; I wouldn't need to create individual directory trees, and I would not have to enforce a convention for how the trees are laid out (since volumes come with their own specific layout). Quota management also stops being "write an xattr on a specific directory" and becomes tightly attached to the subvolume. The combination of getpath and the way auth management is handled also makes it a little harder to accidentally use the wrong path. This is mostly about automation and handling things programmatically, which is in line with what OpenStack Manila wants for its backend.
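
To make the quota point concrete (sizes and names are made up, and the commands are from memory, so double-check the exact flags):

```sh
# plain CephFS: the quota lives in an xattr on a directory you have to get right yourself
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/some/convention/for/shares/share-name

# subvolumes: the quota is an attribute of the subvolume itself
ceph fs subvolume create volume-name subvolume-name --group_name subvolume-group-name --size 10737418240
ceph fs subvolume resize volume-name subvolume-name 21474836480 --group_name subvolume-group-name
```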

Especially when administrating a Ceph cluster as part of a team with several admins, however, the added constraints make it much easier to work together, since there are no conventions one has to stick to purely by discipline; Ceph already enforces them.

Being able to create multiple volumes, each of which comes with its own pools and MDSs, also greatly improves how things work when you have to separate tenants for whatever reason. Given that it's often beneficial to run one big Ceph cluster instead of many small ones (due to the increase in failure domains), I can see how some of the customers I worked with would like to use that feature (granted, none of those customers were using Incus). With any newer cluster I would absolutely recommend using volumes, if only because you don't have to go back later and clean up every place where things weren't properly separated (inevitably every user of Ceph at some point needs some level of isolation for whatever reason; I've never not seen it happen).

In short: it keeps me from tripping over my own feet when adding a new isolated filesystem share by taking care of credential management, directory creation, and quotas, something I'd surely manage to mess up at least once and, say, delete the client.ceph credentials or something (which wouldn't be possible with the ceph fs deauthorize command as far as I can tell).

TL;DR: it's just more robust as soon as you need to have separate shares for different clients and makes managing the cluster easier if there is a strong separation of concerns.

tregubovav-dev commented 4 months ago

> Ah, I see.

I appreciated your detailed explanation.