shiftstack / dev-install

Align Ceph deployment to OSP 17 #195

Closed fmount closed 1 year ago

fmount commented 2 years ago

This patch adds a new Ceph role which is responsible for executing a set of tasks that deploy the cluster in the standalone scenario. This work aligns dev-install with the new approach introduced in Wallaby/Zed, where the Ceph bootstrap is no longer part of the overcloud deploy.

In particular, as per [1], Ceph expects to see:

  1. an already provisioned metal
  2. an already provisioned storage_network
  3. reserved VIPs (if needed)

There are still actions happening during the overcloud deployment (pool creation, Ganesha deployment) to finalize the cloud, but the initial bootstrap process has been moved to a set of openstack overcloud ceph commands, executed through the related tripleo.operator roles.
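
For context, the multi-node flow from [1] that this patch mirrors in the standalone case looks roughly like the sketch below; file names and flags are illustrative rather than the exact invocations dev-install uses:

openstack overcloud network provision \
    --output deployed_network.yaml network_data.yaml
openstack overcloud network vip provision --stack overcloud \
    --output deployed_vips.yaml vip_data.yaml
openstack overcloud node provision --stack overcloud \
    --output deployed_metal.yaml baremetal_deployment.yaml
openstack overcloud ceph deploy deployed_metal.yaml \
    --stack overcloud --output deployed_ceph.yaml
openstack overcloud deploy --templates [...] \
    -e deployed_network.yaml -e deployed_vips.yaml \
    -e deployed_metal.yaml -e deployed_ceph.yaml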

Note that a similar approach has been recently introduced in oooq [2].

[1] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/deployed_ceph.html#deployed-ceph-workflow
[2] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/ceph-install.yml

Signed-off-by: Francesco Pantano fpantano@redhat.com

fmount commented 2 years ago

@gouthampacha FYI

mdbooth commented 2 years ago

This looks awesome. Does it still work for 16.2?

fmount commented 2 years ago

> This looks awesome. Does it still work for 16.2?

I didn't try to deploy against a 16.2 compose, but I'm assuming it should work because in that case the cephadm role is skipped (as the entire process happens during the regular overcloud deployment). It's worth trying this code against a 16.2 compose though, just to make sure we're not introducing particular issues.
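
(For illustration, the skip amounts to something like the hypothetical Ansible condition below; the actual variable and role names in this PR may differ.)

# Hypothetical sketch: only run the new bootstrap path on
# releases where cephadm drives the deployment (Wallaby/17+).
- name: Bootstrap Ceph with cephadm
  include_role:
    name: cephadm
  when: ceph_deployment_type | default('ceph-ansible') == 'cephadm'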

fmount commented 2 years ago

> This looks awesome. Does it still work for 16.2?
>
> I didn't try to deploy against a 16.2 compose, but I'm assuming it should work because in that case the cephadm role is skipped (as the entire process happens during the regular overcloud deployment). It's worth trying this code against a 16.2 compose though, just to make sure we're not introducing particular issues.

Quick update here:

In a 16.2 env I tried, I definitely see:

[stack@standalone ~]$ rpm -qa | grep ceph-ansible
ceph-ansible-4.0.70.3-1.el8cp.noarch

[stack@standalone ~]$ cat standalone_parameters.yaml
...
...
...
  CephAnsibleDisksConfig:
    osd_scenario: lvm
    osd_objectstore: bluestore
    devices:
    - /dev/sdb
    - /dev/sdc
    - /dev/sdd
  CephAnsibleExtraConfig:
    cluster_network: 192.168.24.0/24
    public_network: 192.168.24.0/24
    ceph_nfs_bind_addr: "10.1.27.25"
    ceph_nfs_docker_extra_env: "--pids-limit=0"

  CephPoolDefaultPgNum: 32
  CephPoolDefaultSize: 1
...
...

and the overcloud deploy command looks like:

[stack@standalone ~]$ cat tripleo_deploy.sh | grep ceph-ansible
sudo openstack tripleo deploy  --templates $DEPLOY_TEMPLATES --standalone  --yes --output-dir $DEPLOY_OUTPUT_DIR  --stack $DEPLOY_STACK --standalone-role $DEPLOY_STANDALONE_ROLE --timeout $DEPLOY_TIMEOUT_ARG -e /usr/share/openstack-tripleo-heat-templates/environments/standalone/standalone-tripleo.yaml -e /home/stack/containers-prepare-parameters.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/external-network-vip.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-mds.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsganesha-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/octavia.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/standalone_parameters.yaml -r $DEPLOY_ROLES_FILE -n $DEPLOY_NETWORKS_FILE     --deployment-user $DEPLOY_DEPLOYMENT_USER  --local-ip $DEPLOY_LOCAL_IP --control-virtual-ip $DEPLOY_CONTROL_VIP --public-virtual-ip $DEPLOY_PUBLIC_VIP    --keep-running     >/home/stack/standalone_deploy.log 2>&1

where ceph-ansible is included.

Ceph is Nautilus:

[root@standalone /]# ceph versions
{
    "mon": {
        "ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)": 1
    },
    "mgr": {
        "ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)": 6
    }
}

and I see a live execution running w/o issues [0], so it looks safe to me in terms of backwards compatibility.

[0] https://paste.opendev.org/show/bSkRuAMB4LJdrQwCiDQ1/
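
For contrast, with this patch on a 17-style compose the ceph-ansible environment files above would be replaced by their cephadm equivalents, along these lines (illustrative, not the exact command dev-install generates; deployed_ceph.yaml is the output of the openstack overcloud ceph deploy step):

sudo openstack tripleo deploy --templates --standalone [...] \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
    -e deployed_ceph.yaml [...]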

EmilienM commented 2 years ago

A few notes:

vkmc commented 1 year ago

Redeploying OSP 17.0 with dev-install, I noticed this patch hasn't merged yet. What's the status of this?

EmilienM commented 1 year ago

I'm reviewing it.

EmilienM commented 1 year ago

I'm testing this against OSP 16 and then OSP 17, to make sure it all works.

EmilienM commented 1 year ago

Tested on OSP 16.2 with my usual templates; Ceph is healthy and everything seems fine.

sh-4.4# ceph -s
  cluster:
    id:     4be5a2e7-1de2-447a-819c-ab5f16f08f40
    health: HEALTH_WARN
            4 pool(s) have no replicas configured
            mon is allowing insecure global_id reclaim

  services:
    mon: 1 daemons, quorum foch (age 118m)
    mgr: foch(active, since 117m)
    mds: cephfs:1 {0=foch=up:active}
    osd: 1 osds: 1 up (since 117m), 1 in (since 117m)

  data:
    pools:   4 pools, 32 pgs
    objects: 239 objects, 1.6 GiB
    usage:   6.6 GiB used, 93 GiB / 100 GiB avail
    pgs:     32 active+clean

  io:
    client:   76 MiB/s rd, 85 B/s wr, 16 op/s rd, 0 op/s wr

EmilienM commented 1 year ago

Now trying OSP 17.1 with RHEL 9.2 and this PR.

EmilienM commented 1 year ago

This was tested against:

  1. OSP 16.2
  2. OSP 17.1 (RHEL 9.2)

All good.