sodafoundation / installer

provides easy installation and basic deployment based on specific configurations for SODA Projects
Apache License 2.0

Not able to install Ceph using ansible #326

Open PravinRanjan10 opened 4 years ago

PravinRanjan10 commented 4 years ago

Describe the bug: Tried to install Ceph using ansible. The run fails at the end, during the sanity check; please take a look at the error below. The interesting point is that the health status looks fine (HEALTH_OK), but other operations are failing.

TASK [osdsdock : ceph cluster sanity check] ***
task path: /root/opensds-installer/ansible/roles/osdsdock/scenarios/ceph_stabilize.yml:17
fatal: [localhost]: FAILED! => {
    "msg": "Unexpected templating type error occurred on (INTERVAL={{ ceph_check_interval|quote }}\n MAX_CHECK={{ ceph_check_count|quote }}\n declare -a ceph_stat_array=()\n i=0\n while true\n do\n ceph_stat_array=(sudo ceph -s | awk '/health:/{print $2;}/osd:/{print $2, $4, $6;}')\n # check health status. 3 conditions below\n # 1) HEALTH_OK means healty mon cluster.\n # 2) check joined osd num. At least 1 osd.\n # 3) check joined osds are all up\n if [ \"${ceph_stat_array[0]}\" == \"HEALTH_OK\" ] && [ \"${ceph_stat_array[1]}\" -ge 1 ] && [ \"${ceph_stat_array[1]}\" -eq \"${ceph_stat_array[2]}\" ]; then\n exit 0\n fi\n i=expr ${i} \\+ 1\n if [ \"${i}\" -ge \"${MAX_CHECK}\" ]; then\n exit 1\n fi\n sleep ${INTERVAL}\n done): 'int' object is not iterable"
}
        to retry, use: --limit @/root/opensds-installer/ansible/site.retry

PLAY RECAP ****
localhost                  : ok=52   changed=28   unreachable=0    failed=1

root@root1-Latitude-E7450:~/opensds-installer/ansible# dpkg -l|grep ceph
ii  ceph                  13.2.8-1bionic  amd64  distributed storage and file system
ii  ceph-base             13.2.8-1bionic  amd64  common ceph daemon libraries and management tools
ii  ceph-common           13.2.8-1bionic  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse             13.2.8-1bionic  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds              13.2.8-1bionic  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr              13.2.8-1bionic  amd64  manager for the ceph distributed storage system
ii  ceph-mon              13.2.8-1bionic  amd64  monitor server for the ceph storage system
ii  ceph-osd              13.2.8-1bionic  amd64  OSD server for the ceph storage system
ii  libcephfs2            13.2.8-1bionic  amd64  Ceph distributed file system client library
ii  python-ceph-argparse  13.2.8-1bionic  amd64  Python 2 utility libraries for Ceph CLI
ii  python-cephfs         13.2.8-1bionic  amd64  Python 2 libraries for the Ceph libcephfs library

root@root1-Latitude-E7450:~/opensds-installer/ansible# ceph fs ls
No filesystems enabled

root@root1-Latitude-E7450:~/opensds-installer/ansible# ceph -s
  cluster:
    id:     0becd271-442d-4f38-a9dc-d140a9f4ad46
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum root1-Latitude-E7450
    mgr: no daemons active
    osd: 0 osds: 0 up, 0 in


PravinRanjan10 commented 4 years ago

@thatsdone As per our discussion, please take a look.

thatsdone commented 4 years ago

Some notes from our private discussion for records.

  1. The 'ceph cluster sanity check' task checks not only the cluster status (MONs) but also the OSDs, in order to ensure the overall health of the ceph cluster. In this case, ceph -s says 'HEALTH_OK', but the last line of its output shows that no OSDs are up and running. So even if you had not hit the syntax-level error you originally got, you would have run into another problem (a timeout while waiting for the ceph cluster to stabilize); see the sketch after this list.
  2. When I tried to reproduce this issue using Ubuntu 16.04 (not 18.04), I got a different error, a kind of timeout-exceeded error, and I'm still trying to figure out its root cause.
  3. It's strange to get an error saying 'Unexpected templating type...' like the above. I suspect there could be missing parameters, for example 'ceph_check_interval', etc. So I asked @PravinRanjan10 to show me the 'git diff' of the opensds-installer working directory.

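For reference, the shell fragment embedded in the failing template amounts to roughly the loop below (a simplified sketch reconstructed from the error message above, with placeholder values for the interval/count variables; it is not the exact contents of ceph_stabilize.yml). With 'osd: 0 osds: 0 up, 0 in', the '-ge 1' test can never pass, so the task would eventually give up:

#!/bin/bash
# Simplified sketch of the 'ceph cluster sanity check' loop, reconstructed
# from the error message above; not the exact contents of ceph_stabilize.yml.
INTERVAL=5      # placeholder for ceph_check_interval
MAX_CHECK=60    # placeholder for ceph_check_count
i=0
while true; do
    # Extract the health status plus the "N osds: N up, N in" counters.
    stat=($(sudo ceph -s | awk '/health:/{print $2} /osd:/{print $2, $4, $6}'))
    # Pass only if the cluster is HEALTH_OK, at least one osd exists,
    # and every registered osd is up.
    if [ "${stat[0]}" == "HEALTH_OK" ] && [ "${stat[1]}" -ge 1 ] && \
       [ "${stat[1]}" -eq "${stat[2]}" ]; then
        exit 0
    fi
    i=$(expr ${i} + 1)
    if [ "${i}" -ge "${MAX_CHECK}" ]; then
        exit 1   # give up: the cluster never stabilized
    fi
    sleep ${INTERVAL}
done
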
PravinRanjan10 commented 4 years ago

@thatsdone Sure, let me reproduce this again. I will share the full log with you. Thanks for the quick look into this.

thatsdone commented 4 years ago

@PravinRanjan10

Three points.

(1) While trying to reproduce this issue, I noticed that ceph mimic checks the number of active OSDs against 'osd_pool_default_size', and this can make the ceph -s result HEALTH_WARN. I assume you are using a sufficient number of disks (or a consistent 'osd_pool_default_size' value), but please pay attention to this point anyway.

(2) When you try to reproduce the issue, please ensure that you have the following configuration parameters in 'ansible/group_vars/osdsdock.yml'.
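At minimum, the two variables referenced by the failing sanity-check template need to be defined there. A minimal sketch, with placeholder values rather than the project defaults:

# ansible/group_vars/osdsdock.yml (illustrative sketch; values are placeholders)
ceph_check_interval: 5    # seconds to wait between 'ceph -s' polls
ceph_check_count: 60      # maximum number of polls before giving up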

(3) From the viewpoint of the new feature you are working on (ceph fs support), I think it's better to use an existing ceph cluster instead of building one via opensds-installer, because ceph setup itself is essentially outside the scope of SODA/opensds. What do you think?

PravinRanjan10 commented 4 years ago

@thatsdone I agree with your 3rd point, but at least for testing we need a ceph cluster set up via ansible, right?

I will also check the other two points while recreating the issue.

thatsdone commented 4 years ago

@PravinRanjan10

Hi. I'm still not sure of the root cause of the original symptom (an ansible template error). But at least I successfully installed Daito (v0.11.0) as a single node with ceph (1 osd) by specifying 'group_vars/ceph/all.yml' as shown below.

This is essentially because ceph added a minimum osd count check after mimic, so we need to specify 'osd_pool_default_size' in /etc/ceph/ceph.conf. The default is 3, which means that without specifying the parameter we would need at least 3 osds (and osd volumes).

I would suggest setting 'osd_pool_default_size' to 1 via 'ceph_conf_overrides', because from the perspective of opensds-installer the ceph cluster is just a mock (normally used for testing other SODA features).

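Concretely, that override is expected to end up in /etc/ceph/ceph.conf roughly like this (an assumed illustration of the effect, not output I captured):

# /etc/ceph/ceph.conf (excerpt, illustrative)
[global]
osd_pool_default_size = 1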

If this makes sense, I can submit a PR.

ubuntu@ubuntu201:~/opensds-installer$ git diff ansible/group_vars/ceph/all.yml
diff --git a/ansible/group_vars/ceph/all.yml b/ansible/group_vars/ceph/all.yml
index c0cc427..49a362e 100644
--- a/ansible/group_vars/ceph/all.yml
+++ b/ansible/group_vars/ceph/all.yml
@@ -25,11 +25,12 @@ dummy:

 ceph_origin: repository
 ceph_repository: community
-ceph_stable_release: luminous
+ceph_stable_release: mimic
 public_network: "{{ ansible_default_ipv4.address }}/24"
 cluster_network: "{{ public_network }}"
 monitor_interface: "{{ ansible_default_ipv4.interface }}"
 devices:
+  - '/dev/vdb'
   #- '/dev/sda'
   #- '/dev/sdb'
 osd_scenario: collocated
@@ -467,7 +468,9 @@ osd_scenario: collocated
 #     bar: 5678
 #
 #ceph_conf_overrides: {}
-
+ceph_conf_overrides:
+    global:
+        osd_pool_default_size: 1

 #############
 # OS TUNING #
ubuntu@ubuntu201:~/opensds-installer$

Thanks in advance, Masanori

thatsdone commented 4 years ago

@kumarashit Hi, if my comment above makes sense, I will create a PR (as I mentioned in my comment). What do you think?

PravinRanjan10 commented 4 years ago

@thatsdone I think you can go ahead and raise a PR if it resolves the issue. Then we can also take your code and re-test. But we had also previously tried setting that value ('osd_pool_default_size' under 'ceph_conf_overrides') to both 1 and 3, and it didn't help.

But again, I would recommend that you raise your PR.