gluster / gluster-block

A framework for gluster block storage
GNU General Public License v2.0

reload: add ability to reload a single block volume #252

Closed: pkalever closed this 4 years ago

pkalever commented 5 years ago

What does this PR achieve? Why do we need it?

Problem:

Right now, if any block volume fails to load during service bring-up or node reboot, perhaps because of an issue on the backend, there is no way to reload just that single block volume. We have to reload/restart all the block volumes present on the node just to load the one, which interrupts ongoing I/O (via this path) for every volume hosted on the node.

Solution:

Add ability to reload a single block volume without touching other block volumes in the node.

$ gluster-block help
usage:
  gluster-block [timeout <seconds>] <command> <volname[/blockname]> [<args>] [--json*]

commands:
  [...]
  reload <volname/blockname>
        reload a block device.
  [...]
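With this in place, a single block can be reloaded on its own without touching the rest; for example, using the volume/block names that appear later in this thread:

$ gluster-block reload repvol/block0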

Notes for the reviewer

TODO:

Signed-off-by: Prasanna Kumar Kalever prasanna.kalever@redhat.com

pkalever commented 5 years ago

@lxbsz please help review; note the TODO items above, I will add them soon. Thanks!

lxbsz commented 5 years ago

@pkalever

There are some build warnings:

block_version.c: In function 'glusterBlockBuildMinCaps':
block_version.c:30:3: warning: enumeration value 'RELOAD_SRV' not handled in switch [-Wswitch]
   30 |   switch (opt) {
      |   ^~~~~~
  CC       libgbrpc_la-block_modify.lo
  CC       libgbrpc_la-block_genconfig.lo
block_genconfig.c: In function 'block_gen_config_cli_1_svc':
block_genconfig.c:66:43: warning: '%s' directive output may be truncated writing up to 255 bytes into a region of size 239 [-Wformat-truncation=]
   66 |   snprintf(lun_so, 256, "/backstores/user/%s", block);
      |                                           ^~
In file included from /usr/include/stdio.h:867,
                 from ../utils/utils.h:16,
                 from ../utils/common.h:15,
                 from block_common.h:16,
                 from block_genconfig.c:12:
/usr/include/bits/stdio2.h:67:10: note: '__builtin___snprintf_chk' output between 18 and 273 bytes into a destination of size 256
   67 |   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   68 |        __bos (__s), __fmt, __va_arg_pack ());
      |        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC       libgbrpc_la-block_reload.lo

Thanks.
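The kind of change these warnings call for looks roughly like the following self-contained C sketch; the enum values besides RELOAD_SRV, the function shape, and main() are illustrative rather than the PR's actual code, and the 273-byte size comes straight from the compiler note above:

#include <stdio.h>

/* Illustrative enum: only RELOAD_SRV is named in the warning above. */
enum gbOpt { CREATE_SRV, DELETE_SRV, RELOAD_SRV };

/* -Wswitch fires when an enum value has no case; adding a RELOAD_SRV
 * case silences it and forces reload to be handled explicitly. */
static void buildMinCaps(enum gbOpt opt)
{
  switch (opt) {
  case CREATE_SRV:
  case DELETE_SRV:
    /* existing capability handling ... */
    break;
  case RELOAD_SRV:
    /* capabilities needed by the new reload command ... */
    break;
  }
}

int main(void)
{
  const char *block = "block0";

  /* "/backstores/user/" is 17 bytes and a block name can be up to 255,
   * so 17 + 255 + 1 (NUL) = 273 bytes avoids -Wformat-truncation,
   * matching the "between 18 and 273 bytes" note in the build output. */
  char lun_so[273];

  snprintf(lun_so, sizeof(lun_so), "/backstores/user/%s", block);
  printf("%s\n", lun_so);

  buildMinCaps(RELOAD_SRV);
  return 0;
}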

lxbsz commented 5 years ago

I just backed up and deleted /etc/target/saveconfig.json, then restarted node.163:

# targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:fbo ............................................................................................... [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 0]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  | o- user:zbc ............................................................................................... [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 0]
  o- loopback ......................................................................................................... [Targets: 0]
  o- vhost ............................................................................................................ [Targets: 0]
  o- xen-pvscsi ....................................................................................................... [Targets: 0]

Then from another node, node.164, I ran the reload, but it failed:

# gluster-block reload repvol/block0
FAILED ON:   192.168.195.163
SUCCESSFUL ON:   192.168.195.162 192.168.195.164
RESULT: FAIL

The logs in node.163 are:

[2019-09-16 00:58:53.264341] INFO: reload request, blockname=block0 filename=5bf22bfa-965a-48c3-8a43-f0d5018e9303 [at block_reload.c+366 :<block_reload_1_svc_st>]
[2019-09-16 00:58:53.683346] ERROR: Block 'block0' may be not loaded. [at block_svc_routines.c+111 :<blockCheckBlockLoadedStatus>]
[2019-09-16 00:58:53.686660] ERROR: Block 'block0' not loaded. [at block_svc_routines.c+162 :<blockCheckBlockLoadedStatus>]

And after I copied the backed-up saveconfig.json back to /etc/target/ and restarted the gluster-blockd and tcmu-runner services, node.163 loaded it back successfully:

# targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:fbo ............................................................................................... [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 0]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  | o- user:zbc ............................................................................................... [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 0]
  o- loopback ......................................................................................................... [Targets: 0]
  o- vhost ............................................................................................................ [Targets: 0]
  o- xen-pvscsi ....................................................................................................... [Targets: 0]
# systemctl daemon-reload; systemctl restart tcmu-runner gluster-blockd; systemctl status tcmu-runner gluster-blockd
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-09-16 09:03:03 CST; 191ms ago
     Docs: man:tcmu-runner(8)
 Main PID: 1908 (tcmu-runner)
   CGroup: /system.slice/tcmu-runner.service
           └─1908 /usr/bin/tcmu-runner

Sep 16 09:03:03 rhel2 systemd[1]: Stopping LIO Userspace-passthrough daemon...
Sep 16 09:03:03 rhel2 systemd[1]: Starting LIO Userspace-passthrough daemon...
Sep 16 09:03:03 rhel2 tcmu-runner[1908]: log file path now is '/var/log/tcmu-runner.log'
Sep 16 09:03:03 rhel2 tcmu-runner[1908]: Starting...
Sep 16 09:03:03 rhel2 systemd[1]: Started LIO Userspace-passthrough daemon.

● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/lib/systemd/system/gluster-blockd.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-09-16 09:03:03 CST; 18ms ago
 Main PID: 1952 (gluster-blockd)
   CGroup: /system.slice/gluster-blockd.service
           ├─1952 /usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO
           └─1955 /usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO

Sep 16 09:03:03 rhel2 systemd[1]: Started Gluster block storage utility.
Sep 16 09:03:03 rhel2 systemd[1]: Starting Gluster block storage utility...
# targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:fbo ............................................................................................... [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 1]
  | | o- block0 ........................... [repvol@localhost/block-store/5bf22bfa-965a-48c3-8a43-f0d5018e9303 (100.0MiB) activated]
  | |   o- alua ................................................................................................... [ALUA Groups: 3]
  | |     o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | |     o- glfs_tg_pt_gp_ano .................................................................. [ALUA state: Active/non-optimized]
  | |     o- glfs_tg_pt_gp_ao ....................................................................... [ALUA state: Active/optimized]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  | o- user:zbc ............................................................................................... [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:5bf22bfa-965a-48c3-8a43-f0d5018e9303 ................................................ [TPGs: 3]
  |   o- tpg1 ........................................................................................................... [disabled]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ................................................................................. [user/block0 (glfs_tg_pt_gp_ao)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 192.168.195.162:3260 ............................................................................................. [OK]
  |   o- tpg2 .................................................................................................. [gen-acls, no-auth]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ................................................................................ [user/block0 (glfs_tg_pt_gp_ano)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 192.168.195.163:3260 ............................................................................................. [OK]
  |   o- tpg3 ........................................................................................................... [disabled]
  |     o- acls .......................................................................................................... [ACLs: 0]
  |     o- luns .......................................................................................................... [LUNs: 1]
  |     | o- lun0 ................................................................................ [user/block0 (glfs_tg_pt_gp_ano)]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 192.168.195.164:3260 ............................................................................................. [OK]
  o- loopback ......................................................................................................... [Targets: 0]
  o- vhost ............................................................................................................ [Targets: 0]
  o- xen-pvscsi ....................................................................................................... [Targets: 0]
# 

Did I miss something here?

pkalever commented 4 years ago

Did I miss something here?

I will check on this; for now it looks like a targetcli bug to me. I will fix it if I hit the same issue in my testing today. Thanks!

lxbsz commented 4 years ago

@pkalever

Tested it again and found one new problem.

I have 3 nodes, named rhel1, rhel2, and rhel3.

On rhel3 I have enabled targetclid.service, and on that node the reload always succeeds, no matter which node runs the 'gluster-block reload ..' command.

That means that with the targetclid service running, reload works.

But if I disable the targetclid service at boot and then start it manually after the node comes up, the reload mostly does not work for me.

That means we must enable and start the targetclid service when booting the node.

Earlier I assumed that the targetclid service shouldn't affect the reload feature; is that right?

Thanks

pkalever commented 4 years ago

@pkalever

[...]

But if I disable the targetclid service at boot and then start it manually after the node comes up, the reload mostly does not work for me.

That means we must enable and start the targetclid service when booting the node.

Earlier I assumed that the targetclid service shouldn't affect the reload feature; is that right?

I don't completely understand the problem you are seeing, but enabling/disabling the targetclid service shouldn't be a problem.

gluster-blockd.service already has "Wants=targetclid.service" since PR#203; if the daemon is started, gluster-block will use it, otherwise it won't.

Can you detail the steps with which you are seeing the issue? Please note I have just updated PR#203; please test this PR with the latest PR#203.

Thanks!
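For reference, the dependency described above can be checked directly on a node; the unit path matches the systemctl output earlier in this thread, and the Wants= line is the one PR#203 adds:

# grep Wants /usr/lib/systemd/system/gluster-blockd.service
Wants=targetclid.service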

lxbsz commented 4 years ago

@pkalever

[...]

But if I disable the targetclid service at boot and then start it manually after the node comes up, the reload mostly does not work for me. That means we must enable and start the targetclid service when booting the node. Earlier I assumed that the targetclid service shouldn't affect the reload feature; is that right?

I don't completely understand the problem you are seeing, but enabling/disabling the targetclid service shouldn't be a problem.

gluster-blockd.service already has "Wants=targetclid.service" since PR#203; if the daemon is started, gluster-block will use it, otherwise it won't.

Can you detail the steps with which you are seeing the issue? Please note I have just updated PR#203; please test this PR with the latest PR#203.

Thanks!

I meant that targetclid is a must here for the reload to work; otherwise it fails.

From my test:
1. systemctl enable targetclid
2. cp /etc/target/saveconfig.json /tmp/ && rm /etc/target/saveconfig.json
3. reboot the node
4. run gluster-block reload, and it works for me

But if:
1. systemctl disable targetclid
2. cp /etc/target/saveconfig.json /tmp/ && rm /etc/target/saveconfig.json
3. reboot the node
4. systemctl start targetclid
5. run gluster-block reload, and mostly it does not work well

Without step 4 it is 100% guaranteed not to work for me; with it, it sometimes works (the failing sequence is condensed into the sketch below).

BTW, if a user does not want the targetclid service mode, the reload won't work at all.

Thanks. BRs
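Condensed into one place, the failing sequence is (a sketch; the steps and names are exactly the numbered ones above):

# on the affected node:
systemctl disable targetclid
cp /etc/target/saveconfig.json /tmp/ && rm /etc/target/saveconfig.json
reboot
# after the node is back up:
systemctl start targetclid            # step 4: without this it never works
# from any node:
gluster-block reload repvol/block0    # mostly fails even with step 4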

pkalever commented 4 years ago

@lxbsz thanks for the detailed steps. Have you tested it with the latest PR#203?

If it still doesn't work for you, please let me know; also, let me know at which step you started gluster-blockd in both cases.

Thanks!

lxbsz commented 4 years ago

@lxbsz thanks for the detailed steps. Have you tested it with the latest PR#203?

If it still doesn't work for you, please let me know; also, let me know at which step you started gluster-blockd in both cases.

Two nodes work now, but one still does not:

# gluster-block reload repvol/block0
FAILED ON:   192.168.195.163
SUCCESSFUL ON:   192.168.195.162 192.168.195.164
RESULT: FAIL

I am not sure what I have missed here; the code is the same on all nodes.

lxbsz commented 4 years ago

@lxbsz thanks for the detailed steps. Have you tested it with the latest PR#203?

If it still doesn't work for you, please let me know; also, let me know at which step you started gluster-blockd in both cases.

gluster-blockd is started by systemd when the node boots, not manually.

lxbsz commented 4 years ago

@lxbsz thanks for the detailed steps. Have you tested it with the latest PR#203? If it still doesn't work for you, please let me know; also, let me know at which step you started gluster-blockd in both cases.

Two nodes work now, but one still does not:

# gluster-block reload repvol/block0
FAILED ON:   192.168.195.163
SUCCESSFUL ON:   192.168.195.162 192.168.195.164
RESULT: FAIL

I am not sure what I have missed here; the code is the same on all nodes.

Sep 26 16:01:55 rhel2 gluster-blockd[1416]: restore_from_file() takes at most 4 arguments (5 given)

I hit this log on the 192.168.195.163 node.

pkalever commented 4 years ago

Sep 26 16:01:55 rhel2 gluster-blockd[1416]: restore_from_file() takes at most 4 arguments (5 given)

@lxbsz This indicates your targetcli/rtslib is not updated on 192.168.195.163
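A quick way to confirm which rtslib a node actually picks up (a sketch; it assumes the Python 2 environment shown in the transcripts below, and that restore_from_file() is a method of RTSRoot, consistent with the error above):

# python -c "import rtslib_fb; print(rtslib_fb.__file__)"
# python -c "from inspect import getargspec; from rtslib_fb import RTSRoot; print(getargspec(RTSRoot.restore_from_file))"

An outdated install reports a signature accepting at most 4 arguments, matching the error; the updated rtslib accepts the extra argument that gluster-blockd passes.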

lxbsz commented 4 years ago

Sep 26 16:01:55 rhel2 gluster-blockd[1416]: restore_from_file() takes at most 4 arguments (5 given)

@lxbsz This indicates your targetcli/rtslib is not updated on 192.168.195.163

rtslib-fb]# git log
commit 48d4437e42666df6348564ee21d03f29b8a8c48b
Author: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Date:   Wed Sep 25 14:56:23 2019 +0530

    restoreconfig: fix skipping of targets [re]loading

    Problem:
    'targetcli restoreconfig [savefile] [target=...]' works for first target only
    in the saveconfig.json and fails/skips for others silently.

    Solution:
    Just an indentation fix, yet very severe.

    Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>

commit 2b160b754d48d5dfdfe1d41089d4e9af24ba3b29
Author: Maurizio Lombardi <mlombard@redhat.com>
Date:   Mon Aug 26 12:06:22 2019 +0200

    version 2.1.70
rtslib-fb]# rm /usr/lib/python2.7/site-packages/rtslib_fb-2.1.70-py2.7.egg 
rm: remove regular file ‘/usr/lib/python2.7/site-packages/rtslib_fb-2.1.70-py2.7.egg’? y
rtslib-fb]# ./setup.py install
running install
running bdist_egg
[...]
targetcli-fb]# git log
commit 2a71cece17fcdd40bc022ed1fc009e6f5b9415e8
Author: Maurizio Lombardi <mlombard@redhat.com>
Date:   Mon Aug 26 12:10:40 2019 +0200

    version 2.1.50

commit 2a94314b7b131141fba885864597b3fb20af1f27
Merge: a9771b1 26b7df6
Author: Maurizio Lombardi <mlombard@redhat.com>
Date:   Mon Aug 26 09:51:26 2019 +0200

    Merge pull request #144 from pkalever/reload-single-so-tg

    [targetcli] restoreconfig: add ability to restore/reload single target or storage_object
targetcli-fb]# rm /usr/lib/python2.7/site-packages/targetcli_fb-2.1.50-py2.7.egg 
rm: remove regular file ‘/usr/lib/python2.7/site-packages/targetcli_fb-2.1.50-py2.7.egg’? y
targetcli-fb]# ./setup.py install
running install
running bdist_egg
[...]
● targetclid.service - Targetcli daemon
   Loaded: loaded (/usr/lib/systemd/system/targetclid.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-09-26 15:53:08 CST; 1h 3min ago
 Main PID: 1011 (targetclid)
   CGroup: /system.slice/targetclid.service
           └─1011 /usr/bin/python /usr/bin/targetclid

Sep 26 15:53:08 rhel2 systemd[1]: Started Targetcli daemon.
Sep 26 15:53:08 rhel2 systemd[1]: Starting Targetcli daemon...

All the 3 nodes are the same as above.

From the logs it seems the install is not correct; can you see the problem here?
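For reference, the single-object restore added by the commits above is driven like this; the savefile path is the default seen earlier in the thread, the target= syntax comes from the restoreconfig commit message, and storage_object= follows the merged PR's title, so treat the exact spellings as assumptions:

# targetcli restoreconfig /etc/target/saveconfig.json target=iqn.2016-12.org.gluster-block:5bf22bfa-965a-48c3-8a43-f0d5018e9303
# targetcli restoreconfig /etc/target/saveconfig.json storage_object=block0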

pkalever commented 4 years ago

@lxbsz are you expecting anything from me? I have tested this and it works well for me. Please fix your node; why not pick a fresh node? (Check if you have any rpms installed, etc.)

lxbsz commented 4 years ago

@lxbsz are you expecting anything from me? I have tested this and it works well for me. Please fix your node; why not pick a fresh node? (Check if you have any rpms installed, etc.)

Just tried a new node and it works.

Thanks.

pkalever commented 4 years ago

@lxbsz thanks for confirming. Please consider adding your test results to the rtslib fix https://github.com/open-iscsi/rtslib-fb/pull/153; Maurizio will be waiting on you.

lxbsz commented 4 years ago

@lxbsz thanks for confirming. Please consider adding your test results to the rtslib fix open-iscsi/rtslib-fb#153; Maurizio will be waiting on you.

Sure.

pkalever commented 4 years ago

@lxbsz updated with the capabilities and version-check support patches, please check. Thanks!

lxbsz commented 4 years ago

@pkalever This looks good to me. Thanks

pkalever commented 4 years ago

@lxbsz I have added man page and other doc changes, and also fixed the missing force option for reload.

Requesting a detailed review. Thanks!

pkalever commented 4 years ago

@lxbsz please take a look.

pkalever commented 4 years ago

@lxbsz updated the tags; merged now. Thanks!