gluster / gluster-block

A framework for gluster block storage
GNU General Public License v2.0

Gluster-Block and iSCSI issue with LUN #286

Closed jonahbohlmann closed 3 years ago

jonahbohlmann commented 3 years ago

Hello,

I am new to gluster and to gluster-block, and I have no idea whether this is a bug or just a problem on my side. I will try to explain with all the needed details, and maybe someone can tell me whether this is the right place or whether I should move it somewhere else.


Introduction

I have two gluster clusters: one for production and one as a test environment. The production instance has been working fine for months; the test environment developed an issue after three months. Just as a note: the production environment does not currently have the same problem. Below I am only talking about the test environment, which is set up identically to production.

I use iSCSI to attach the block volume to my software (which needs iSCSI as its storage backend).
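On the client side I attach the target with the standard open-iscsi tooling, roughly like this (portal IP and IQN taken from the test environment shown below):

$ iscsiadm -m discovery -t sendtargets -p 10.0.33.154
$ iscsiadm -m node -T iqn.2016-12.org.gluster-block:e2522647-d101-4d74-8978-ed24fa5813c6 --login
$ multipath -ll     # with multipathd configured, the three paths appear as one device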

Information

The gluster cluster has three nodes.

Versions:

CentOS Linux release 7.8.2003

glusterfs 8.1
gluster-block (0.3)

Gluster Peer Status (from first server):

[STAGING] [17:50:48 root@fra1-glusterfs-m01]{~}>gluster peer status
Number of Peers: 2
Hostname: fra1-glusterfs-m02.staging.domain.network
Uuid: 4d6efc81-0e0b-4606-89f2-88ca9276a72c
State: Peer in Cluster (Connected)

Hostname: fra1-glusterfs-m03.staging.domain.network
Uuid: a5ae2904-e1b6-4e67-8159-bacd1069095b
State: Peer in Cluster (Connected)

Gluster volume info:

[STAGING] [17:52:24 root@fra1-glusterfs-m01]{~}>gluster volume info rdxarchive_2020
Volume Name: rdxarchive_2020
Type: Replicate
Volume ID: 773a09e1-e4cc-4cc6-89c9-dcf3ce2d805e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: fra1-glusterfs-m01.staging.domain.network:/data/disk-data-1/rdxarchive_2020
Brick2: fra1-glusterfs-m02.staging.domain.network:/data/disk-data-1/rdxarchive_2020
Brick3: fra1-glusterfs-m03.staging.domain.network:/data/disk-data-1/rdxarchive_2020
Options Reconfigured:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
performance.strict-o-direct: on
performance.client-io-threads: on
performance.io-thread-count: 32
performance.high-prio-threads: 32
performance.normal-prio-threads: 32
performance.low-prio-threads: 32
performance.least-prio-threads: 4
client.event-threads: 8
server.event-threads: 8
network.remote-dio: disable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
features.shard-block-size: 64MB
user.cifs: off
server.allow-insecure: on
cluster.choose-local: off

Gluster-block info:

[STAGING] [17:53:57 root@fra1-glusterfs-m01]{~}>gluster-block info rdxarchive_2020/block-volume
NAME: block-volume
VOLUME: rdxarchive_2020
GBID: e2522647-d101-4d74-8978-ed24fa5813c6
SIZE: 10.0 GiB
HA: 3
PASSWORD:
EXPORTED NODE(S): 10.0.33.154 10.0.33.155 10.0.33.156
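
For completeness, the block volume was created with a command along these lines (reconstructed from the info output above; the exact options may have differed):

$ gluster-block create rdxarchive_2020/block-volume ha 3 10.0.33.154,10.0.33.155,10.0.33.156 10GiB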

Until this point I think everything is fine.

The issue

After some time of usage, I got an alert that the storage was no longer available. The gluster cluster itself is healthy, and my monitoring did not record any network issues or anything similar. I rebooted the nodes, restarted the services, and checked the logs, but found no information pointing to a real problem.

The vendor of the software that consumes the iSCSI target told me that no LUN is visible from their side. So I checked with targetcli ls:

[STAGING] [18:03:38 root@fra1-glusterfs-m01]{~}>targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 0]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:e2522647-d101-4d74-8978-ed24fa5813c6 ................................................ [TPGs: 3]
  |   o- tpg1 .................................................................................................. [gen-acls, no-auth]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 0]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.33.154:3260 ................................................................................................. [OK]
  |   o- tpg2 ........................................................................................................... [disabled]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 0]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.33.155:3260 ................................................................................................. [OK]
  |   o- tpg3 ........................................................................................................... [disabled]
  |     o- acls .......................................................................................................... [ACLs: 0]
  |     o- luns .......................................................................................................... [LUNs: 0]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 10.0.33.156:3260 ................................................................................................. [OK]
  o- loopback ......................................................................................................... [Targets: 0]

On production, I have this view:

o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 1]
  | | o- block-volume ................. [rdxarchive@10.0.48.98/block-store/fafeff57-56fd-455c-bfe9-9868109fa8a5 (50.0GiB) activated]
  | |   o- alua ................................................................................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:fafeff57-56fd-455c-bfe9-9868109fa8a5 ................................................ [TPGs: 3]
  |   o- tpg1 .................................................................................................. [gen-acls, no-auth]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ........................................................................... [user/block-volume (default_tg_pt_gp)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.48.98:3260 .................................................................................................. [OK]
  |   o- tpg2 ........................................................................................................... [disabled]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ........................................................................... [user/block-volume (default_tg_pt_gp)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.48.99:3260 .................................................................................................. [OK]
  |   o- tpg3 ........................................................................................................... [disabled]
  |     o- acls .......................................................................................................... [ACLs: 0]
  |     o- luns .......................................................................................................... [LUNs: 1]
  |     | o- lun0 ........................................................................... [user/block-volume (default_tg_pt_gp)]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 10.0.48.100:3260 ................................................................................................. [OK]
  o- loopback ......................................................................................................... [Targets: 0]

So, as far as I can see, the "lun0" entries are missing in the test environment.

Any idea what the problem could be and how I can solve it? What else can I check? Is any information missing from my post?
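For completeness, these are the checks I can run on each node (standard systemd and gluster-block commands, nothing exotic assumed):

$ systemctl status tcmu-runner gluster-blockd gluster-block-target
$ gluster-block list rdxarchive_2020
$ gluster-block info rdxarchive_2020/block-volume
$ journalctl -u gluster-blockd -u tcmu-runner --since "1 hour ago"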

I hope someone has an idea.

Thank you!

Edit: It is also possible to apply to my Upwork job; I can pay you for support: https://www.upwork.com/jobs/~01867a80c9701e4070

jonahbohlmann commented 3 years ago

I was able to work around the issue.

I copied a file from /etc/target/backup/ back to /etc/target/saveconfig.json.
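Concretely, the restore looked roughly like this (the backup file name below is just an example; targetcli keeps timestamped copies under /etc/target/backup/):

$ systemctl stop gluster-blockd gluster-block-target tcmu-runner
$ cp /etc/target/backup/saveconfig-20201001-10:00:00.json /etc/target/saveconfig.json   # example name
$ systemctl start gluster-blockd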

The current "saveconfig.json" contained no "luns" entry. I was able to recover it from the backup config file, which still had the entry:

"luns": [
            {
              "alias": "288488321c",
              "alua_tg_pt_gp_name": "default_tg_pt_gp",
              "index": 0,
              "storage_object": "/backstores/user/block-volume"
            }
          ],

My main question remains: why were the LUNs removed from the config? Can this happen again? In the logs I saw that saveconfig.json is somehow generated. Where does the service get the information about the LUNs from?

jonahbohlmann commented 3 years ago

I am not a C developer, so it is hard for me to understand the code. But I think the place where the config is generated is this part: https://github.com/gluster/gluster-block/blob/4f994e3cfa440d9dae1317a1d2b03ed994a03ef4/rpc/block_genconfig.c#L65

But I don't understand where the data is coming from and why it is missing in this environment.

Also, I saw that the config is sometimes regenerated. So restoring a working version is only a "for the moment" workaround and may not be a general solution for my issue.

jonahbohlmann commented 3 years ago

No ideas? I now quite often have the issue that the "luns" part is removed from saveconfig.json.

I now have to restore saveconfig.json every time I restart the "gluster-blockd" service. Why? What is happening there?

Please support!

Thanks.

pkalever commented 3 years ago

@jonahbohlmann the project is considered to be in maintenance-only status. I will consider adding a note about this to the README soon. Please also expect slow replies to issues.

> [STAGING] [17:52:24 root@fra1-glusterfs-m01]{~}>gluster volume info rdxarchive_2020

We recommend a replica 3 volume with the group profile applied to it. Helpful command: # gluster vol set <volname> group gluster-block
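The group profile ships with glusterd; you can see exactly which options it would set (standard CentOS path, a sketch):

$ cat /var/lib/glusterd/groups/gluster-block
# after applying it (full steps at the end of this comment), verify the options landed:
$ gluster vol info rdxarchive_2020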

> My main question remains: why were the LUNs removed from the config? Can this happen again? In the logs I saw that saveconfig.json is somehow generated. Where does the service get the information about the LUNs from?

You can think of this as the storage objects going missing; those are the backstores under user:glfs in the targetcli ls output. You will notice missing storage objects after node reboots or restarts of the gluster-blockd service: if there are any issues with the backend block-hosting glusterfs volumes (BHVs), tcmu-runner fails to load them.

Please check your tcmu-runner.log, gluster-blockd.log and the other logs in the gluster-block log directory.
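For example (log locations are the usual defaults and may differ on your setup):

$ ls -l /var/log/gluster-block/
$ tail -n 100 /var/log/gluster-block/gluster-blockd.log
# tcmu-runner may log to its own file or into the gluster-block log directory, depending on configuration
$ tail -n 100 /var/log/tcmu-runner.log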

The missing configuration in /etc/target/saveconfig.json is because of a previous bug: if there were new create/delete requests while a few block volumes were unloaded, you might hit it. We highly recommend upgrading to 0.5 or 0.5.1, which fixes this issue.
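If there are no packages for your distro, building from source is straightforward; a rough sketch following the project README (install the usual build dependencies first, tag name assumed):

$ git clone https://github.com/gluster/gluster-block.git && cd gluster-block
$ git checkout v0.5.1        # tag name assumed
$ ./autogen.sh && ./configure
$ make -j install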

> I am not a C developer, so it is hard for me to understand the code. But I think the place where the config is generated is this part:

Yes, there is a way to generate the missing config per node:

$ systemctl stop tcmu-runner gluster-blockd gluster-block-target
$ mv /etc/target/saveconfig.json /home/<backup>
$ gluster-block genconfig <BHV-comma-separated-list> enable-tpg <local-ip> | tee /etc/target/saveconfig.json
$ systemctl start gluster-blockd
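
Once gluster-blockd is back up, the storage object and LUN entries should reappear; a quick way to confirm:

$ targetcli ls
$ gluster-block info <BHV>/<block-volume-name>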

> But I don't understand where the data is coming from and why it is missing in this environment.

The data comes from the BHVs (block-hosting volumes); there is a /block-meta directory in every BHV where the per-block-volume metadata is journaled.
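If you want to inspect it, you can mount the BHV with the glusterfs fuse client and look at that directory (mount point is just an example, file layout from memory):

$ mount -t glusterfs fra1-glusterfs-m01.staging.domain.network:/rdxarchive_2020 /mnt/bhv
$ ls /mnt/bhv/block-meta/
$ cat /mnt/bhv/block-meta/block-volume      # per-block-volume metadata journal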

> Also, I saw that the config is sometimes regenerated. So restoring a working version is only a "for the moment" workaround and may not be a general solution for my issue.

Can you paste the logs?

> No ideas? I now quite often have the issue that the "luns" part is removed from saveconfig.json.

> I now have to restore saveconfig.json every time I restart the "gluster-blockd" service. Why? What is happening there?

I suggest checking the logs under /var/log/.../gluster-block/.

Before anything else:

  1. Stop the services and apply the gluster-block group profile on your BHV.
     On all the server nodes:
     $ systemctl stop tcmu-runner gluster-blockd gluster-block-target
     On one server node:
     $ gluster vol stop rdxarchive_2020
     $ gluster vol set rdxarchive_2020 group gluster-block
     $ gluster vol start rdxarchive_2020
  2. Upgrade your gluster-block to 0.5.x.

Good Luck!

jonahbohlmann commented 3 years ago

Hello @pkalever,

thank you so much for your support and the detailed explanation. I now have a much better overview of how everything works.

I was able to build the RPM package for version 0.5.1 and successfully installed it on our test environment and on production. So far I see no more issues with missing LUNs, so yes, the version upgrade and/or applying the group profile fixed the issue.
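For anyone else on CentOS 7, the rough recipe I used (from memory; whether the generated tarball carries the spec file may depend on the version):

# from a configured source tree of the 0.5.1 tag
$ make dist
$ rpmbuild -ta gluster-block-0.5.1.tar.gz
$ yum localinstall ~/rpmbuild/RPMS/x86_64/gluster-block-0.5.1-*.rpm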

Maybe you could attach the RPM files to the latest release; I think that would be very helpful. I was not able to find any repository with the latest version for CentOS 7.

Again, thank you.

Best Regards!

pkalever commented 3 years ago

Assuming the issues are fixed, I am closing this now. Please feel free to open a new issue as needed.

Thanks!