gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org

Gluster 11.0 2-way replication all bricks down + Wrong Replication Type (Bug) #4114

Open madmax01 opened 1 year ago

madmax01 commented 1 year ago

Description of problem: On 13.04 I got an "all bricks down" issue. The glusterd services are up (and green, no red alert). The network is up and responsive. Firewalld and SELinux are not active.

The exact command to reproduce the issue: gluster volume status (to check brick status); both bricks show "N" in the Online column.

The full output of the command that failed:

Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sds01:/vdo/vdo0/gv0/bricks/brick1     N/A       N/A        N       N/A
Brick sds02:/vdo/vdo0/gv0/bricks/brick2     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       3373575
NFS Server on localhost                     2049      0          Y       3373568
Self-heal Daemon on sds02                   N/A       N/A        Y       3374078
NFS Server on sds02                         2049      0          Y       3374063

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Expected results: Y (both bricks online)
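
A minimal recovery sketch (an editor's addition, not from the reporter), assuming glusterd itself is healthy and using the volume name gv0 from the output above: a forced volume start is the usual way to respawn dead brick processes, and the brick log (typically named after the brick path, with slashes replaced by dashes) usually explains a failed start.

# Respawn the dead brick processes; safe while the volume is already started
gluster volume start gv0 force

# Re-check the Online column for both bricks
gluster volume status gv0

# If a brick still will not start, check its log on the affected node
less /var/log/glusterfs/bricks/vdo-vdo0-gv0-bricks-brick1.log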

Mandatory info:

- The output of the gluster volume info command:

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: ec8ea48e-b120-4309-adee-72aebcc89071
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sds01:/vdo/vdo0/gv0/bricks/brick1
Brick2: sds02:/vdo/vdo0/gv0/bricks/brick2
Options Reconfigured:
cluster.lookup-optimize: off
cluster.server-quorum-type: server
cluster.quorum-count: 1
cluster.quorum-type: fixed
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 10
network.ping-timeout: 10
cluster.shd-wait-qlength: 10000
cluster.locking-scheme: granular
performance.enable-least-priority: no
performance.low-prio-threads: 32
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.quick-read: off
performance.read-ahead: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
network.compression.compression-level: 0
cluster.shd-max-threads: 8
nfs.event-threads: 4
server.event-threads: 4
client.event-threads: 4
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
cluster.favorite-child-policy: mtime
nfs.rpc-auth-allow: 10.2.1.0/24
cluster.granular-entry-heal: enable
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: off
performance.client-io-threads: on
cluster.server-quorum-ratio: 1%
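
For context, a sketch (an assumption about how the cluster was configured, derived only from the options listed above) of the quorum settings in play: with cluster.quorum-type fixed and cluster.quorum-count 1, the client keeps writing as long as one replica is reachable, and the 1% server-quorum ratio effectively disables server-side quorum on a 2-node pool.

# Client-side quorum: writes allowed while at least one brick is up
gluster volume set gv0 cluster.quorum-type fixed
gluster volume set gv0 cluster.quorum-count 1

# Server-side quorum; the ratio is a cluster-wide option
gluster volume set gv0 cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 1%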

- The output of the gluster volume status command:

(identical to the gluster volume status output shown above)

- The output of the gluster volume heal command: both bricks are down (not able to fetch the volfile from glusterd).
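
For reference, a sketch of the heal query that failed here (volume name taken from above); the heal client fetches the volfile from glusterd, so it cannot run while that fetch fails.

# List entries pending self-heal; requires the volfile from glusterd
gluster volume heal gv0 info

# Counts only
gluster volume heal gv0 info summary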

- Provide logs present on following locations of client and server nodes: https://we.tl/t-vhRe9vteni

- Is there any crash? Provide the backtrace and coredump: here is the coredump (I don't know how to get the backtrace); the system is AlmaLinux 8.7, a simple downstream of RHEL: https://we.tl/t-ouj08o13Xm
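
Since the reporter is unsure how to produce a backtrace: a minimal sketch using gdb against the coredump (the binary name and core path are assumptions; a brick crash is normally in /usr/sbin/glusterfsd, a glusterd crash in /usr/sbin/glusterd).

# Install debug symbols so frames resolve to function names (RHEL family)
dnf debuginfo-install glusterfs

# Open the core against the crashed binary and dump all threads
gdb /usr/sbin/glusterfsd /path/to/coredump
(gdb) bt
(gdb) thread apply all bt full

# On systems using systemd-coredump, this shortcut works too
coredumpctl gdb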

Additional info:

- The operating system / glusterfs version: AlmaLinux release 8.7 (Stone Smilodon) / 11.0

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

madmax01 commented 1 year ago

Just as info: this crash started with 10.2 (10.1 was fine and never crashed), and every version after that, including 11, also crashes. Some specific change made after 10.1 makes the bricks/glusterd unhappy after a while (10.2, 10.3, and 11); 10.4 was not tested.

madmax01 commented 1 year ago

Also strange: the volume was configured as 2-way replication, but the command shows Distributed-Replicate, which is another fault.
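
For comparison, a sketch of how a plain 2-way replica of this shape is created (brick paths copied from the volume info above); a single replica set is expected to report Type: Replicate, which matches the "Number of Bricks: 1 x 2 = 2" shown above, so the Distributed-Replicate label is the part that looks wrong.

# One replica set across two nodes (gluster warns that replica 2 is split-brain prone)
gluster volume create gv0 replica 2 \
    sds01:/vdo/vdo0/gv0/bricks/brick1 \
    sds02:/vdo/vdo0/gv0/bricks/brick2
gluster volume start gv0

# Expected in 'gluster volume info' for a single replica set:
#   Type: Replicate
#   Number of Bricks: 1 x 2 = 2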

madmax01 commented 1 year ago

Any news on the crash, and on the Distributed-Replicate type where it should be Replicate? Another user already opened a dedicated thread for the second issue, and there has been no answer on that one either.

pitastrudl commented 1 year ago

Same issue for me. I have downgraded to Gluster 10.4 on Ubuntu 20.04 for now.
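
For anyone applying the same workaround, a sketch of holding the 10.4 series on Ubuntu 20.04 with apt pinning (the exact version string depends on which repository provides the packages and is an assumption here).

# /etc/apt/preferences.d/glusterfs -- keep apt on the 10.4 series
Package: glusterfs-*
Pin: version 10.4*
Pin-Priority: 1001

# Then downgrade and verify
apt-get update
apt-get install --allow-downgrades glusterfs-server glusterfs-client
glusterfs --version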