gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Glusterfs v10.4 'No space left on device' yet we have plenty of space all nodes #4135

Open - brandonshoemakerTH opened this issue 1 year ago

brandonshoemakerTH commented 1 year ago

Description of problem: We are seeing an 'error=No space left on device' issue on Glusterfs 10.4 on AlmaLinux 8 (4.18.0-425.19.2.el8_7.x86_64) even though we currently have 61 TB available on the volume and each of the 12 nodes has 2-8 TB free, so we are nowhere near out of space on any node.

example log msg from /var/log/glusterfs/home-volbackups.log

[2023-05-06 23:47:38.645324 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:670:client4_0_writev_cbk] 0-volbackups-client-23: remote operation failed. [{errno=28}, {error=No space left on device}]
[2023-05-06 23:47:38.645376 +0000] W [fuse-bridge.c:1970:fuse_err_cbk] 0-glusterfs-fuse: 980901423: FLUSH() ERR => -1 (No space left on device)

The exact command to reproduce the issue: We have used vsftpd and glusterfs for around 8 years for ftp uploads of backup files and around 3 years for nfs uploads of backup files. Each glusterfs node has a single brick, locally mounts the single distributed volume as a glusterfs client, and receives ftp > vsftpd > glusterfs backup files to the volume each weekend. After about 24 hours of ftp uploads the 'no space' error starts appearing in the logs and then writes start failing. However, we have plenty of space on all nodes and we are using the 'cluster.min-free-disk: 1GB' volume setting. If we reboot all the glusterfs nodes the problem goes away for a while, but then returns after ~12-24 hours.

The full output of the command that failed: Here is an example ftp backup file upload that failed this weekend:

put: 125ac755-05b1-4d48-9a7d-96e7cd423700-vda.bak: Access failed: 553 Could not create file. (125ac755-05b1-4d48-9a7d-96e7cd423700-vda.qcow2)

Here are some example nfs backup file writes that failed last weekend:

/bin/cp: failed to close '/backups/instance-00016239.xml': No space left on device
/bin/cp: failed to close '/backups/instance-00016221.xml': No space left on device
/bin/cp: failed to close '/backups/instance-00016248.xml': No space left on device
/bin/cp: failed to close '/backups/instance-0001625a.xml': No space left on device
qemu-img: error while writing sector 19931136: No space left on device
qemu-img: Failed to flush the L2 table cache: No space left on device
qemu-img: Failed to flush the refcount block cache: No space left on device
qemu-img: /backups/2699ee2f-92b8-4804-a7c7-1dc4e2abed29-vda.qcow2: error while converting qcow2: Could not close the new file: No space left on device
/bin/cp: failed to close '/backups/73fa3986-f450-4b36-b7d4-dcbdcd494562-instance-0001609e-disk.config': No space left on device
/bin/cp: failed to close '/backups/instance-00016104.xml': No space left on device
/bin/cp: failed to close '/backups/5c82fbdb-2be7-45fe-871d-604453868edc-instance-000160f2-disk.config': No space left on device
/bin/cp: failed to close '/backups/24acc824-94d5-4026-9abe-072a1b257cc0-instance-00016119-disk.info': No space left on device
/bin/cp: failed to close '/backups/instance-0001611f.xml': No space left on device
/bin/cp: failed to close '/backups/instance-0001613d.xml': No space left on device

Expected results: It is expected for ftp and nfs upload writes to succeed as they have in the past.

Mandatory info:

- The output of the gluster volume info command:

[root@nybaknode1 ~]# gluster volume info volbackups

Volume Name: volbackups
Type: Distribute
Volume ID: cd40794d-ab74-4706-a0bc-3e95bb8c63a2
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: nybaknode9.domain.net:/lvbackups/brick
Brick2: nybaknode11.domain.net:/lvbackups/brick
Brick3: nybaknode2.domain.net:/lvbackups/brick
Brick4: nybaknode3.domain.net:/lvbackups/brick
Brick5: nybaknode4.domain.net:/lvbackups/brick
Brick6: nybaknode12.domain.net:/lvbackups/brick
Brick7: nybaknode5.domain.net:/lvbackups/brick
Brick8: nybaknode6.domain.net:/lvbackups/brick
Brick9: nybaknode7.domain.net:/lvbackups/brick
Brick10: nybaknode8.domain.net:/lvbackups/brick
Brick11: nybaknode10.domain.net:/lvbackups/brick
Brick12: nybaknode1.domain.net:/lvbackups/brick
Options Reconfigured:
performance.cache-size: 256MB
server.event-threads: 16
performance.io-thread-count: 32
performance.client-io-threads: on
client.event-threads: 16
diagnostics.brick-sys-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.cache-max-file-size: 2MB
transport.address-family: inet
nfs.disable: on
cluster.min-free-disk: 1GB
[root@nybaknode1 ~]#

- The output of the gluster volume status command:

[root@nybaknode1 ~]# gluster volume status volbackups
Status of volume: volbackups
Gluster process                                TCP Port  RDMA Port  Online  Pid
Brick nybaknode9.domain.net:/lvbackups/brick   59026     0          Y       1986
Brick nybaknode11.domain.net:/lvbackups/brick  60172     0          Y       2033
Brick nybaknode2.domain.net:/lvbackups/brick   58067     0          Y       1579
Brick nybaknode3.domain.net:/lvbackups/brick   58210     0          Y       1603
Brick nybaknode4.domain.net:/lvbackups/brick   52719     0          Y       1681
Brick nybaknode12.domain.net:/lvbackups/brick  52193     0          Y       1895
Brick nybaknode5.domain.net:/lvbackups/brick   53655     0          Y       1667
Brick nybaknode6.domain.net:/lvbackups/brick   56614     0          Y       1591
Brick nybaknode7.domain.net:/lvbackups/brick   49492     0          Y       1719
Brick nybaknode8.domain.net:/lvbackups/brick   51497     0          Y       1701
Brick nybaknode10.domain.net:/lvbackups/brick  49787     0          Y       1878
Brick nybaknode1.domain.net:/lvbackups/brick   52392     0          Y       1781

Task Status of Volume volbackups

Task   : Rebalance
ID     : 1ea52278-ea1b-4d7e-857a-fe2ee1dc5420
Status : completed

[root@nybaknode1 ~]#

- The output of the gluster volume heal command:

Not relevant. We are using a plain distributed volume with no replicas.

- The output of the gluster volume status detail command:

[root@nybaknode1 ~]# gluster volume status volbackups detail
Status of volume: volbackups

Brick : Brick nybaknode9.domain.net:/lvbackups/brick
TCP Port : 59026
RDMA Port : 0
Online : Y
Pid : 1986
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 4.6TB
Total Disk Space : 29.0TB
Inode Count : 3108974976
Free Inodes : 3108903409

Brick : Brick nybaknode11.domain.net:/lvbackups/brick
TCP Port : 60172
RDMA Port : 0
Online : Y
Pid : 2033
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 8.2TB
Total Disk Space : 43.5TB
Inode Count : 4672138432
Free Inodes : 4672063970

Brick : Brick nybaknode2.domain.net:/lvbackups/brick
TCP Port : 58067
RDMA Port : 0
Online : Y
Pid : 1579
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 5.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849261

Brick : Brick nybaknode3.domain.net:/lvbackups/brick
TCP Port : 58210
RDMA Port : 0
Online : Y
Pid : 1603
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 4.6TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849248

Brick : Brick nybaknode4.domain.net:/lvbackups/brick
TCP Port : 52719
RDMA Port : 0
Online : Y
Pid : 1681
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 5.0TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108848785

Brick : Brick nybaknode12.domain.net:/lvbackups/brick
TCP Port : 52193
RDMA Port : 0
Online : Y
Pid : 1895
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 7.5TB
Total Disk Space : 43.5TB
Inode Count : 4671718976
Free Inodes : 4671644748

Brick : Brick nybaknode5.domain.net:/lvbackups/brick
TCP Port : 53655
RDMA Port : 0
Online : Y
Pid : 1667
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 3.3TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849458

Brick : Brick nybaknode6.domain.net:/lvbackups/brick
TCP Port : 56614
RDMA Port : 0
Online : Y
Pid : 1591
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota
Inode Size : 512
Disk Space Free : 5.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849533

Brick : Brick nybaknode7.domain.net:/lvbackups/brick
TCP Port : 49492
RDMA Port : 0
Online : Y
Pid : 1719
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 2.4TB
Total Disk Space : 14.4TB
Inode Count : 1546333376
Free Inodes : 1546264508

Brick : Brick nybaknode8.domain.net:/lvbackups/brick
TCP Port : 51497
RDMA Port : 0
Online : Y
Pid : 1701
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota
Inode Size : 512
Disk Space Free : 4.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849200

Brick : Brick nybaknode10.domain.net:/lvbackups/brick
TCP Port : 49787
RDMA Port : 0
Online : Y
Pid : 1878
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 6.7TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108850142

Brick : Brick nybaknode1.domain.net:/lvbackups/brick
TCP Port : 52392
RDMA Port : 0
Online : Y
Pid : 1781
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=128,swidth=128,noquota
Inode Size : 512
Disk Space Free : 6.6TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108850426

[root@nybaknode1 ~]#

- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/

Here are sanitized logs attached from one of the affected gluster nodes that experienced the issue today and last week. If you need more logs, please let us know and we are willing to share more directly with someone. We have 12 glusterfs nodes in this location for our backups.

- Is there any crash? Provide the backtrace and coredump

No crash is involved as far as I know.

Additional info:

We are seeing an 'error=No space left on device' issue on Glusterfs 10.4 on AlmaLinux 8 (4.18.0-425.19.2.el8_7.x86_64) and hoping someone can help advise, as it has become critical: we use glusterfs for backups of the entire infrastructure for the affected location (NYC). We have another, similarly configured location on 10.3 that is not yet experiencing this issue, but it is about 60% smaller by number of nodes.

We have been using this 12-node glusterfs (plain) distributed vsftpd backup cluster for years (it is not new), and about 3-4 weeks ago we upgraded from v9 to v10.4. I do not know if the upgrade is related to this new issue.

We are seeing a new 'error=No space left on device' error (below) in the logs on multiple gluster v10.4 nodes. Last week we saw it in the logs on about half of the nodes (5 out of 12), and on 2 more today before I rebooted. The issue goes away if we reboot all the glusterfs nodes, but backups take a little over 2 days to complete each weekend, and the issue returns after about 1 day of backups running, before the backup cycle is complete. It has happened on each of the last 3 weekends we have run backups to these nodes.

example log msg from /var/log/glusterfs/home-volbackups.log

[2023-05-06 23:47:38.645324 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:670:client4_0_writev_cbk] 0-volbackups-client-23: remote operation failed. [{errno=28}, {error=No space left on device}]
[2023-05-06 23:47:38.645376 +0000] W [fuse-bridge.c:1970:fuse_err_cbk] 0-glusterfs-fuse: 980901423: FLUSH() ERR => -1 (No space left on device)

Each glusterfs node has a single brick, locally mounts the single distributed volume as a glusterfs client, and receives our backup files to the volume over ftp and over nfs-ganesha each weekend. This weekend we tested only ftp uploads, and the problem happened the same with or without nfs-ganesha backup file uploads.

We distribute the ftp upload load between the servers through a combination of /etc/hosts entries and AWS weighted DNS. We also use nfs-ganesha, but this weekend we ran only FTP backup uploads as a test to rule out nfs-ganesha, and we experienced the same issue with ftp uploads only.

We currently have 61 TB available on the volume, though, and each of the 12 nodes has 2-8 TB free, so we are nowhere near out of space on any node.

We have already tried changing the setting from 'cluster.min-free-disk: 1%' to 'cluster.min-free-disk: 1GB' and rebooted all the gluster nodes to refresh them, and it happened again. That was mentioned as an idea in this doc: https://access.redhat.com/solutions/276483.
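
For reference, an option change like this is applied with gluster volume set and can be read back with gluster volume get; a minimal sketch using the volume name from this report (not necessarily the exact commands used here):

# set the minimum free disk threshold as an absolute size instead of a percentage
gluster volume set volbackups cluster.min-free-disk 1GB
# read back the value the bricks will actually use
gluster volume get volbackups cluster.min-free-disk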

Does anyone know what we might check next?

Crossposted to https://lists.gluster.org/pipermail/gluster-users/2023-May/040289.html

- The operating system / glusterfs version:

Almalinux 8 4.18.0-425.19.2.el8_7.x86_64

[root@nybaknode1 ~]# rpm -qa | grep 'gluster|nfs'
nfs-ganesha-selinux-3.5-3.el8.noarch
glusterfs-client-xlators-10.4-1.el8s.x86_64
nfs-ganesha-utils-3.5-3.el8.x86_64
glusterfs-selinux-2.0.1-1.el8s.noarch
libglusterd0-10.4-1.el8s.x86_64
nfs-ganesha-gluster-3.5-3.el8.x86_64
libnfsidmap-2.3.3-57.el8_7.1.x86_64
libglusterfs0-10.4-1.el8s.x86_64
glusterfs-cli-10.4-1.el8s.x86_64
glusterfs-server-10.4-1.el8s.x86_64
nfs-ganesha-3.5-3.el8.x86_64
centos-release-nfs-ganesha30-1.0-2.el8.noarch
glusterfs-fuse-10.4-1.el8s.x86_64
sssd-nfs-idmap-2.7.3-4.el8_7.3.x86_64
centos-release-gluster10-1.0-1.el8.noarch
glusterfs-10.4-1.el8s.x86_64
nfs-utils-2.3.3-57.el8_7.1.x86_64
[root@nybaknode1 ~]#

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

logs-screenshots-sanitized.zip

mohit84 commented 1 year ago

In release 10.4 we recently changed the code path to respect the storage.reserve value via this patch (https://github.com/gluster/glusterfs/issues/3636); that change is why you are facing this issue. For the time being, I would suggest downgrading glusterfs to release 10.3 to avoid it. I will try to fix it.
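
To see whether the storage.reserve logic is what is rejecting the writes, the configured reserve and the per-brick headroom can be compared; a minimal sketch, using the volume name from this report:

# show the configured reserve (a plain number is treated as a percentage of the brick size)
gluster volume get volbackups storage.reserve
# compare against what each brick actually reports
gluster volume status volbackups detail | grep -E 'Brick|Disk Space Free|Free Inodes'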

mohit84 commented 1 year ago

Can you please take a statedump of any one brick process that is currently throwing the "No space left on device" error? To take a statedump you have to send a SIGUSR1 signal to the brick process ("kill -SIGUSR1 <brick-pid>"); the command will generate a statedump in the /var/run/gluster directory.
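
For anyone following along, the statedump procedure described above looks roughly like this on one affected node; a sketch that assumes the volume name from this report:

# find the PID of the glusterfsd (brick) process serving this volume
BRICK_PID=$(pgrep -f 'glusterfsd.*volbackups' | head -n 1)
# ask it to dump its internal state
kill -SIGUSR1 "$BRICK_PID"
# the dump appears in the statedump directory, usually /var/run/gluster
ls -lt /var/run/gluster | head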

brandonshoemakerTH commented 1 year ago

Hi @mohit84, Thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the problem, so it will take another 12-24 hours before we see the issue reoccur, but it will, so I will come back with the requested statedump.

Can you point me to any docs, or advise the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail and this is a 12 node 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems the process would be something more manual.

brandonshoemakerTH commented 1 year ago

Re-opening. Sorry, it seems it was closed by my last reply.

mohit84 commented 1 year ago

Hi @mohit84, Thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the problem, so it will take another 12-24 hours before we see the issue reoccur, but it will, so I will come back with the requested statedump.

Can you point me to any docs, or advise the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail and this is a 12 node 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems the process would be something more manual.

The downgrade procedure is similar to the upgrade; you need to follow the same process. Yes, it is completely safe.

mohit84 commented 1 year ago

Hi @mohit84, Thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the problem, so it will take another 12-24 hours before we see the issue reoccur, but it will, so I will come back with the requested statedump.

Can you point me to any docs, or advise the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail and this is a 12 node 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems the process would be something more manual.

The downgrade procedure is similar to the upgrade; you need to follow the same process. Yes, it is completely safe.

You can try once in test environment if you are hesitant to try in the production environment.
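
A minimal sketch of what an in-place downgrade could look like on AlmaLinux 8 with the CentOS Storage SIG packages; the 10.3 version strings below are assumptions, so check what the repositories actually provide before running anything, and follow the usual per-node offline upgrade/downgrade procedure:

# see which glusterfs builds the enabled repositories still offer
dnf --showduplicates list glusterfs-server
# stop gluster services and processes on this node first (unmount local fuse mounts as well)
systemctl stop glusterd
pkill glusterfsd; pkill glusterfs
# downgrade the packages to the 10.3 build (version/release string assumed)
dnf downgrade 'glusterfs*-10.3-1.el8s*' 'libgluster*-10.3-1.el8s*'
# bring the node back and verify
systemctl start glusterd
gluster --version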

brandonshoemakerTH commented 1 year ago

OK, yes, I will set up a test server to test it. I will look for the 10.3 packages tomorrow, as it is midnight here. Thanks for the advice.

brandonshoemakerTH commented 1 year ago

Hi @mohit84

I have the statedump file now, as we had the issue happen again in the last hour. I've sanitized the file, I think, and removed our domain references. Is there anything else in this file that might be sensitive besides the domain/hostname references? It's 230215 lines, so I'm not able to check all of it and be sure.

Is it possible for me to send this to you privately somehow, or only through a public reply here? Or are the hostname and directory path references the only sensitive things in the file?

mohit84 commented 1 year ago

Yes, you can share it at my mail id: moagrawa@redhat.com

brandonshoemakerTH commented 1 year ago

Thanks @mohit84, we sent the statedump and other log files. We downgraded to 10.3 this morning and will re-run our backups to these glusterfs 10.3 servers. Do let us know if we can assist your team with anything else regarding this issue. We will report back in a few days, after the backups hopefully complete without re-encountering the issue.

eg-ops commented 1 year ago

Since updating to version 10.4, we have been facing the same issue. After a couple of hours, we receive the error message 'No space left on device', and we have to restart all three GlusterFS nodes. After that, it works for the next couple of hours until we encounter the same issue again.

brandonshoemakerTH commented 1 year ago

@mohit84 After downgrading to 10.3, our backups to these glusterfs nodes completed after running for 2 days without encountering the issue again. We appreciate your help on this issue.

@eg-ops you should consider the same 10.3 downgrade. From the testing we just did, it does seem to be an issue in 10.4 that does not affect 10.3.

FleloShe commented 1 year ago

Hi there, we are currently experiencing the same issue with 10.4. Unfortunately we can't find the 10.3 packages for Ubuntu (specifically Ubuntu 18.04 Bionic). It would be awesome to get some hints on where to get the packages!

xhernandez commented 1 year ago

@brandonshoemakerTH @eg-ops @FleloShe do you create hard-linked files in the Gluster volume where at least one of the hard links gets deleted regularly?

brandonshoemakerTH commented 1 year ago

@xhernandez no hard links are used by us.

@FleloShe sorry, I'm not so familiar with gluster packages on Ubuntu.

In the last 2 weeks we have not seen the issue re-occur on 10.3.

FleloShe commented 1 year ago

@xhernandez in our case only one brick appears to be affected, because only 1 gluster node out of 4 was updated from 10.2 to 10.4. The related volume is only used for persisting data for a dockerized redis instance. I can't really tell what redis does there, but it appears to create a dump file every X minutes, which should be absolutely doable for gluster.

Edit: Log from /var/log/glusterfs/bricks/glusterfs-myvolumename-vol.log

[2023-05-26 08:47:10.244980 +0000] E [MSGID: 115067] [server-rpc-fops_v2.c:1324:server4_writev_cbk] 0-myvolumename-vol-server: WRITE info [{frame=168085833}, {WRITEV_fd_no=0}, {uuid_utoa=00afcfe7-5701-418e-b8f8-ff1984032a68}, {client=CTX_ID:c70f43ca-2c20-41fa-b7e2-9786339b84fa-GRAPH_ID:0-PID:3542-HOST:myhostname-PC_NAME:myvolumename-vol-client-0-RECON_NO:-6}, {error-xlator=myvolumename-vol-posix}, {errno=28}, {error=No space left on device}]

nikow commented 1 year ago

Can I safely downgrade from 11.0 to 10.3, too?

I noticed that if I stop the volume and start it back up, it starts working again. Another thing is that I can increase the 'time before it locks up again' by increasing the number of file descriptors.
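
For reference, the stop/start workaround mentioned here is just a volume-level restart, which is less disruptive than rebooting every node; a sketch with a placeholder volume name:

VOL=myvol                      # placeholder: replace with your volume name
gluster volume stop "$VOL"     # asks for confirmation and stops all brick processes
gluster volume start "$VOL"
# raising the file-descriptor limit for gluster processes is typically done with a
# systemd override (e.g. LimitNOFILE=) for glusterd; this is an assumption, adjust to your setup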

mohit84 commented 1 year ago

Can I safely downgrade from 11.0 to 10.3, too?

I noticed that if I stop the volume and start it back up, it starts working again. Another thing is that I can increase the 'time before it locks up again' by increasing the number of file descriptors.

Yes, you can downgrade safely. Would it be possible for you to share the reproducer steps? We are not facing any issue on our daily regression test build server.

ben-xo commented 1 year ago

Following this issue, as we have started to encounter it on 10.4 as well.

ufou commented 1 year ago

Are there any sysctl or glusterfs values which could be tuned to help delay this error until a permanent fix is created?

ufou commented 1 year ago

I tried downgrading a single node (Ubuntu Jammy running 10.4 package from http://ppa.launchpad.net/gluster/glusterfs-10/ubuntu) after creating some 10.3 Ubuntu jammy packages - unfortunately after installing the 10.3 packages, although the glusterd process starts normally, the gluster brick processes fail to start:

[2023-07-17 11:58:48.847641 +0000] E [MSGID: 106005] [glusterd-utils.c:6917:glusterd_brick_start] 0-management: Unable to start brick server1:/media/storage

and brick logs:

[2023-07-17 11:58:48.773195 +0000] W [MSGID: 101095] [xlator.c:392:xlator_dynload] 0-xlator: DL open failed [{error=/usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/protocol/server.so: undefined symbol: xdr_gfx_readdir_rsp}]
[2023-07-17 11:58:48.773216 +0000] E [MSGID: 101002] [graph.y:211:volume_type] 0-parser: Volume 'storage-server', line 133: type 'protocol/server' is not valid or not found on this machine
[2023-07-17 11:58:48.773242 +0000] E [MSGID: 101019] [graph.y:321:volume_end] 0-parser: "type" not specified for volume storage-server
[2023-07-17 11:58:48.773539 +0000] E [MSGID: 100026] [glusterfsd.c:2509:glusterfs_process_volfp] 0-: failed to construct the graph []

Should I try 10.2?

ufou commented 1 year ago

OK, ignore the last comment, I neglected to install all the supporting libs created by the build.sh script, so this now works to downgrade to 10.3:

dpkg -i libgfrpc0_10.3-ubuntu1~jammy1_amd64.deb libgfapi0_10.3-ubuntu1~jammy1_amd64.deb libgfchangelog0_10.3-ubuntu1~jammy1_amd64.deb glusterfs-client_10.3-ubuntu1~jammy1_amd64.deb glusterfs-common_10.3-ubuntu1~jammy1_amd64.deb glusterfs-server_10.3-ubuntu1~jammy1_amd64.deb libgfxdr0_10.3-ubuntu1~jammy1_amd64.deb libglusterd0_10.3-ubuntu1~jammy1_amd64.deb libglusterfs0_10.3-ubuntu1~jammy1_amd64.deb libglusterfs-dev_10.3-ubuntu1~jammy1_amd64.deb

sulphur commented 1 year ago

I encountered the same "error=No space left on device" issue, even though I had free space. However, in my case, the partitions where the bricks are located had run out of inodes. I'm posting this here in case someone else experiences the same problem.
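
A quick way to rule this in or out is to check inode usage directly on the brick filesystem; a sketch using the brick path from earlier in this thread (adjust to your own brick mount):

# inode usage: IUse% at 100% causes ENOSPC even when plenty of bytes are free
df -i /lvbackups/brick
# block usage, for comparison
df -h /lvbackups/brick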

NHellFire commented 1 year ago

Setting storage.reserve (I used 5GB) on each volume fixed this for me with 10.4.
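
For anyone trying the same workaround, setting and reading the reserve looks like this; a sketch with a placeholder volume name:

VOL=myvol                                       # placeholder: replace with your volume name
gluster volume set "$VOL" storage.reserve 5GB   # an absolute size, as used in this comment
gluster volume get "$VOL" storage.reserve       # a plain number would mean a percentage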

baskinsy commented 1 year ago

I have the same issue on a single-brick distributed volume with 10.4. Stopping and starting the volume resolves it temporarily. Setting storage.reserve to 1GB didn't help in our case.

baskinsy commented 1 year ago

We are constantly hitting this issue on a single-brick distributed volume (the simplest type of volume and installation): no other nodes, only one node with one brick, no special settings, a typical installation according to the documentation. It works for some time after a stop-start and then the same thing happens again. This is getting very frustrating and makes glusterfs unusable. Please provide packages to downgrade to 10.3.

dubsalicious commented 1 year ago

Having hit this same issue, I've attempted the storage.reserve fix with no luck. I also attempted to downgrade to version 10.3 (using Debian's packages) and 10.1 (using the built-in Ubuntu packages), but in both cases the volume wouldn't start because of an "undefined symbol" error. In one case it was "mem_pools" and in the other it was "mem_pools_init".

NHellFire commented 1 year ago

Setting storage.reserve (I used 5GB) on each volume fixed this for me with 10.4.

Update: That only fixed it temporarily. I'm now back to almost every write returning no space left, despite the least amount of free space in the cluster being 200GB. Setting storage.reserve is no longer making a difference. I've now upgraded all nodes to 11 and it's working again.

AmineYagoub commented 1 year ago

We are seeing the same issue on v11. Is there any tutorial on how to downgrade to v10.3 on Ubuntu 22.04?

Arakmar commented 1 year ago

For those interested, I published fixed packages on my PPA for 22.04 and 20.04. It's based on official 10.4 packages plus the patch fixing the issue (8830f22b2428dbec7bf610341d91d748057236f1). The upgrade should be automatic if you are using packages from the official PPA. https://launchpad.net/~yoann-laissus/+archive/ubuntu/gluster
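
If it helps anyone, enabling a Launchpad PPA like the one above normally looks like this; the PPA identifier is inferred from the URL, so verify it before using:

sudo add-apt-repository ppa:yoann-laissus/gluster
sudo apt update
sudo apt install --only-upgrade glusterfs-server glusterfs-client glusterfs-common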

dubsalicious commented 1 year ago

Will this fix be included in a Gluster 10.5 release? The tentative date for that release was 2 weeks ago, so it's possibly due imminently? I hope the PPA above helps people; unfortunately I think I'll have to wait for an official release.

lwierzch commented 11 months ago

My team is seeing the symptoms from this bug almost daily. Are there any updates on the release date?

baskinsy commented 11 months ago

For those interested, I published fixed packages on my PPA for 22.04 and 20.04. It's based on official 10.4 packages plus the patch fixing the issue (8830f22). The upgrade should be automatic if you are using packages from the official PPA. https://launchpad.net/~yoann-laissus/+archive/ubuntu/gluster

We can confirm that after installing the packages, restarting glusterd and a stop-start on the volume, the issue seems to have been resolved. Thank you.

baskinsy commented 10 months ago

10.5 is released. I was not able to verify whether it includes the fix mentioned here, and we cannot test it on our system. It would be good if someone could share that info.
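
One way to check whether the release-10 branch picked up the fix, without installing anything, is to search the history between the two release tags; a sketch, and since the exact commit subject is not certain here, grep broadly:

git clone --bare https://github.com/gluster/glusterfs.git
cd glusterfs.git
# list everything that landed between 10.4 and 10.5 and look for the reserve-related posix fix
git log --oneline v10.4..v10.5 | grep -iE 'reserve|no space'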

mdetrano commented 10 months ago

I've done a little testing on 10.5 and it seems to be ok... by that I mean I ran a stress test script to write, read, and move files and let it run for a length of time where the same test would usually produce the "out of space" error. In 19 hours it didn't show any problems but that's just my basic test of a two node replication setup.

Franco-Sparrow commented 10 months ago

Hi, reactivating this issue.

I am using 10.4 and have started to see this issue on a two-way distributed replicated volume. Downgrading to 10.3 is not an option, as there are worse issues on that version related to brick disconnections, which pushed us to upgrade to 10.4; 10.4 fixed most of them, and 10.5 fixes even more of these errors.

Can anyone confirm that 10.5 fixes this issue? So far only @mdetrano has run some tests under 10.5, and it seems to be OK, but it would be nice if we help each other and share whether this 'no space left' issue was finally resolved.

Thanks in advance

Arakmar commented 10 months ago

After several months on now 10.5 and also a custom 10.4 build (with https://github.com/gluster/glusterfs/commit/8830f22b2428dbec7bf610341d91d748057236f1 which is included in 10.5), I can confirm the issue is definitely gone for us.

Franco-Sparrow commented 10 months ago

After several months on now 10.5 and also a custom 10.4 build (with 8830f22 which is included in 10.5), I can confirm the issue is definitely gone for us.

Thanks Sir

(screenshot: list of bug fixes included in the 10.5 release)

Looking at the bug fixes in 10.5, it looks like the patch was included. Thanks also for your confirmation :)