Closed: Deltik closed this issue 3 months ago
It seems the brick process is hitting a stack overflow during unref of the namespace inode. The ns_inode was introduced by this patch (https://github.com/gluster/glusterfs/pull/1763/files). @amarts, can you please share your view on this?
@mohit84 or @amarts: Is there a workaround to avoid the crash? And do you need any more information from me to debug this one?
The brick keeps crashing and causing shared storage outages. I'm hoping to offer a remedy to the customer before their scheduled go-live next week.
Other than applying a patch to fix the issue, I don't think there is any other solution/workaround: either you have to revert the patch (https://github.com/gluster/glusterfs/pull/1763/files) or you have to apply a patch that fixes the crash. I will try to share a patch for the fix; let me know if you are interested in testing it in your environment.
Thanks for the reply, @mohit84. https://github.com/gluster/glusterfs/pull/1763 doesn't revert cleanly, so I'm willing to test a patch from you to fix the crash.
Can you please try applying the below patch in your environment and share the result?
diff --git a/libglusterfs/src/inode.c b/libglusterfs/src/inode.c
index 64ea78c6b2..59d7be9ffe 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -351,7 +351,17 @@ __inode_ctx_free(inode_t *inode)
 static void
 __inode_destroy(inode_t *inode)
 {
-    inode_unref(inode->ns_inode);
+    inode_table_t *table = NULL;
+    inode_t *ns_inode = inode->ns_inode;
+
+    if (ns_inode) {
+        table = ns_inode->table;
+        pthread_mutex_lock(&table->lock);
+        {
+            __inode_unref(ns_inode, false);
+        }
+        pthread_mutex_unlock(&table->lock);
+    }
     __inode_ctx_free(inode);
Thank you for the patch. It'll take some time to roll out on my end. I'll report back perhaps in a week or two whether the brick seems to be stable once it's been running a while.
Thanks for the confirmation, I will wait for your response.
@mohit84: The patch appears to stabilize glusterfsd. There have been no reports of the brick crashing since deploying the fix on 19 January 2024, 5 days ago.
Thanks for confirming it; let's wait one more week. I will upload a patch next week.
I don't know enough about the underlying code, so pardon my ignorance on this one, but since the code appears to be related to inodes, will this patch resolve the infinite "inode path not completely resolved. Asking for full path" log entries in the brick logs or is this unrelated?
Also, when you say upload a patch, I assume that means it has to be applied manually, not added to the repos for updates via apt? Thanks in advance.
I don't think it should be related to this; those messages are thrown by the brick only during a gfid-based lookup. Ideally it should be a DEBUG message, but somehow it was implemented as an INFO message. As for the patch, it is already merged in the devel branch, and a pull request was already created to backport it to release-11.
As for the crash, can you please share "thread apply all bt full" output after attaching the core with gdb? I asked in the past as well but did not get any update, so it is difficult to find the RCA.
Are you asking me for another backtrace?
Not from you; I was asking @edrock200 to share the backtrace.
My apologies, @mohit84. At the time you asked, my knowledge of how to conduct such a task was lacking; I wasn't intentionally ignoring your request. I believe I know how to do this now. That said, yesterday I turned off nl-cache on said volume, and the errors appear to have dissipated. Too soon to tell; I will let it burn in for 48h or so. The nl-cache setting also seems to prevent heals from commencing when a brick is replaced, FWIW. Also, apologies for hijacking the thread. If the issue resurfaces I will open a new issue; if not, I will update here.
I've filed https://bugs.launchpad.net/ubuntu/+source/glusterfs/+bug/2064843 to try and get this patched as it now affects Ubuntu 24.04. @mohit84 would it be possible to cut a release containing #4302?
@aravindavk can confirm about the release; Red Hat is no longer maintaining glusterfs, so I am not sure about the next release.
Bug Description

There is a stack overflow that crashes the GlusterFS brick process, glusterfsd, with SIGSEGV. pl_readdirp() calls posix_acl_readdirp(), which calls br_stub_readdirp(), which calls posix_readdirp(), which calls posix_do_readdir(), which calls gf_dirent_free(), which calls gf_dirent_entry_free(), which calls inode_unref(). inode_unref() then repeatedly nests calls to itself via inode_table_prune() → __inode_destroy() → inode_unref() until the program is killed and the brick goes offline.

How to Reproduce
It is not known how to trigger this crash, but the core dump suggests that the bitrot daemon and readdir() are involved. Oddly, the bitrot daemon is not enabled (features.bitrot: off on the volume).

The cmdline of the brick process is as follows:
Failure Output
Segmentation fault (core dumped)
Excerpt from the backtrace at the moment of the crash:
Expected Behavior

No crash of glusterfsd, especially when the bitrot daemon is disabled.

- The operating system / glusterfs version: Debian 12 running GlusterFS 11.1