
NFS-Ganesha is an NFSv3, v4, v4.1 fileserver that runs in user mode on most UNIX/Linux systems.

[V5.3][NFSv4.2/(krb5,krb5i,krb5p)] NFS-Ganesha crashes just after client mount #959

Closed · gpenin closed this issue 8 months ago

gpenin commented 1 year ago

Using a fresh NFS-Ganesha V5.3 (GlusterFS 11.0 backend) build on Ubuntu 22.04 LTS, NFS-Ganesha crashes just after client mount (NFSv4.2/krb5p):

Jun 12 14:35:42 ughdybnt01 systemd[1]: nfs-ganesha.service: Main process exited, code=killed, status=6/ABRT
Jun 12 14:35:42 ughdybnt01 systemd[1]: nfs-ganesha.service: Failed with result 'signal'.

Seems to be related to acquiring a mutex just after it has been destroyed (see the sketch after the log excerpt):

12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_allocate_by_uid :RW LOCK :F_DBG :Init mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:281
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_hold_group_data :RW LOCK :F_DBG :Acquired mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:69
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_hold_group_data :RW LOCK :F_DBG :Released mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:71
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_hold_group_data :RW LOCK :F_DBG :Acquired mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:69
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_hold_group_data :RW LOCK :F_DBG :Released mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:71
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_release_group_data :RW LOCK :F_DBG :Acquired mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:78
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_release_group_data :RW LOCK :F_DBG :Released mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:80
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_3] uid2grp_release_group_data :RW LOCK :F_DBG :Destroy mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:81
12/06/2023 14:35:42 : epoch 64871111 : ughdybnt01 : ganesha.nfsd-25765[svc_9] uid2grp_hold_group_data :RW LOCK :CRIT :Error 22, acquiring mutex 0x7f1ade042bf0 (&gdata->gd_lock) at /home/mockbuild/nfs-ganesha/nfs-ganesha/src/support/uid2grp.c:69
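For illustration, a minimal sketch of the pattern the log points at (simplified stand-in types, not the actual uid2grp code, which keeps group_data entries in an AVL-tree cache):

```c
/* Minimal sketch of the suspected race, under assumed, simplified
 * types: one thread drops what it believes is the last reference and
 * destroys gd_lock, but the uid2grp cache can still hand the same
 * gdata to another thread, whose pthread_mutex_lock() then fails
 * with EINVAL (error 22), which nfs-ganesha treats as fatal and
 * abort()s -- matching the log and backtrace above. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct group_data {                     /* simplified stand-in */
	pthread_mutex_t gd_lock;
	int64_t refcount;
};

static void uid2grp_hold_group_data(struct group_data *gdata)
{
	pthread_mutex_lock(&gdata->gd_lock); /* EINVAL once destroyed */
	gdata->refcount++;
	pthread_mutex_unlock(&gdata->gd_lock);
}

static void uid2grp_release_group_data(struct group_data *gdata)
{
	int64_t refcount;

	pthread_mutex_lock(&gdata->gd_lock);
	refcount = --gdata->refcount;
	pthread_mutex_unlock(&gdata->gd_lock);

	if (refcount == 0) {
		/* BUG: the cache still references gdata, so another
		 * thread can call hold() on a mutex that no longer
		 * exists -- the "Destroy" followed by "Error 22"
		 * sequence in the log. */
		pthread_mutex_destroy(&gdata->gd_lock);
		free(gdata);
	}
}
```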

Please find the full debug log attached.

ganesha.log.gz

ffilz commented 1 year ago

Do you have the stack backtrace also?

gpenin commented 1 year ago

Backtrace:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140401960810048) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140401960810048) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140401960810048, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fb1f0fdd476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fb1f0fc37f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fb1f15c1b55 in uid2grp_hold_group_data (gdata=0x7fb1e107d070) at /usr/local/src/nfs-ganesha/src/support/uid2grp.c:69
#6  0x00007fb1f15c378d in uid2grp (uid=0, gdata=0x7fb1f03d8b78) at /usr/local/src/nfs-ganesha/src/support/uid2grp.c:349
#7  0x00007fb1f158e55a in nfs_req_creds (req=0x7fb1f03d8400) at /usr/local/src/nfs-ganesha/src/support/nfs_creds.c:419
#8  0x00007fb1f158f424 in nfs4_export_check_access (req=0x7fb1f03d8400) at /usr/local/src/nfs-ganesha/src/support/nfs_creds.c:623
#9  0x00007fb1f15eb4ff in nfs4_mds_putfh (data=0x7fb1e4f51a00) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_op_putfh.c:181
#10 0x00007fb1f15eb908 in nfs4_op_putfh (op=0x7fb1e55e3410, data=0x7fb1e4f51a00, resp=0x7fb1f0225aa0) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_op_putfh.c:282
#11 0x00007fb1f15ce8ff in process_one_op (data=0x7fb1e4f51a00, status=0x7fb1e0ffbeac) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_Compound.c:912
#12 0x00007fb1f15cff1c in nfs4_Compound (arg=0x7fb1f03d8c28, req=0x7fb1f03d8400, res=0x7fb1e4677780) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_Compound.c:1376
#13 0x00007fb1f1506b57 in nfs_rpc_process_request (reqdata=0x7fb1f03d8400, retry=false) at /usr/local/src/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1517
#14 0x00007fb1f1507152 in nfs_rpc_valid_NFS (req=0x7fb1f03d8400) at /usr/local/src/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1732
#15 0x00007fb1f0f72ffa in svc_vc_decode (req=0x7fb1f03d8400) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_vc.c:1044
#16 0x00007fb1f0f6e057 in svc_request (xprt=0x7fb1e100c000, xdrs=0x7fb1e4b72000) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1202
#17 0x00007fb1f0f72ede in svc_vc_recv (xprt=0x7fb1e100c000) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_vc.c:1017
#18 0x00007fb1f0f6dfd3 in svc_rqst_xprt_task_recv (wpe=0x7fb1e100c2c8) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1183
#19 0x00007fb1f0f6ebf4 in svc_rqst_epoll_loop (wpe=0x7fb1f02c5518) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1564
#20 0x00007fb1f0f7ad5d in work_pool_thread (arg=0x7fb1e1000000) at /usr/local/src/nfs-ganesha/src/libntirpc/src/work_pool.c:183
#21 0x00007fb1f102fb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007fb1f10c1a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Coredump attached: core.ganesha-nfsd.gz

ffilz commented 1 year ago

Oh, crud.... when group_data refcount is reduced, we destroy the mutex... OOPS...

ffilz commented 1 year ago

Oh, double ick... the lifecycle management of group data is horrid...

ffilz commented 1 year ago

Try this patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/555411?usp=search

It moves the PTHREAD_MUTEX_destroy so it isn't destroyed on every unref...

I think in fact it will never be destroyed because when the group_data is removed from the avl tree, it doesn't unref... So there needs to be some fixing up of that table management, but this should at least return things back to what they were before I added the PTHREAD_MUTEX_destroy...
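Roughly, the direction the fix takes (a sketch inferred from this comment, not the literal gerrit change; group_data_free is a hypothetical helper):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct group_data {                     /* same simplified stand-in */
	pthread_mutex_t gd_lock;
	int64_t refcount;
};

/* unref no longer destroys the mutex, so threads that obtained the
 * same gdata from the cache can still lock it safely */
static void uid2grp_release_group_data(struct group_data *gdata)
{
	pthread_mutex_lock(&gdata->gd_lock);
	gdata->refcount--;
	pthread_mutex_unlock(&gdata->gd_lock);
}

/* destroy the mutex only when the entry is finally freed, i.e. after
 * it has been removed from the AVL tree and can no longer be found */
static void group_data_free(struct group_data *gdata)
{
	pthread_mutex_destroy(&gdata->gd_lock);
	free(gdata);
}
```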

gpenin commented 1 year ago

> Try this patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/555411?usp=search
>
> It moves the PTHREAD_MUTEX_destroy so it isn't destroyed on every unref...
>
> I think in fact it will never be destroyed because when the group_data is removed from the avl tree, it doesn't unref... So there needs to be some fixing up of that table management, but this should at least return things back to what they were before I added the PTHREAD_MUTEX_destroy...

Thanks for the patch; with it applied, NFS-Ganesha no longer crashes while mounting.

It's probably totally unrelated, but it now crashes when performing I/O on a file (read, write, create, remove, etc.) on the mounted FS.

The backtrace (rm on a file):

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f710f9af87b in close_fsal_fd (obj_hdl=0x7f7106855cf0, fsal_fd=0x7f7106855c28, is_reclaiming=false) at /usr/local/src/nfs-ganesha/src/FSAL/commonlib.c:1945
#2  0x00007f710eb1fba7 in file_close (obj_hdl=0x7f7106855cf0) at /usr/local/src/nfs-ganesha/src/FSAL/FSAL_GLUSTER/handle.c:1341
#3  0x00007f710fb148fb in mdcache_close (obj_hdl=0x7f710685ed38) at /usr/local/src/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:118
#4  0x00007f710f9c087d in fsal_close (obj_hdl=0x7f710685ed38) at /usr/local/src/nfs-ganesha/src/include/fsal.h:374
#5  0x00007f710f9c40a1 in fsal_remove (parent=0x7f7103aa2538, name=0x7f7106837640 "toto.bin") at /usr/local/src/nfs-ganesha/src/FSAL/fsal_helper.c:1435
#6  0x00007f710fad5fa7 in nfs4_op_remove (op=0x7f7106884120, data=0x7f710686b180, resp=0x7f7106856540) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_op_remove.c:103
#7  0x00007f710fab2903 in process_one_op (data=0x7f710686b180, status=0x7f70fecf8eac) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_Compound.c:912
#8  0x00007f710fab3f20 in nfs4_Compound (arg=0x7f710682cc28, req=0x7f710682c400, res=0x7f710688bc80) at /usr/local/src/nfs-ganesha/src/Protocols/NFS/nfs4_Compound.c:1376
#9  0x00007f710f9eab57 in nfs_rpc_process_request (reqdata=0x7f710682c400, retry=false) at /usr/local/src/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1517
#10 0x00007f710f9eb152 in nfs_rpc_valid_NFS (req=0x7f710682c400) at /usr/local/src/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1732
#11 0x00007f710f456ffa in svc_vc_decode (req=0x7f710682c400) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_vc.c:1044
#12 0x00007f710f452057 in svc_request (xprt=0x7f70ff00c000, xdrs=0x7f710686b000) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1202
#13 0x00007f710f456ede in svc_vc_recv (xprt=0x7f70ff00c000) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_vc.c:1017
#14 0x00007f710f451fd3 in svc_rqst_xprt_task_recv (wpe=0x7f70ff00c2c8) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1183
#15 0x00007f710f452bf4 in svc_rqst_epoll_loop (wpe=0x7f710e6c5518) at /usr/local/src/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1564
#16 0x00007f710f45ed5d in work_pool_thread (arg=0x7f70ff0000a0) at /usr/local/src/nfs-ganesha/src/libntirpc/src/work_pool.c:183
#17 0x00007f710f513b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#18 0x00007f710f5a5a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Please find the full debug log and coredump attached.

Feel free to ask me to open a separate issue if preferred.

ganesha.log.gz
core.ganesha-nfsd.gz

ffilz commented 1 year ago

Oh, fudge... I managed to not fully convert FSAL_GLUSTER for the fsal_fd changes I made... See issue #960
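Frame #0 sitting at address 0x0 is the classic signature of a call through a NULL function pointer; a minimal sketch of that failure mode (hypothetical names, not ganesha's real fsal_fd API):

```c
#include <stddef.h>

struct fsal_fd {                        /* simplified stand-in */
	int fd;
};

struct fsal_fd_ops {
	/* slot an unconverted FSAL (here FSAL_GLUSTER) never fills in */
	int (*close_fd)(struct fsal_fd *fd);
};

static int close_fsal_fd(struct fsal_fd_ops *ops, struct fsal_fd *fd)
{
	/* with ops->close_fd == NULL the call jumps to address 0,
	 * which is why frame #0 shows "0x0000000000000000 in ?? ()" */
	return ops->close_fd(fd);
}
```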

ffilz commented 1 year ago

You won't be able to truly verify this one until FSAL_GLUSTER is fixed...

ffilz commented 1 year ago

Could you try with V5.5?

gpenin commented 1 year ago

Hi @ffilz,

I confirm it's now OK with V5.5.2.

Regards,

ffilz commented 1 year ago

Great. Thanks.

ffilz commented 8 months ago

Merged in V5.3.1