nfs-ganesha / nfs-ganesha

NFS-Ganesha is an NFSv3,v4,v4.1 fileserver that runs in user mode on most UNIX/Linux systems
1.45k stars 505 forks source link

Attempting to read from a file over 9P causes ganesha to fall over #1042

Open dhowells opened 6 months ago

dhowells commented 6 months ago

I'm trying to test the Linux 9p filesystem, but when I try to read from a file, ganesha.nfsd dies with SEGV. Valgrind shows the problem:

==38960== Memcheck, a memory error detector ==38960== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==38960== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==38960== Command: ./ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -F ==38960== ==38960== Thread 138: ==38960== Invalid read of size 4 ==38960== at 0x4DC32D6: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.c:41) ==38960== by 0x489700C: sync_cb (fsal_helper.c:1837) ==38960== by 0x49D79DF: mdc_read_super_cb (mdcache_file.c:559) ==38960== by 0x49D7AC5: mdc_read_cb (mdcache_file.c:582) ==38960== by 0x7B4B81F: vfs_read2 (file.c:1317) ==38960== by 0x49D7BCF: mdcache_read2 (mdcache_file.c:617) ==38960== by 0x4897173: fsal_read (fsal_helper.c:1849) ==38960== by 0x4A10FD4: _9p_read (9p_read.c:134) ==38960== by 0x4A0A024: _9p_process_buffer (9p_interpreter.c:181) ==38960== by 0x4A09DCB: _9p_tcp_process_request (9p_interpreter.c:133) ==38960== by 0x48CE182: _9p_execute (9p_dispatcher.c:315) ==38960== by 0x48CE508: _9p_worker_run (9p_dispatcher.c:412) ==38960== Address 0x24 is not stack'd, malloc'd or (recently) free'd

This Ganesha was built from the nfs-ganesha-5.7 tarball and the two patches included in the Fedora 39 srpm. The configuration was:

cmake ~/nfs-ganesha-5.7/src/ -DUSE_SYSTEM_NTIRPC=ON -DUSE_FSAL_RGW=OFF

and the F39 system gcc-13.2.1 was used. In gdb, I see:

0 0x00007ffff797c2d6 in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libc.so.6

1 0x00007ffff7c4100d in sync_cb (obj=0x7fff38000ec8, ret=..., args=0x7fffd67e9ee0, caller_data=0x7fffd67e9f50) at /root/nfs-ganesha-5.7/src/FSAL/fsal_helper.c:1837

2 0x00007ffff7d819e0 in mdc_read_super_cb (obj=0x7fff38000cd0, ret=..., obj_data=0x7fffd67e9ee0, caller_data=0x7fff28000b70)

at /root/nfs-ganesha-5.7/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:559

3 0x00007ffff7d81ac6 in mdc_read_cb (obj=0x7fff38000cd0, ret=..., obj_data=0x7fffd67e9ee0, caller_data=0x7fff28000b70)

at /root/nfs-ganesha-5.7/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:582

4 0x00007ffff76f7820 in vfs_read2 () from /usr/lib64/ganesha/libfsalvfs.so

5 0x00007ffff7d81bd0 in mdcache_read2 (obj_hdl=0x7fff38000ec8, bypass=true, done_cb=0x7ffff7c40e94 , read_arg=0x7fffd67e9ee0, caller_arg=0x7fffd67e9f50)

at /root/nfs-ganesha-5.7/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:617

6 0x00007ffff7c41174 in fsal_read (obj_hdl=0x7fff38000ec8, bypass=true, arg=0x7fffd67e9ee0, data=0x7fffd67e9f50) at /root/nfs-ganesha-5.7/src/FSAL/fsal_helper.c:1849

7 0x00007ffff7dbafd5 in _9p_read (req9p=0x7fff44011580, plenout=0x7fffd67fb1b8, preply=0x7fffd67ea040 "") at /root/nfs-ganesha-5.7/src/Protocols/9P/9p_read.c:134

8 0x00007ffff7db4025 in _9p_process_buffer (req9p=0x7fff44011580, replydata=0x7fffd67ea040 "", poutlen=0x7fffd67fb1b8)

at /root/nfs-ganesha-5.7/src/Protocols/9P/9p_interpreter.c:181

9 0x00007ffff7db3dcc in _9p_tcp_process_request (req9p=0x7fff44011580) at /root/nfs-ganesha-5.7/src/Protocols/9P/9p_interpreter.c:133

10 0x00007ffff7c78183 in _9p_execute (req9p=0x7fff44011580) at /root/nfs-ganesha-5.7/src/MainNFSD/9p_dispatcher.c:315

11 0x00007ffff7c78509 in _9p_worker_run (ctx=0x32db2c0) at /root/nfs-ganesha-5.7/src/MainNFSD/9p_dispatcher.c:412

12 0x00007ffff7cfb98e in fridgethr_start_routine (arg=0x32db2c0) at /root/nfs-ganesha-5.7/src/support/fridgethr.c:486

13 0x00007ffff797d897 in start_thread () from /lib64/libc.so.6

14 0x00007ffff7a046fc in clone3 () from /lib64/libc.so.6

with line 1837 of sync_cb() being:

1837 pthread_cond_signal(data->fsa_cond);

(gdb) p *data $2 = {ret = {major = ERR_FSAL_NO_ERROR, minor = 0}, done = true, fsa_mutex = 0x7fffd67fb320, fsa_cond = 0x0}

so data->fsa_cond is NULL.

martinetd commented 6 months ago

That segv can be fixed with this patch:

diff --git a/src/MainNFSD/9p_dispatcher.c b/src/MainNFSD/9p_dispatcher.c
index abf35a452310..8e0380d9dbd8 100644
--- a/src/MainNFSD/9p_dispatcher.c
+++ b/src/MainNFSD/9p_dispatcher.c
@@ -407,7 +407,7 @@ static void _9p_worker_run(struct fridgethr_context *ctx)
            continue;

        reqdata->_9prq_mutex = &_9pw_mutex;
-       reqdata->_9prq_mutex = &_9pw_mutex;
+       reqdata->_9prq_cond = &_9pw_cond;

        _9p_execute(reqdata);
        _9p_free_reqdata(reqdata);

But even fixing that a simple read just hangs there, so there's obviously more problems waiting. I don't have time to dig right now, sorry...

Given that particular bug has been here since april already (introduced in a4e52ddf16ac565380bc8608c6de65229224cb23 ) I think it's safe to say no-one cares about 9p support in ganesha at this point: @phdeniel @patlucas are you still using it?

If not I'd be in favor of just removing the 9p code; I can eventually get around to fixing it but it's not reasonable to just expect it to keep working if nobody's running tests and looking at it in at least 8 months.

martinetd commented 6 months ago

Ok so the hang is because we're not initializing read/write args and the fsal helpers loops on fsal_resume, a memset helps:

diff --git a/src/Protocols/9P/9p_read.c b/src/Protocols/9P/9p_read.c
index 284dc23910e2..b96dc91a2834 100644
--- a/src/Protocols/9P/9p_read.c
+++ b/src/Protocols/9P/9p_read.c
@@ -114,6 +114,7 @@ int _9p_read(struct _9p_request_data *req9p, u32 *plenout, char *preply)
        struct async_process_data read_data;
        struct fsal_io_arg *read_arg = alloca(sizeof(*read_arg) +
                            sizeof(struct iovec));
+       memset(read_arg, 0, sizeof(*read_arg));

        read_arg->info = NULL;
        read_arg->state = pfid->state;
diff --git a/src/Protocols/9P/9p_write.c b/src/Protocols/9P/9p_write.c
index 44b48d1d3dbf..9a447732e505 100644
--- a/src/Protocols/9P/9p_write.c
+++ b/src/Protocols/9P/9p_write.c
@@ -134,6 +134,7 @@ int _9p_write(struct _9p_request_data *req9p, u32 *plenout, char *preply)
        struct async_process_data write_data;
        struct fsal_io_arg *write_arg = alloca(sizeof(*write_arg) +
                              sizeof(struct iovec));
+       memset(write_arg, 0, sizeof(*write_arg));

        write_arg->info = NULL;
        write_arg->state = pfid->state;

After that, read/writes seem to work but creating a file also seg faults; I didn't try to understand aa00d4136865e5f8ec8296faa67fcd9084771bf2 (Sept 2022, we're now over a year broken) but I'd assume some confusion between ppentry and pentry? This works around the immediate problem even if it's not perfect:

diff --git a/src/Protocols/9P/9p_lcreate.c b/src/Protocols/9P/9p_lcreate.c
index 4c99a8de6e8b..eb03dccb0051 100644
--- a/src/Protocols/9P/9p_lcreate.c
+++ b/src/Protocols/9P/9p_lcreate.c
@@ -165,7 +165,7 @@ int _9p_lcreate(struct _9p_request_data *req9p, u32 *plenout, char *preply)
    pfid->opens = 1;

    /* Get a active reference */
-   pfid->ppentry->obj_ops->get_ref(pfid->ppentry);
+   pfid->pentry->obj_ops->get_ref(pfid->pentry);

    /* Build the reply */
    _9p_setinitptr(cursor, preply, _9P_RLCREATE);
diff --git a/src/Protocols/9P/9p_proto_tools.c b/src/Protocols/9P/9p_proto_tools.c
index e6e9c019c85d..ecb939308515 100644
--- a/src/Protocols/9P/9p_proto_tools.c
+++ b/src/Protocols/9P/9p_proto_tools.c
@@ -408,7 +408,7 @@ int _9p_tools_clunk(struct _9p_fid *pfid)
         */
        while (pfid->opens > 0) {
            /* Release the active reference for each open */
-           pfid->ppentry->obj_ops->put_ref(pfid->ppentry);
+           pfid->pentry->obj_ops->put_ref(pfid->pentry);
            pfid->opens--;
        }

At this point I can create files, but trying to read it back fails an assert check in mdcache:

ganesha.nfsd: /code/nfs-ganesha/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1972: _mdcache_lru_unref: Assertion `!entry->fh_hk.inavl' failed.
Aborted (core dumped)

And I've given up here; this requires more than a few hours every couple of years; I don't have this kind of free time at the moment.

So unless someone else will actively maintain it, it's probably not worth fixing at this point...