nfs-ganesha / ntirpc

New development on tirpc
25 stars 49 forks source link

svc_destroy_it will destroy an xprt without xp_ops initialized #288

Open xueqianhu47 opened 7 months ago

xueqianhu47 commented 7 months ago

During the shutdown process in Ganesha, the sequence svc_shutdown() -> svc_xprt_shutdown() -> svc_destroy_it() is invoked to destroy all transports (xprts) in the svc_xprt_fd tree. We encountered a crash within svc_destroy_it() due to a transport (xprt) lacking initialized xp_ops being processed by this function. Examination of this xprt via the core file revealed it was not fully initialized, with most of its fields remaining unset (zeroed).

Upon reviewing the allocation, insertion, and initialization process for new xprts within the svc_xprt_fd tree, we identified the absence of synchronization mechanisms ensuring that an xprt is fully initialized before its insertion into the tree. For instance, svc_vc_ncreatef() first creates a new xprt and inserts it into the svc_xprt_fd tree using svc_xprt_lookup, followed by the initialization of xp_ops through svc_vc_rendezvous_ops(). However, there is no safeguard to guarantee that these steps are executed atomically. Should a Ganesha shutdown occur midway, a crash could result in svc_destroy_it due to a null pointer dereference. A similar process for creating an xprt is also present in svc_fd_ncreatef().

Is there a strategy to mitigate the aforementioned issues?

ffilz commented 7 months ago

See also Ganesha issues:

https://github.com/nfs-ganesha/nfs-ganesha/issues/1073 https://github.com/nfs-ganesha/nfs-ganesha/issues/1086

I think there needs to be more synchronization for svc_shutdown.

xueqianhu47 commented 7 months ago

@ffilz this issue got hit again in another case: svc_rqst_xprt_task_recv -> svc_vc_rendezvous -> svc_destroy_it and xp_ops == null.

In svc_vc_rendezvous, svc_destroy_it will be called when makefd_xprt finds an existing xprt and the xprt does not have the SVC_XPRT_FLAG_INITIAL flag.

    newxprt = makefd_xprt(fd, req_xd->sx_dr.sendsz, req_xd->sx_dr.recvsz,
                  &si, SVC_XPRT_FLAG_CLOSE);
    if ((!newxprt) || (!(newxprt->xp_flags & SVC_XPRT_FLAG_INITIAL))) {
        close(fd);

        if (newxprt) {
            SVC_DESTROY(newxprt);
            /* Was never added to epoll */
            SVC_RELEASE(newxprt, SVC_RELEASE_FLAG_NONE);
        } else {
            close(fd);
        }

Looks like this is a case that an xprt without SVC_XPRT_FLAG_INITIAL has not yet been initialized for the xp_ops.

The SVC_DESTROY call was added last year, we used to just close(fd): https://github.com/nfs-ganesha/ntirpc/pull/269/commits/ac75536950e98e3a6c79d45dd1ebcfaebc5c029f I am just thinking, what if the svc_vc_rendezvous get called twice, the first time we get a newly created xprt(xp_ops has not yet been initialized), then second call comes, makefd_xprt-> svc_xprt_lookup finds the newly created xprt and clear the SVC_XPRT_FLAG_INITIAL flag:

    rpc_dplx_rli(rec);
    xp_flags = atomic_clear_uint16_t_bits(&xprt->xp_flags,
                          SVC_XPRT_FLAG_INITIAL);
    rpc_dplx_rui(rec);

and try to detroy the xprt when be back in svc_vc_rendezvous. Would you this could be the case

ffilz commented 4 months ago

Has a proposed fix here: https://github.com/nfs-ganesha/ntirpc/pull/289