gridcf / gct

Grid Community Toolkit
Apache License 2.0
46 stars 30 forks source link

Hold XIO server lock across fork #133

Closed fscheiner closed 2 years ago

fscheiner commented 4 years ago

@ellert, @matyasselmeci, @msalle:

Just to be sure, was @bbockelm's PR #63 for the Globus Toolkit ever merged into the GCT?

I don't know how such a deadlock shows in GridFTP server logs (@bbockelm: If you could provide an example that would be great.), but I assume this PR could be useful for the GCT. I assume this is mostly relevant for GridFTP servers.

msalle commented 4 years ago

Hi @fscheiner, short answer: no it hasn't been merged in. Whether it's still needed I don't know.

fscheiner commented 4 years ago

@msalle

Whether it's still needed I don't know.

Well, if it wasn't changed, I assume it's still a problem, also because of the description in https://github.com/globus/globus-toolkit/pull/63. If the deadlock behaviour can be easily reproduced, we can just compare the results of (1) using the unmodified source and (2) using a "patched" source, to determine if it's needed or not.

fscheiner commented 3 years ago

@msalle I seem to remember observing globus-gridftp-server processes that keep running w/o producing any load after past transfers were finished. This could be related to the deadlocks described in https://github.com/globus/globus-toolkit/pull/63. So I will try to reproduce this behaviour with the current GCT code locally and then compare to the behaviour with @bbockelm's patch included.

ellert commented 2 years ago

The lock is already held during fork().

The fork is done in globus_l_gfs_spawn_child(): https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L439-L443 https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L466 https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L561

And though this function does not lock the mutex, it is called from

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L1154-L1165 https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L1198 https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L1270-L1273

I.e. the lock is already held when globus_l_gfs_spawn_child() is called. There is no need to lock it again in this function. Locking the same mutex again from the same thread would be undefined behaviour.

ellert commented 2 years ago

Never mind - I now realise that the original PR was for a different mutex than this one. Please disregard my comment.

ellert commented 2 years ago

Hi again.

I looked at this as part of preparing my "fix some compiler warnings" PR #179. I now think that my comment above was not so off the mark after all. In the source file in question - gridftp/server/src/globus_gridftp_server.c - the mutex globus_l_gfs_xio_server->mutex is never locked or released. All manipulations of globus_l_gfs_xio_server (assignments, function calls, ...) are (or at least are supposed to be) protected by holding the global mutex globus_l_gfs_mutex. So any locking of mutex globus_l_gfs_xio_server->mutex should not be able to persist if the global mutex is released, and hence my comment above is valid.

However, there is a flaw to this logic - which I think is a bug. For some reason, the lock is released here:

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L538

inside the globus_l_gfs_spawn_child() function, even tough the logic demands that the mutex should be locked for the duration of this function. The logic demands that the mutex stays locked after returning from the call to globus_l_gfs_spawn_child() and is released by the call to globus_mutex_unlock on line 1270:

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L1192-L1286

As can be seen above, there are some function calls after the return from globus_l_gfs_spawn_child() than uses globus_l_gfs_xio_server, that could potentially screw up the locking on mutex globus_l_gfs_xio_server->mutex without the global lock being locked due to the early release of the lock on line 538.

This unlock also makes no sense for a different reason. Every other call to globus_xio_register_close() is done while the mutex is locked. Here the mutex is released just before calling this function, which makes the call questionable for a different reason.

Dropping the call to release the lock early would fix this.

The direct call to globus_l_gfs_close_cb() in case the callback registration fails must then be handled in some way, since the callback tries to acquire the lock. This could be

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_l_gfs_close_cb(handle, result, NULL);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

But the callback function is: https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L583-L601

So this would be equivalent to:

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
            globus_i_gfs_connection_closed();
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

i.e., one could just write:

        if(result != GLOBUS_SUCCESS)
        {
            globus_i_gfs_connection_closed();
        }        

Which would be consistent with the rest of the code in the file, since in other places where, if a callback registration fails in a place where the lock is held, the backup is to execute the code of the callback without the lock/unlock, e.g.:

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L905-L920

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L866-L885

Coincidently, this change is proposed in the PR #179: https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8

fscheiner commented 2 years ago

https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L538

This came in some 18 years ago. Prior to that, the child was spawned before the locking of globus_l_gfs_mutex had happened (see https://github.com/gridcf/gct/blob/7e6fa4eb6c60327fa663c7801461d72ea5e2ff0e/gridftp/server/src/globus_gridftp_server.c#L435-L462 ...for details). Not sure what the reasons were to change that back then.


But is globus_l_gfs_mutex for the parent the same mutex as globus_l_gfs_xio_server->mutex for the spawned child? I ask, because if not, I don't see a chance for a deadlock in the child. But as @bbockelm detected one in https://github.com/globus/globus-toolkit/pull/63, there must have been one in the code (at least back then).

Won't there be a real chance for a deadlock for the parent in https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L539-L543 ...,after removing the globus_mutex_unlock() in line 538, because inside the call in line 539 there happens a globus_mutex_lock() on handle->context->mutex (see https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/xio/src/globus_xio_handle.c#L2604) which - if this mutex is the same mutex as globus_l_gfs_mutex - can't succeed until globus_l_gfs_mutex is unlocked in line 1270.

UPDATE: But maybe not, because if both were the same, the deadlock would - without any changes to line 538 above - already happen in the call to globus_xio_register_open() in https://github.com/gridcf/gct/blob/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a/gridftp/server/src/globus_gridftp_server.c#L1180 (which itself calls globus_l_xio_register_open() which then calls globus_mutex_lock(&handle->context->mutex) before the call to globus_l_gfs_spawn_child() happens).

I'll check if the changes from #179 show any differences in behaviour for the Globus GridFTP server compared to the current code state (@https://github.com/gridcf/gct/commit/da14279fd3738d3b820e2aa5e8dc0dc1630c7a3a). I still wonder how the mentioned deadlocks could happen.

fscheiner commented 2 years ago

I assume this is no longer relevant or needed with the inclusion of #179 (specifically https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8).