ijrsvt opened this issue 3 years ago
cc @iycheng Can you add some of your findings here?
cc @ericl @clarkzinzow @iycheng
It crashed in Java's failure test:
```
C [libcore_worker_library_java.so+0xa4be41] absl::lts_2019_08_08::Mutex::Lock()+0x1
C [libcore_worker_library_java.so+0x8669b1] boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)+0x2c1
C [libcore_worker_library_java.so+0x8673b1] boost::asio::detail::scheduler::run(boost::system::error_code&)+0xd1
C [libcore_worker_library_java.so+0x868b7f] boost::asio::io_context::run()+0x2f
C [libcore_worker_library_java.so+0x321b44] ray::CoreWorker::RunIOService()+0xe4
```
In this test, System.exit is called, which tears down the main thread.

My guess is that the io_worker runs in a separate thread and uses resources owned by the main thread, which may be what leads to this segfault. Although we have functions to join these threads, they are not called in this case.

Maybe we can unblock signals for these threads to make sure they receive the signal too. Or make the resources owned by the thread that uses them?
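A minimal sketch of the suspected race, with hypothetical names (this is not Ray's actual code, just an illustration of the shutdown hazard described above):

```cpp
#include <thread>
#include <boost/asio.hpp>

// Hypothetical illustration: the main thread exits and destroys state
// that the io service thread is still using.
struct Stats {
  long handled = 0;  // imagine this guarded by a mutex, as in Ray's stats
};

int main() {
  boost::asio::io_context io;
  auto guard = boost::asio::make_work_guard(io);  // keeps run() from returning
  Stats stats;  // owned by the main thread's stack

  boost::asio::post(io, [&stats] { ++stats.handled; });
  std::thread io_thread([&io] { io.run(); });
  io_thread.detach();  // never joined -- analogous to the System.exit path

  // main returns immediately: `stats` and `io` are destroyed while
  // io_thread may still be inside a handler -> potential use-after-free.
  return 0;
}
```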
Is the issue that the stats data structure is getting deallocated before all callbacks have finished? A workaround might be to move it into its own data structure referenced from instrumented_io_context by a shared_ptr. Each of the posted closures should also capture a copy of that shared_ptr, keeping it alive until all callbacks have finished.
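A minimal sketch of that workaround, with hypothetical names (not the actual instrumented_io_context code):

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <boost/asio.hpp>
#include "absl/synchronization/mutex.h"

// Hypothetical stats record: the mutex lives inside the refcounted struct,
// so it stays valid for as long as any posted closure holds the pointer.
struct HandlerStats {
  absl::Mutex mutex;
  int64_t completed = 0;
};

void PostInstrumented(boost::asio::io_context &io,
                      std::shared_ptr<HandlerStats> stats,
                      std::function<void()> handler) {
  // Capture the shared_ptr by value: the stats struct (and its mutex)
  // cannot be deallocated until this closure has run and been destroyed.
  boost::asio::post(io, [stats, handler = std::move(handler)] {
    handler();
    absl::MutexLock lock(&stats->mutex);
    ++stats->completed;
  });
}
```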
@ericl Each handler stats table item is already a shared pointer for that reason: the callbacks only need to reference that single handler stats struct (no table lookups needed), so the closure captures a copy of that shared pointer, which should keep the struct alive until all such callbacks have finished.
I mean including the mutex in that structure, since the crash is an access to a deallocated mutex.
No such safety exists for the global stats struct, which is referenced in the handlers and could possibly be the issue.
> I mean including the mutex in that structure, since the crash is an access to a deallocated mutex.

The handler stats table mutex isn't accessed in the handler; only the mutexes for the individual handler stats structs and for the global stats struct are. The former is behind a shared pointer and should be fine, but the latter is a plain struct member on the object, which could be the issue.
Exactly, we shouldn't be accessing that mutex. Can we move that mutex into the struct inside the shared_ptr so its lifetime is refcounted?
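Roughly what that fix looks like, with hypothetical names (the real change is in the PR mentioned below):

```cpp
#include <cstdint>
#include <memory>
#include "absl/synchronization/mutex.h"

// Hypothetical global stats struct: the mutex moves inside the refcounted
// struct instead of living as a plain member of the io context wrapper.
struct GlobalStats {
  absl::Mutex mutex;
  int64_t queued = 0;
};

class instrumented_io_context_like {
 public:
  instrumented_io_context_like()
      : global_stats_(std::make_shared<GlobalStats>()) {}

  // Handlers capture this shared_ptr by value, so the struct (and its
  // mutex) outlives the context if callbacks are still in flight.
  std::shared_ptr<GlobalStats> global_stats() const { return global_stats_; }

 private:
  std::shared_ptr<GlobalStats> global_stats_;
};
```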
@ericl Are you referring to the table-level mutex or to the global stats struct's mutex? I'm opening a PR right now that properly moves the latter behind a shared pointer, but I thought you were referring to the table-level mutex above, where "stats data structure" meant the handler stats table, and that you were suggesting we put the handler stats table plus its mutex behind a shared pointer, which shouldn't be necessary since the handlers never do table lookups or touch the table-level mutex.
```cpp
/// Protects access to the per-handler post stats table.
mutable absl::Mutex mutex_;
```

^ That mutex.
That mutex is never locked in the callbacks, since it's a table-level mutex and we don't look up any table items (handler stats) in the callback. We pass the one handler stats item that we need into the handler/callback, and that item is reference counted (a shared pointer) and protected by its own mutex. So I don't see why we'd want to put the table-level mutex and the table behind a shared pointer.