ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Fix Asyncio Event Metrics on Java #14715

Open ijrsvt opened 3 years ago

ijrsvt commented 3 years ago

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

ijrsvt commented 3 years ago

cc @iycheng Can you add some of your findings here?

rkooo567 commented 3 years ago

cc @ericl @clarkzinzow @iycheng

fishbone commented 3 years ago

It crashed in Java's failure test:

C  [libcore_worker_library_java.so+0xa4be41]  absl::lts_2019_08_08::Mutex::Lock()+0x1
C  [libcore_worker_library_java.so+0x8669b1]  boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)+0x2c1
C  [libcore_worker_library_java.so+0x8673b1]  boost::asio::detail::scheduler::run(boost::system::error_code&)+0xd1
C  [libcore_worker_library_java.so+0x868b7f]  boost::asio::io_context::run()+0x2f
C  [libcore_worker_library_java.so+0x321b44]  ray::CoreWorker::RunIOService()+0xe4

In this test, System.exit is called, which tears down the main thread.

My guess is that the io_worker is running in a separate thread and using resources owned by the main thread, which could be what leads to this segfault. Although we have functions to join these threads, they aren't called in this case.

Maybe we can unblock signals for these threads to make sure they receive the signal too, or make the resources owned by the thread itself?

ericl commented 3 years ago

Is the issue that the stats data structure is getting deallocated before all callbacks have finished? A workaround might be to move that into its own data structure referenced from instrumented_io_context by a shared_ptr. Each of the posted closures should also capture a reference to that shared ptr, keeping it alive until all callbacks have finished.

clarkzinzow commented 3 years ago

@ericl Each handler stats table item is already behind a shared pointer for that reason: the callbacks only need to reference that single handler stats struct (no table lookups needed), and each closure captures a copy of that shared pointer, so the struct should stay alive until all such callbacks have finished.

ericl commented 3 years ago

I mean including the mutex in that structure, since the deallocated mutex access is crashing.


clarkzinzow commented 3 years ago

No such safety exists for the global stats struct, which is referenced in the handlers and could possibly be the issue.

clarkzinzow commented 3 years ago

I mean including the mutex in that structure, since the deallocated mutex access is crashing.

The handler stats table mutex isn't accessed in the handler, only the mutexes for the individual handler stats structs and the global stats struct are accessed in the handler. The former is behind a shared pointer and should be fine, but the latter is just a plain struct on the object, which could be the issue.

ericl commented 3 years ago

Exactly, we shouldn't be accessing that mutex. Can we move that mutex into the struct behind the shared pointer so its lifetime is refcounted?


clarkzinzow commented 3 years ago

@ericl Are you referring to the table-level mutex or the global stats struct mutex? I'm opening a PR right now that properly moves the latter behind a shared pointer. I thought you were referring to the table-level mutex above, where "stats data structure" meant the handler stats table, and that you were suggesting we put the handler stats table and its mutex under a shared pointer. That shouldn't be necessary, since the handlers never do table lookups or touch the table-level mutex.

ericl commented 3 years ago
  /// Protects access to the per-handler post stats table.
  mutable absl::Mutex mutex_;

^ That mutex.

clarkzinzow commented 3 years ago

That mutex is never locked in the callbacks, since it's a table-level mutex and we don't look up any table items (handler stats) in the callback. We pass the one handler stats struct that we need into the handler/callback; that struct is reference counted (behind a shared pointer) and protected by its own mutex. So I don't see why we would want to put the table-level mutex and the table behind a shared pointer.