tf-encrypted / moose

Secure distributed dataflow framework for encrypted machine learning and data processing
Apache License 2.0
58 stars 16 forks source link

AsyncSession execition may hang instead of halt on error #1022

Closed voronaam closed 2 years ago

voronaam commented 2 years ago

A difficult to reproduce issue is made easy to investigate by Jason's work in this branch: fixme-inputdtype.

A relatively trivial error in the input tensor's DType lead to computation to stall in about 75% of executions. In about 25% of cases it produces the expected type error.

I believe the error stems from the current implementation of AsyncSessionHandle::join_on_first_error. However, this requires more investigation into the exact cause.

The culprit is this loop:

for handle in session_handles {
    let result = rt.block_on(handle.join_on_first_error());

The error happens on just one handle - the one with the input. The other two would wait. And it is not 75% I thought it was. It is a 66%! If the first session handle we block on happens to be the one with the InputOp - we error out as expected. If the not - we hang.

The good news is that it only affects the AsyncTextExecutor used in python bindings. Runner is fine - since it only runs one role per process.

To fix it we should switch that loop to FuturesUnordered as well.

The code inside join_on_first_error is not a problem, but we should still switch it to FuturesUnordered for a more efficient implementation.