A difficult to reproduce issue is made easy to investigate by Jason's work in this branch: fixme-inputdtype.
A relatively trivial error in the input tensor's DType lead to computation to stall in about 75% of executions. In about 25% of cases it produces the expected type error.
I believe the error stems from the current implementation of AsyncSessionHandle::join_on_first_error. However, this requires more investigation into the exact cause.
The culprit is this loop:
for handle in session_handles {
let result = rt.block_on(handle.join_on_first_error());
The error happens on just one handle - the one with the input. The other two would wait. And it is not 75% I thought it was. It is a 66%!
If the first session handle we block on happens to be the one with the InputOp - we error out as expected.
If the not - we hang.
The good news is that it only affects the AsyncTextExecutor used in python bindings. Runner is fine - since it only runs one role per process.
To fix it we should switch that loop to FuturesUnordered as well.
The code inside join_on_first_error is not a problem, but we should still switch it to FuturesUnordered for a more efficient implementation.
A difficult to reproduce issue is made easy to investigate by Jason's work in this branch:
fixme-inputdtype
.A relatively trivial error in the input tensor's DType lead to computation to stall in about 75% of executions. In about 25% of cases it produces the expected type error.
I believe the error stems from the current implementation of
AsyncSessionHandle::join_on_first_error
. However, this requires more investigation into the exact cause.The culprit is this loop:
The error happens on just one handle - the one with the input. The other two would wait. And it is not 75% I thought it was. It is a 66%! If the first session handle we block on happens to be the one with the InputOp - we error out as expected. If the not - we hang.
The good news is that it only affects the AsyncTextExecutor used in python bindings. Runner is fine - since it only runs one role per process.
To fix it we should switch that loop to FuturesUnordered as well.
The code inside
join_on_first_error
is not a problem, but we should still switch it toFuturesUnordered
for a more efficient implementation.