We propose to set the default value of the GRPC_FAIL_FAST environment variable to use_caller. This change prevents TensorFlow distributed jobs from hanging indefinitely due to task failures, and allows users and TF libraries (e.g., distribution strategies) to handle the connection errors for better failure and preemption recovery.
The feedback phase will be open for two weeks, until 2021-03-18.
Objective
We propose to set the default value of the GRPC_FAIL_FAST environment variable to use_caller. This change prevents TensorFlow distributed jobs from hanging indefinitely due to task failures, and allows users and TF libraries (e.g., distribution strategies) to handle the connection errors for better failure and preemption recovery.
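Until the new default is in place, a job could opt in by setting the variable itself. The following is a minimal sketch, not part of the proposal: the helper name ensure_fail_fast is hypothetical, and it assumes the variable is read when TensorFlow sets up its gRPC layer, so the helper would need to run before TensorFlow is imported or a cluster is initialized.

```python
import os

# Name of the environment variable discussed in this proposal.
FAIL_FAST_VAR = "GRPC_FAIL_FAST"


def ensure_fail_fast(default="use_caller"):
    """Set GRPC_FAIL_FAST unless the user already chose a value.

    Hypothetical helper: returns the value that is in effect afterwards.
    Assumption: call this before TensorFlow creates any gRPC channels.
    """
    # setdefault leaves an existing value untouched, so an explicit
    # user override (for example "false") is respected.
    return os.environ.setdefault(FAIL_FAST_VAR, default)


if __name__ == "__main__":
    print(FAIL_FAST_VAR, "=", ensure_fail_fast())
```

Using setdefault rather than plain assignment keeps any value exported in the launching shell, which mirrors the idea of changing only the default rather than forcing the behavior.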