tensorflow / community

Stores documents used by the TensorFlow developer community

RFC: Setting GRPC_FAIL_FAST to use_caller by Default #355

Closed (haoyuz closed this 3 years ago)

haoyuz commented 3 years ago

The feedback phase will be open for two weeks, until 2021-03-18.

Status: Approved
RFC #: 355
Author(s): Haoyu Zhang (haoyuzhang@google.com)
Sponsor: Bramandia Ramadhana (bramandia@google.com)
Updated: 2021-03-04

Objective

We propose to set the default value of the GRPC_FAIL_FAST environment variable to use_caller. This change prevents TensorFlow distributed jobs from hanging indefinitely due to task failures, and allows users and TF libraries (e.g., distribution strategies) to handle the connection errors for better failure and preemption recovery.
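To make the setting concrete, here is a minimal sketch of how a launcher script might pin the variable explicitly rather than rely on the default. The helper name `configure_grpc_fail_fast` is hypothetical, and the accepted values other than `use_caller` (`true`, `false`) are an assumption, since this summary names only `use_caller`.

```python
import os

# Assumed set of accepted values; only "use_caller" is named in the
# RFC summary above.
_ASSUMED_VALUES = ("use_caller", "true", "false")


def configure_grpc_fail_fast(value="use_caller"):
    """Set GRPC_FAIL_FAST for this process (hypothetical helper)."""
    if value not in _ASSUMED_VALUES:
        raise ValueError(f"unexpected GRPC_FAIL_FAST value: {value!r}")
    os.environ["GRPC_FAIL_FAST"] = value
    return value


# Set the variable early, before any distributed runtime is started
# (assumed to be the point at which the variable is read).
configure_grpc_fail_fast()
```

Setting the variable in the launcher instead of in the shell keeps the choice visible alongside the job's code; whether that is preferable depends on how jobs are deployed.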

ematejska commented 3 years ago

This has been approved.